pith · machine review for the scientific record

arxiv: 2503.21755 · v2 · submitted 2025-03-27 · 💻 cs.CV

Recognition: 3 theorem links

VBench-2.0: Advancing Video Generation Benchmark Suite for Intrinsic Faithfulness

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 18:37 UTC · model grok-4.3

classification 💻 cs.CV
keywords video generation · benchmark · intrinsic faithfulness · physics evaluation · commonsense reasoning · human fidelity · controllability · anomaly detection

The pith

VBench-2.0 introduces a benchmark that tests video generation models for intrinsic faithfulness to physical laws, human anatomy, and commonsense.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that prior video benchmarks only check superficial qualities such as visual appeal and basic prompt adherence. It introduces VBench-2.0 to measure deeper intrinsic faithfulness across five dimensions: Human Fidelity, Controllability, Creativity, Physics, and Commonsense. Each dimension includes fine-grained capabilities scored automatically by vision-language models, language models, and anomaly detection methods. Human annotations validate that these scores align with human judgment. The work targets the next step toward video models that function as reliable world simulators rather than merely plausible image sequences.

Core claim

VBench-2.0 is a benchmark suite designed to automatically evaluate video generative models for intrinsic faithfulness, meaning adherence to physical laws, commonsense reasoning, anatomical correctness, and compositional integrity. It organizes evaluation into five dimensions—Human Fidelity, Controllability, Creativity, Physics, and Commonsense—each subdivided into targeted capabilities. The framework applies tailored combinations of state-of-the-art vision-language models, large language models, and specialist anomaly detection techniques, all cross-checked against extensive human annotations.

What carries the argument

The VBench-2.0 framework, which integrates generalist VLMs and LLMs with video-specific anomaly-detection methods to score fine-grained capabilities within each faithfulness dimension.
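The paper does not give the aggregation in code; purely as an illustration, here is a minimal sketch of how per-judge scores from generalists and specialists might be combined into per-dimension and overall scores. All function names, score values, and the equal-weight averaging are hypothetical, not the authors' method.

```python
# Hypothetical sketch of a VBench-2.0-style aggregator: each dimension
# pools scores from several judges (VLMs, LLMs, anomaly detectors).
# Equal weighting and all values below are illustrative assumptions.
from statistics import mean

DIMENSIONS = ["Human Fidelity", "Controllability", "Creativity",
              "Physics", "Commonsense"]

def score_video(judge_scores):
    """judge_scores maps dimension -> list of per-judge scores in [0, 1]."""
    per_dim = {d: mean(judge_scores[d]) for d in DIMENSIONS}
    overall = mean(per_dim.values())  # naive unweighted average
    return per_dim, overall

per_dim, overall = score_video({
    "Human Fidelity": [0.8, 0.7],
    "Controllability": [0.9],
    "Creativity": [0.6, 0.5],
    "Physics": [0.4, 0.3, 0.5],   # specialists tend to score physics lower
    "Commonsense": [0.7],
})
```

A real pipeline would weight judges by dimension and calibrate against the human annotations the paper describes; the sketch only shows the shape of the aggregation.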

If this is right

  • Models achieving high VBench-2.0 scores should support more reliable AI-assisted filmmaking and simulated world modeling.
  • The five-dimension breakdown enables targeted model improvements in areas such as physics adherence or human anatomical accuracy.
  • Automatic metrics validated by human annotations can serve as scalable proxies for ongoing model development.
  • Progress on intrinsic faithfulness metrics marks a shift from visually coherent outputs to fundamentally realistic video generation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Widespread adoption could steer video model training toward explicit rule-violation penalties rather than only aesthetic rewards.
  • The same dimension structure might transfer to benchmarks for other generative domains such as 3D scene synthesis or interactive simulation.
  • Integration with embodied AI systems could use VBench-2.0 scores to predict how well generated videos translate into accurate planning or control signals.

Load-bearing premise

The combination of current top vision-language models, language models, and anomaly detectors can detect violations of physics and commonsense rules without missing subtle failures or introducing new evaluation biases.

What would settle it

A test set of generated videos where human raters consistently flag clear physics or commonsense violations that the automated VBench-2.0 scores rate as acceptable, or where the benchmark flags problems that humans accept.
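Such a settling test amounts to collecting the disagreement set between human flags and automated scores. A toy sketch of that audit, with a hypothetical acceptance threshold and made-up records:

```python
# Toy disagreement audit: find videos where human raters flag a
# violation but the automated score rates the video acceptable
# (missed failures), and the converse (false alarms).
# The threshold and all records below are hypothetical.
ACCEPT_THRESHOLD = 0.5  # auto score above this counts as "acceptable"

videos = [
    {"id": "v1", "auto_score": 0.8, "human_flagged": True},   # missed failure
    {"id": "v2", "auto_score": 0.3, "human_flagged": False},  # false alarm
    {"id": "v3", "auto_score": 0.9, "human_flagged": False},  # agreement
]

missed = [v["id"] for v in videos
          if v["human_flagged"] and v["auto_score"] > ACCEPT_THRESHOLD]
false_alarms = [v["id"] for v in videos
                if not v["human_flagged"] and v["auto_score"] <= ACCEPT_THRESHOLD]
```

A non-empty `missed` set on clear physics violations is exactly the failure mode that would undercut the benchmark's central reliability claim.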

Original abstract

Video generation has advanced significantly, evolving from producing unrealistic outputs to generating videos that appear visually convincing and temporally coherent. To evaluate these video generative models, benchmarks such as VBench have been developed to assess their faithfulness, measuring factors like per-frame aesthetics, temporal consistency, and basic prompt adherence. However, these aspects mainly represent superficial faithfulness, which focus on whether the video appears visually convincing rather than whether it adheres to real-world principles. While recent models perform increasingly well on these metrics, they still struggle to generate videos that are not just visually plausible but fundamentally realistic. To achieve real "world models" through video generation, the next frontier lies in intrinsic faithfulness to ensure that generated videos adhere to physical laws, commonsense reasoning, anatomical correctness, and compositional integrity. Achieving this level of realism is essential for applications such as AI-assisted filmmaking and simulated world modeling. To bridge this gap, we introduce VBench-2.0, a next-generation benchmark designed to automatically evaluate video generative models for their intrinsic faithfulness. VBench-2.0 assesses five key dimensions: Human Fidelity, Controllability, Creativity, Physics, and Commonsense, each further broken down into fine-grained capabilities. Tailored to individual dimensions, our evaluation framework integrates generalists such as SOTA VLMs and LLMs, and specialists, including anomaly detection methods proposed for video generation. We conduct extensive human annotations to ensure evaluation alignment with human judgment. By pushing beyond superficial faithfulness toward intrinsic faithfulness, VBench-2.0 aims to set a new standard for the next generation of video generative models in pursuit of intrinsic faithfulness.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The manuscript introduces VBench-2.0, a next-generation benchmark for video generative models that shifts evaluation from superficial faithfulness (aesthetics, temporal consistency, prompt adherence) to intrinsic faithfulness. It defines five dimensions—Human Fidelity, Controllability, Creativity, Physics, and Commonsense—each decomposed into fine-grained capabilities, and proposes an automated evaluation pipeline that combines SOTA VLMs, LLMs, and anomaly detectors, with alignment ensured via extensive human annotations.

Significance. If the automated evaluators prove reliable, VBench-2.0 would provide a scalable, reproducible standard that pushes video generation research toward genuine world-modeling capabilities rather than visually convincing but physically implausible outputs. The explicit human-validation step and dimension-specific specialist modules are concrete strengths that could accelerate progress in AI-assisted filmmaking and simulated environments.

major comments (2)
  1. [§4.3] (Physics dimension evaluation): the claim that the anomaly-detection + VLM pipeline reliably identifies subtle violations (e.g., incorrect object trajectories under gravity or implausible causal sequences) rests on correlation with human raters on a validation subset; this does not demonstrate coverage of edge cases where the same VLMs exhibit known reasoning failures, leaving the central reliability claim under-supported.
  2. [Table 5] (human-automated alignment): the reported Pearson correlations for the Commonsense and Physics dimensions are computed on prompts that appear coarse-grained; without an explicit stress-test on fine-grained violation prompts, it is unclear whether high alignment on the validation set generalizes to the full benchmark distribution.
minor comments (3)
  1. [§2.1] The distinction between “superficial” and “intrinsic” faithfulness is introduced without a formal definition or reference to prior literature on physical commonsense in video; a short clarifying paragraph would improve precision.
  2. [Figure 3] Axis labels on the radar plots for the five dimensions are too small to read in print; increasing font size or adding a legend table would aid clarity.
  3. [§5.2] Several citations to the original VBench paper are given only by name without year or arXiv identifier; adding full references would help readers locate the baseline metrics.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for minor revision. We address each major comment below, explaining our response and the revisions we will incorporate.

Point-by-point responses
  1. Referee: [§4.3] (Physics dimension evaluation): the claim that the anomaly-detection + VLM pipeline reliably identifies subtle violations (e.g., incorrect object trajectories under gravity or implausible causal sequences) rests on correlation with human raters on a validation subset; this does not demonstrate coverage of edge cases where the same VLMs exhibit known reasoning failures, leaving the central reliability claim under-supported.

    Authors: We appreciate the referee's point that correlation on a validation subset does not automatically guarantee coverage of all VLM reasoning failure modes in edge cases. Our human annotation protocol was designed to include diverse physical violation scenarios, and the specialist anomaly detectors were introduced precisely to compensate for known VLM limitations in causal and trajectory reasoning. In the revision we will add an explicit limitations paragraph in §4.3 together with qualitative examples of edge cases where the combined pipeline succeeds or fails, thereby making the reliability argument more transparent. This is a partial revision because we build on the existing human-validated data rather than collecting new annotations. revision: partial

  2. Referee: [Table 5] (human-automated alignment): the reported Pearson correlations for the Commonsense and Physics dimensions are computed on prompts that appear coarse-grained; without an explicit stress-test on fine-grained violation prompts, it is unclear whether high alignment on the validation set generalizes to the full benchmark distribution.

    Authors: We acknowledge that the prompts used for the reported correlations may appear predominantly coarse-grained. To demonstrate generalization, we will add a new appendix section containing a stress-test on a curated set of fine-grained violation prompts (drawn from the same human-annotation pool) and will report the updated Pearson correlations for both dimensions. This addition directly addresses the concern and will be included in the revised manuscript. revision: yes
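The alignment evidence debated above reduces to a Pearson correlation between human and automated scores on the same videos. A self-contained sketch of that computation, with made-up score lists standing in for the paper's annotation data:

```python
# Pearson correlation between human and automated scores, computed
# from scratch. The two score lists below are fabricated solely to
# illustrate the alignment check; they are not the paper's data.
import math

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

human = [0.9, 0.2, 0.6, 0.8, 0.3]   # hypothetical human ratings
auto  = [0.85, 0.3, 0.55, 0.9, 0.25]  # hypothetical automated scores
r = pearson(human, auto)  # r near 1 indicates human-automated alignment
```

The referee's point is that a high r on coarse-grained prompts does not guarantee a high r on a stress-test set of fine-grained violation prompts, which is why the promised appendix matters.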

Circularity Check

0 steps flagged

No significant circularity; benchmark relies on external VLMs/LLMs with human validation

Full rationale

The paper introduces VBench-2.0 as an evaluation framework for video generation models across five dimensions (Human Fidelity, Controllability, Creativity, Physics, Commonsense). It integrates pre-existing SOTA VLMs, LLMs, and anomaly detectors, then aligns them via human annotations. No mathematical derivations, fitted parameters, or predictions appear in the provided text. The reference to the prior 'VBench' work is a minor contextual citation and not load-bearing for the central claims, which consist of new dimension definitions and an external evaluation pipeline. No self-definitional loops, fitted-input predictions, or ansatz smuggling are present. The construction is validated against external benchmarks and human judgment rather than against its own outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work introduces no new mathematical derivations, free parameters, or invented physical entities; it relies on existing pre-trained VLMs and LLMs plus human annotation protocols as evaluation tools.

pith-pipeline@v0.9.0 · 5626 in / 1110 out tokens · 32570 ms · 2026-05-14T18:37:01.650654+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • LawOfExistence defect_zero_iff_one · tag: echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    To achieve real “world models” through video generation, the next frontier lies in intrinsic faithfulness to ensure that generated videos adhere to physical laws, commonsense reasoning, anatomical correctness, and compositional integrity.

What do these tags mean?
matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 26 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. PhysInOne: Visual Physics Learning and Reasoning in One Suite

    cs.CV 2026-04 unverdicted novelty 8.0

    PhysInOne is a new dataset of 2 million videos across 153,810 dynamic 3D scenes covering 71 physical phenomena, shown to improve AI performance on physics-aware video generation, prediction, property estimation, and m...

  2. EntityBench: Towards Entity-Consistent Long-Range Multi-Shot Video Generation

    cs.CV 2026-05 conditional novelty 7.0

    EntityBench is a new benchmark with detailed per-shot entity schedules from real media, and the EntityMem baseline using persistent per-entity memory achieves the highest character fidelity with Cohen's d of +2.33.

  3. KVPO: ODE-Native GRPO for Autoregressive Video Alignment via KV Semantic Exploration

    cs.CV 2026-05 unverdicted novelty 7.0

    KVPO aligns streaming autoregressive video generators with human preferences via ODE-native GRPO, using KV cache for semantic exploration and TVE for velocity-based policy modeling, yielding gains in quality and alignment.

  4. PhyGround: Benchmarking Physical Reasoning in Generative World Models

    cs.CV 2026-05 accept novelty 7.0

    PhyGround is a new benchmark with curated prompts, a 13-law taxonomy, large-scale human annotations, and an open physics-specialized VLM judge for evaluating physical reasoning in generative video models.

  5. WorldJen: An End-to-End Multi-Dimensional Benchmark for Generative Video Models

    cs.CV 2026-05 unverdicted novelty 7.0

    WorldJen is a new benchmark for generative video models that uses VLM-judged multi-dimensional Likert questionnaires validated against human preferences to achieve perfect tier agreement.

  6. Hallo-Live: Real-Time Streaming Joint Audio-Video Avatar Generation with Asynchronous Dual-Stream and Human-Centric Preference Distillation

    cs.CV 2026-04 unverdicted novelty 7.0

    Hallo-Live achieves 20.38 FPS real-time text-to-audio-video avatar generation with 0.94s latency using asynchronous dual-stream diffusion and HP-DMD preference distillation, matching teacher model quality at 16x highe...

  7. RoboWM-Bench: A Benchmark for Evaluating World Models in Robotic Manipulation

    cs.RO 2026-04 unverdicted novelty 7.0

    RoboWM-Bench evaluates video world models by converting their outputs into executable robot actions and running them on manipulation tasks, showing that physical inconsistencies remain common.

  8. RoboWM-Bench: A Benchmark for Evaluating World Models in Robotic Manipulation

    cs.RO 2026-04 unverdicted novelty 7.0

    RoboWM-Bench evaluates video world models by converting their manipulation video predictions into executable actions validated in simulation, showing that visual plausibility does not guarantee physical executability.

  9. AnimationBench: Are Video Models Good at Character-Centric Animation?

    cs.CV 2026-04 unverdicted novelty 7.0

    AnimationBench is the first benchmark that operationalizes the twelve basic principles of animation and IP preservation into scalable, VLM-assisted metrics for animation-style I2V generation.

  10. Grounded Forcing: Bridging Time-Independent Semantics and Proximal Dynamics in Autoregressive Video Synthesis

    cs.CV 2026-04 unverdicted novelty 7.0

    Grounded Forcing introduces dual memory caching, reference-based positional embeddings, and proximity-weighted recaching to bridge stable semantics with local dynamics, improving long-range consistency in autoregressi...

  11. Delta Forcing: Trust Region Steering for Interactive Autoregressive Video Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    Delta Forcing uses latent trajectory deltas to adaptively limit unreliable teacher guidance while enforcing monotonic continuity, improving temporal consistency in interactive autoregressive video generation.

  12. PhyMotion: Structured 3D Motion Reward for Physics-Grounded Human Video Generation

    cs.CV 2026-05 conditional novelty 6.0

    PhyMotion scores generated human videos by grounding recovered 3D poses in a physics simulator across kinematic, contact, and dynamic axes, yielding stronger human correlation and larger RL post-training gains than pr...

  13. WorldReasonBench: Human-Aligned Stress Testing of Video Generators as Future World-State Predictors

    cs.CV 2026-05 unverdicted novelty 6.0

    The paper presents WorldReasonBench, a benchmark that tests video generators on maintaining physical, social, logical, and informational consistency when predicting future states from initial conditions and actions.

  14. SARA: Semantically Adaptive Relational Alignment for Video Diffusion Models

    cs.CV 2026-05 unverdicted novelty 6.0

    SARA improves text alignment and motion quality in video diffusion models by routing token-relation distillation supervision to semantically salient pairs using a Stage-1 aligner trained with SAM masks and InfoNCE.

  15. WorldJen: An End-to-End Multi-Dimensional Benchmark for Generative Video Models

    cs.CV 2026-05 conditional novelty 6.0

    WorldJen is a multi-dimensional video generation benchmark using VLM-graded Likert questionnaires on joint prompts, validated to match human three-tier rankings.

  16. HuM-Eval: A Coarse-to-Fine Framework for Human-Centric Video Evaluation

    cs.CV 2026-04 unverdicted novelty 6.0

    HuM-Eval evaluates human motion videos with a coarse-to-fine approach using VLM global checks plus 2D pose and 3D motion analysis, reaching 58.2% average correlation with human judgments and introducing a 1000-prompt ...

  17. Seeing Fast and Slow: Learning the Flow of Time in Videos

    cs.CV 2026-04 unverdicted novelty 6.0

    Self-supervised models learn to perceive and manipulate the flow of time in videos, supporting speed detection, large-scale slow-motion data curation, and temporally controllable video synthesis.

  18. Long-CODE: Isolating Pure Long-Context as an Orthogonal Dimension in Video Evaluation

    cs.CV 2026-04 unverdicted novelty 6.0

    Long-CODE isolates long-context video evaluation with a new benchmark dataset and shot-dynamics metric that correlates better with human judgments on narrative richness and global consistency than short-video metrics.

  19. VGA-Bench: A Unified Benchmark and Multi-Model Framework for Video Aesthetics and Generation Quality Evaluation

    cs.CV 2026-04 unverdicted novelty 6.0

    VGA-Bench creates a three-tier taxonomy, 1,016-prompt dataset of 60k+ videos, and three multi-task neural models (VAQA-Net, VTag-Net, VGQA-Net) that align with human judgments for video aesthetics and generation quality.

  20. ImVideoEdit: Image-learning Video Editing via 2D Spatial Difference Attention Blocks

    cs.CV 2026-04 unverdicted novelty 6.0

    ImVideoEdit learns video editing from 13K image pairs by decoupling spatial modifications from frozen temporal dynamics in pretrained models, matching larger video-trained systems in fidelity and consistency.

  21. Diffusion-APO: Trajectory-Aware Direct Preference Alignment for Video Diffusion Transformers

    cs.CV 2026-05 unverdicted novelty 5.0

    Diffusion-APO synchronizes training noise with inference trajectories in video diffusion models to improve preference alignment and visual quality.

  22. LoViF 2026 The First Challenge on Holistic Quality Assessment for 4D World Model (PhyScore)

    cs.CV 2026-05 conditional novelty 5.0

    The PhyScore challenge creates the first benchmark requiring metrics to jointly score video quality, physical realism, condition alignment, and temporal consistency while localizing physical anomalies in 1554 videos f...

  23. Phantom: Physics-Infused Video Generation via Joint Modeling of Visual and Latent Physical Dynamics

    cs.CV 2026-04 unverdicted novelty 5.0

    Phantom generates visually realistic and physically consistent videos by jointly modeling visual content and latent physical dynamics via an abstract physics-aware representation.

  24. World Action Models: The Next Frontier in Embodied AI

    cs.RO 2026-05 unverdicted novelty 4.0

    The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.

  25. Mamoda2.5: Enhancing Unified Multimodal Model with DiT-MoE

    cs.CV 2026-05 unverdicted novelty 4.0

    Mamoda2.5 is a 25B-parameter DiT-MoE unified AR-Diffusion model that reaches top video generation and editing benchmarks with 4-step inference up to 95.9x faster than baselines.

  26. Evolution of Video Generative Foundations

    cs.CV 2026-04 unverdicted novelty 2.0

    This survey traces video generation technology from GANs to diffusion models and then to autoregressive and multimodal approaches while analyzing principles, strengths, and future trends.

Reference graph

Works this paper leans on

96 extracted references · 96 canonical work pages · cited by 24 Pith papers · 17 internal anchors


  47. [47]

    Taming transformers for high- resolution image synthesis,

    P. Esser, R. Rombach, and B. Ommer, “Taming transformers for high- resolution image synthesis,” in CVPR, 2021

  48. [48]

    SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

    D. Podell, Z. English, K. Lacey, A. Blattmann, T. Dockhorn, J. M ¨uller, J. Penna, and R. Rombach, “Sdxl: Improving latent diffusion models for high-resolution image synthesis,” arXiv preprint arXiv:2307.01952 , 2023

  49. [49]

    Magvit: Masked generative video transformer,

    L. Yu, Y . Cheng, K. Sohn, J. Lezama, H. Zhang, H. Chang, A. G. Hauptmann, M.-H. Yang, Y . Hao, I. Essa et al. , “Magvit: Masked generative video transformer,” in CVPR, 2023

  50. [50]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” arXiv preprint arXiv:2010.11929 , 2020

  51. [51]

    Scalable Diffusion Models with Transformers

    W. Peebles and S. Xie, “Scalable diffusion models with transformers,” arXiv preprint arXiv:2212.09748 , 2022

  52. [52]

    Videofusion: Decomposed diffusion models for high-quality video generation,

    Z. Luo, D. Chen, Y . Zhang, Y . Huang, L. Wang, Y . Shen, D. Zhao, J. Zhou, and T. Tan, “Videofusion: Decomposed diffusion models for high-quality video generation,” in CVPR, 2023. 13

  53. [53]

    Latent Video Diffusion Models for High-Fidelity Long Video Generation

    Y . He, T. Yang, Y . Zhang, Y . Shan, and Q. Chen, “Latent video diffusion models for high-fidelity video generation with arbitrary lengths,” arXiv preprint arXiv:2211.13221, 2022

  54. [54]

    Magicvideo: Efficient video generation with latent diffusion models

    D. Zhou, W. Wang, H. Yan, W. Lv, Y . Zhu, and J. Feng, “Magicvideo: Efficient video generation with latent diffusion models,” arXiv preprint arXiv:2211.11018, 2023

  55. [55]

    Show-1: Marrying pixel and latent diffusion models for text-to-video generation

    D. J. Zhang, J. Z. Wu, J.-W. Liu, R. Zhao, L. Ran, Y . Gu, D. Gao, and M. Z. Shou, “Show-1: Marrying pixel and latent diffusion models for text-to-video generation,” arXiv preprint arXiv:2309.15818 , 2023

  56. [56]

    Preserve your own correlation: A noise prior for video diffusion models,

    S. Ge, S. Nah, G. Liu, T. Poon, A. Tao, B. Catanzaro, D. Jacobs, J.- B. Huang, M.-Y . Liu, and Y . Balaji, “Preserve your own correlation: A noise prior for video diffusion models,” in ICCV, 2023

  57. [57]

    Align your latents: High-resolution video synthesis with latent diffusion models,

    A. Blattmann, R. Rombach, H. Ling, T. Dockhorn, S. W. Kim, S. Fidler, and K. Kreis, “Align your latents: High-resolution video synthesis with latent diffusion models,” in CVPR, 2023

  58. [58]

    Text2video-zero: Text-to- image diffusion models are zero-shot video generators.arXiv preprint arXiv:2303.13439, 2023

    L. Khachatryan, A. Movsisyan, V . Tadevosyan, R. Henschel, Z. Wang, S. Navasardyan, and H. Shi, “Text2video-zero: Text-to-image dif- fusion models are zero-shot video generators,” arXiv preprint arXiv:2303.13439, 2023

  59. [59]

    CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

    Z. Yang, J. Teng, W. Zheng, M. Ding, S. Huang, J. Xu, Y . Yang, W. Hong, X. Zhang, G. Feng et al., “Cogvideox: Text-to-video diffusion models with an expert transformer,” arXiv preprint arXiv:2408.06072 , 2024

  60. [60]

    Movie Gen: A Cast of Media Foundation Models

    A. Polyak, A. Zohar, A. Brown, A. Tjandra, A. Sinha, A. Lee, A. Vyas, B. Shi, C.-Y . Ma, C.-Y . Chuang, D. Yan, D. Choudhary, D. Wang, G. Sethi, G. Pang, H. Ma, I. Misra, J. Hou, J. Wang, K. Jagadeesh, K. Li, L. Zhang, M. Singh, M. Williamson, M. Le, M. Yu, M. K. Singh, P. Zhang, P. Vajda, Q. Duval, R. Girdhar, R. Sumbaly, S. S. Rambhatla, S. Tsai, S. Aza...

  61. [61]

    Wan: Open and advanced large-scale video generative models,

    W. Team, “Wan: Open and advanced large-scale video generative models,” 2025

  62. [62]
  63. [63]

    Team, “Minmax,” Accessed August 31, 2024 [Online] https: //hailuoai.com/, 2023

    M. Team, “Minmax,” Accessed August 31, 2024 [Online] https: //hailuoai.com/, 2023. [Online]. Available: https://hailuoai.com/

  64. [64]

    Vchitect-2.0: Parallel transformer for scaling up video diffusion models,

    W. Fan, C. Si, J. Song, Z. Yang, Y . He, L. Zhuo, Z. Huang, Z. Dong, J. He, D. Pan et al. , “Vchitect-2.0: Parallel transformer for scaling up video diffusion models,” arXiv preprint arXiv:2501.08453 , 2025

  65. [65]

    Repvideo: Rethink- ing cross-layer representation for video generation,

    C. Si, W. Fan, Z. Lv, Z. Huang, Y . Qiao, and Z. Liu, “Repvideo: Rethink- ing cross-layer representation for video generation,” arXiv 2501.08994, 2025

  66. [66]

    Open-sora 2.0: Training a commercial-level video generation model in $200 k

    X. Peng, Z. Zheng, C. Shen, T. Young, X. Guo, B. Wang, H. Xu, H. Liu, M. Jiang, W. Li, Y . Wang, A. Ye, G. Ren, Q. Ma, W. Liang, X. Lian, X. Wu, Y . Zhong, Z. Li, C. Gong, G. Lei, L. Cheng, L. Zhang, M. Li, R. Zhang, S. Hu, S. Huang, X. Wang, Y . Zhao, Y . Wang, Z. Wei, and Y . You, “Open-sora 2.0: Training a commercial-level video generation model in $20...

  67. [67]

    GANs trained by a two time-scale update rule converge to a local nash equilibrium,

    M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter, “GANs trained by a two time-scale update rule converge to a local nash equilibrium,” in NeurIPS, 2017

  68. [68]

    Improved techniques for training gans,

    T. Salimans, I. Goodfellow, W. Zaremba, V . Cheung, A. Radford, X. Chen, and X. Chen, “Improved techniques for training gans,” in NeurIPS, 2016

  69. [69]

    FVD: A new metric for video generation,

    T. Unterthiner, S. van Steenkiste, K. Kurach, R. Marinier, M. Michalski, and S. Gelly, “FVD: A new metric for video generation,” in ICLRW, 2019

  70. [70]

    Fetv: A benchmark for fine-grained evaluation of open-domain text-to-video generation,

    Y . Liu, L. Li, S. Ren, R. Gao, S. Li, S. Chen, X. Sun, and L. Hou, “Fetv: A benchmark for fine-grained evaluation of open-domain text-to-video generation,” in NeurIPS, 2023

  71. [71]

    Evaluation agent: Efficient and promptable evaluation framework for visual generative models,

    F. Zhang, S. Tian, Z. Huang, Y . Qiao, and Z. Liu, “Evaluation agent: Efficient and promptable evaluation framework for visual generative models,” arXiv preprint arXiv:2412.09645 , 2024

  72. [72]

    Towards world simulator: Crafting physical commonsense-based benchmark for video generation.arXiv preprint arXiv:2410.05363,

    F. Meng, J. Liao, X. Tan, W. Shao, Q. Lu, K. Zhang, Y . Cheng, D. Li, Y . Qiao, and P. Luo, “Towards world simulator: Crafting physical commonsense-based benchmark for video generation,” arXiv preprint arXiv:2410.05363, 2024

  73. [73]

    T2V- CompBench: A comprehensive benchmark for compositional text-to-video generation.arXiv preprint arXiv:2407.14505,

    K. Sun, K. Huang, X. Liu, Y . Wu, Z. Xu, Z. Li, and X. Liu, “T2v- compbench: A comprehensive benchmark for compositional text-to- video generation,” arXiv preprint arXiv:2407.14505 , 2024

  74. [74]

    Is your world simulator a good story presenter? a consecutive events-based benchmark for future long video generation,

    Y . Wang, X. He, K. Wang, L. Ma, J. Yang, S. Wang, S. S. Du, and Y . Shen, “Is your world simulator a good story presenter? a consecutive events-based benchmark for future long video generation,” arXiv preprint arXiv:2412.16211 , 2024

  75. [75]

    LLaVA-Video: Video Instruction Tuning With Synthetic Data

    Y . Zhang, J. Wu, W. Li, B. Li, Z. Ma, Z. Liu, and C. Li, “Video instruction tuning with synthetic data,”arXiv preprint arXiv:2410.02713, 2024

  76. [76]

    Qwen2.5 Technical Report

    A. Yang, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Li, D. Liu, F. Huang, H. Wei et al. , “Qwen2. 5 technical report,” arXiv preprint arXiv:2412.15115, 2024

  77. [77]

    Simmim: A simple framework for masked image modeling,

    Z. Xie, Z. Zhang, Y . Cao, Y . Lin, J. Bao, Z. Yao, Q. Dai, and H. Hu, “Simmim: A simple framework for masked image modeling,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 9653–9663

  78. [78]

    Yolo- world: Real-time open-vocabulary object detection,

    T. Cheng, L. Song, Y . Ge, W. Liu, X. Wang, and Y . Shan, “Yolo- world: Real-time open-vocabulary object detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 16 901–16 911

  79. [79]

    Humanrefiner: Benchmarking abnormal human generation and refining with coarse-to-fine pose-reversible guidance,

    G. Fang, W. Yan, Y . Guo, J. Han, Z. Jiang, H. Xu, S. Liao, and X. Liang, “Humanrefiner: Benchmarking abnormal human generation and refining with coarse-to-fine pose-reversible guidance,” in European Conference on Computer Vision . Springer, 2024, pp. 201–217

  80. [80]

    Arcface: Additive angular margin loss for deep face recognition,

    J. Deng, J. Guo, N. Xue, and S. Zafeiriou, “Arcface: Additive angular margin loss for deep face recognition,” in CVPR, 2019
