pith · machine review for the scientific record

arxiv: 2503.21755 · v2 · submitted 2025-03-27 · 💻 cs.CV

Recognition: 3 theorem links

VBench-2.0: Advancing Video Generation Benchmark Suite for Intrinsic Faithfulness

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 18:37 UTC · model grok-4.3

classification 💻 cs.CV
keywords video generation · benchmark · intrinsic faithfulness · physics evaluation · commonsense reasoning · human fidelity · controllability · anomaly detection

The pith

VBench-2.0 introduces a benchmark that tests video generation models for intrinsic faithfulness to physical laws, human anatomy, and commonsense.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that prior video benchmarks only check superficial qualities such as visual appeal and basic prompt adherence. It introduces VBench-2.0 to measure deeper intrinsic faithfulness across five dimensions: Human Fidelity, Controllability, Creativity, Physics, and Commonsense. Each dimension includes fine-grained capabilities scored automatically by vision-language models, language models, and anomaly detection methods. Human annotations validate that these scores align with human judgment. The work targets the next step toward video models that function as reliable world simulators rather than merely plausible image sequences.

Core claim

VBench-2.0 is a benchmark suite designed to automatically evaluate video generative models for intrinsic faithfulness, meaning adherence to physical laws, commonsense reasoning, anatomical correctness, and compositional integrity. It organizes evaluation into five dimensions—Human Fidelity, Controllability, Creativity, Physics, and Commonsense—each subdivided into targeted capabilities. The framework applies tailored combinations of state-of-the-art vision-language models, large language models, and specialist anomaly detection techniques, all cross-checked against extensive human annotations.

What carries the argument

The VBench-2.0 framework, which integrates generalist VLMs and LLMs with video-specific anomaly-detection methods to score fine-grained capabilities within each faithfulness dimension.
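The paper does not give the aggregation in code; purely as an illustration, here is a minimal sketch of how per-judge scores from generalists and specialists might be combined into per-dimension and overall scores. All function names, score values, and the equal-weight averaging are hypothetical, not the authors' method.

```python
# Hypothetical sketch of a VBench-2.0-style aggregator: each dimension
# pools scores from several judges (VLMs, LLMs, anomaly detectors).
# Equal weighting and all values below are illustrative assumptions.
from statistics import mean

DIMENSIONS = ["Human Fidelity", "Controllability", "Creativity",
              "Physics", "Commonsense"]

def score_video(judge_scores):
    """judge_scores maps dimension -> list of per-judge scores in [0, 1]."""
    per_dim = {d: mean(judge_scores[d]) for d in DIMENSIONS}
    overall = mean(per_dim.values())  # naive unweighted average
    return per_dim, overall

per_dim, overall = score_video({
    "Human Fidelity": [0.8, 0.7],
    "Controllability": [0.9],
    "Creativity": [0.6, 0.5],
    "Physics": [0.4, 0.3, 0.5],   # specialists tend to score physics lower
    "Commonsense": [0.7],
})
```

A real pipeline would weight judges by dimension and calibrate against the human annotations the paper describes; the sketch only shows the shape of the aggregation.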

If this is right

  • Models achieving high VBench-2.0 scores should support more reliable AI-assisted filmmaking and simulated world modeling.
  • The five-dimension breakdown enables targeted model improvements in areas such as physics adherence or human anatomical accuracy.
  • Automatic metrics validated by human annotations can serve as scalable proxies for ongoing model development.
  • Progress on intrinsic faithfulness metrics marks a shift from visually coherent outputs to fundamentally realistic video generation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Widespread adoption could steer video model training toward explicit rule-violation penalties rather than only aesthetic rewards.
  • The same dimension structure might transfer to benchmarks for other generative domains such as 3D scene synthesis or interactive simulation.
  • Integration with embodied AI systems could use VBench-2.0 scores to predict how well generated videos translate into accurate planning or control signals.

Load-bearing premise

The combination of current top vision-language models, language models, and anomaly detectors can detect violations of physics and commonsense rules without missing subtle failures or introducing new evaluation biases.

What would settle it

A test set of generated videos where human raters consistently flag clear physics or commonsense violations that the automated VBench-2.0 scores rate as acceptable, or where the benchmark flags problems that humans accept.
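Such a settling test amounts to collecting the disagreement set between human flags and automated scores. A toy sketch of that audit, with a hypothetical acceptance threshold and made-up records:

```python
# Toy disagreement audit: find videos where human raters flag a
# violation but the automated score rates the video acceptable
# (missed failures), and the converse (false alarms).
# The threshold and all records below are hypothetical.
ACCEPT_THRESHOLD = 0.5  # auto score above this counts as "acceptable"

videos = [
    {"id": "v1", "auto_score": 0.8, "human_flagged": True},   # missed failure
    {"id": "v2", "auto_score": 0.3, "human_flagged": False},  # false alarm
    {"id": "v3", "auto_score": 0.9, "human_flagged": False},  # agreement
]

missed = [v["id"] for v in videos
          if v["human_flagged"] and v["auto_score"] > ACCEPT_THRESHOLD]
false_alarms = [v["id"] for v in videos
                if not v["human_flagged"] and v["auto_score"] <= ACCEPT_THRESHOLD]
```

A non-empty `missed` set on clear physics violations is exactly the failure mode that would undercut the benchmark's central reliability claim.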

Original abstract

Video generation has advanced significantly, evolving from producing unrealistic outputs to generating videos that appear visually convincing and temporally coherent. To evaluate these video generative models, benchmarks such as VBench have been developed to assess their faithfulness, measuring factors like per-frame aesthetics, temporal consistency, and basic prompt adherence. However, these aspects mainly represent superficial faithfulness, which focus on whether the video appears visually convincing rather than whether it adheres to real-world principles. While recent models perform increasingly well on these metrics, they still struggle to generate videos that are not just visually plausible but fundamentally realistic. To achieve real "world models" through video generation, the next frontier lies in intrinsic faithfulness to ensure that generated videos adhere to physical laws, commonsense reasoning, anatomical correctness, and compositional integrity. Achieving this level of realism is essential for applications such as AI-assisted filmmaking and simulated world modeling. To bridge this gap, we introduce VBench-2.0, a next-generation benchmark designed to automatically evaluate video generative models for their intrinsic faithfulness. VBench-2.0 assesses five key dimensions: Human Fidelity, Controllability, Creativity, Physics, and Commonsense, each further broken down into fine-grained capabilities. Tailored to individual dimensions, our evaluation framework integrates generalists such as SOTA VLMs and LLMs, and specialists, including anomaly detection methods proposed for video generation. We conduct extensive human annotations to ensure evaluation alignment with human judgment. By pushing beyond superficial faithfulness toward intrinsic faithfulness, VBench-2.0 aims to set a new standard for the next generation of video generative models in pursuit of intrinsic faithfulness.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The manuscript introduces VBench-2.0, a next-generation benchmark for video generative models that shifts evaluation from superficial faithfulness (aesthetics, temporal consistency, prompt adherence) to intrinsic faithfulness. It defines five dimensions—Human Fidelity, Controllability, Creativity, Physics, and Commonsense—each decomposed into fine-grained capabilities, and proposes an automated evaluation pipeline that combines SOTA VLMs, LLMs, and anomaly detectors, with alignment ensured via extensive human annotations.

Significance. If the automated evaluators prove reliable, VBench-2.0 would provide a scalable, reproducible standard that pushes video generation research toward genuine world-modeling capabilities rather than visually convincing but physically implausible outputs. The explicit human-validation step and dimension-specific specialist modules are concrete strengths that could accelerate progress in AI-assisted filmmaking and simulated environments.

major comments (2)
  1. [§4.3] (Physics dimension evaluation): the claim that the anomaly-detection + VLM pipeline reliably identifies subtle violations (e.g., incorrect object trajectories under gravity or implausible causal sequences) rests on correlation with human raters on a validation subset; this does not demonstrate coverage of edge cases where the same VLMs exhibit known reasoning failures, leaving the central reliability claim under-supported.
  2. [Table 5] (human-automated alignment): the reported Pearson correlations for the Commonsense and Physics dimensions are computed on prompts that appear coarse-grained; without an explicit stress-test on fine-grained violation prompts, it is unclear whether high alignment on the validation set generalizes to the full benchmark distribution.
minor comments (3)
  1. [§2.1] The distinction between “superficial” and “intrinsic” faithfulness is introduced without a formal definition or reference to prior literature on physical commonsense in video; a short clarifying paragraph would improve precision.
  2. [Figure 3] Axis labels on the radar plots for the five dimensions are too small to read in print; increasing font size or adding a legend table would aid clarity.
  3. [§5.2] Several citations to the original VBench paper are given only by name without year or arXiv identifier; adding full references would help readers locate the baseline metrics.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for minor revision. We address each major comment below, explaining our response and the revisions we will incorporate.

Point-by-point responses
  1. Referee: [§4.3] (Physics dimension evaluation): the claim that the anomaly-detection + VLM pipeline reliably identifies subtle violations (e.g., incorrect object trajectories under gravity or implausible causal sequences) rests on correlation with human raters on a validation subset; this does not demonstrate coverage of edge cases where the same VLMs exhibit known reasoning failures, leaving the central reliability claim under-supported.

    Authors: We appreciate the referee's point that correlation on a validation subset does not automatically guarantee coverage of all VLM reasoning failure modes in edge cases. Our human annotation protocol was designed to include diverse physical violation scenarios, and the specialist anomaly detectors were introduced precisely to compensate for known VLM limitations in causal and trajectory reasoning. In the revision we will add an explicit limitations paragraph in §4.3 together with qualitative examples of edge cases where the combined pipeline succeeds or fails, thereby making the reliability argument more transparent. This is a partial revision because we build on the existing human-validated data rather than collecting new annotations. revision: partial

  2. Referee: [Table 5] (human-automated alignment): the reported Pearson correlations for the Commonsense and Physics dimensions are computed on prompts that appear coarse-grained; without an explicit stress-test on fine-grained violation prompts, it is unclear whether high alignment on the validation set generalizes to the full benchmark distribution.

    Authors: We acknowledge that the prompts used for the reported correlations may appear predominantly coarse-grained. To demonstrate generalization, we will add a new appendix section containing a stress-test on a curated set of fine-grained violation prompts (drawn from the same human-annotation pool) and will report the updated Pearson correlations for both dimensions. This addition directly addresses the concern and will be included in the revised manuscript. revision: yes
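The alignment evidence debated above reduces to a Pearson correlation between human and automated scores on the same videos. A self-contained sketch of that computation, with made-up score lists standing in for the paper's annotation data:

```python
# Pearson correlation between human and automated scores, computed
# from scratch. The two score lists below are fabricated solely to
# illustrate the alignment check; they are not the paper's data.
import math

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

human = [0.9, 0.2, 0.6, 0.8, 0.3]   # hypothetical human ratings
auto  = [0.85, 0.3, 0.55, 0.9, 0.25]  # hypothetical automated scores
r = pearson(human, auto)  # r near 1 indicates human-automated alignment
```

The referee's point is that a high r on coarse-grained prompts does not guarantee a high r on a stress-test set of fine-grained violation prompts, which is why the promised appendix matters.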

Circularity Check

0 steps flagged

No significant circularity; benchmark relies on external VLMs/LLMs with human validation

Full rationale

The paper introduces VBench-2.0 as an evaluation framework for video generation models across five dimensions (Human Fidelity, Controllability, Creativity, Physics, Commonsense). It integrates pre-existing SOTA VLMs, LLMs, and anomaly detectors, then aligns them via human annotations. No mathematical derivations, fitted parameters, or predictions appear in the provided text. The reference to the prior 'VBench' work is a minor contextual citation and not load-bearing for the central claims, which consist of new dimension definitions and an external evaluation pipeline. No self-definitional loops, fitted-input predictions, or ansatz smuggling are present. The construction is validated against external benchmarks and human judgment rather than against its own outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work introduces no new mathematical derivations, free parameters, or invented physical entities; it relies on existing pre-trained VLMs and LLMs plus human annotation protocols as evaluation tools.

pith-pipeline@v0.9.0 · 5626 in / 1110 out tokens · 32570 ms · 2026-05-14T18:37:01.650654+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • LawOfExistence defect_zero_iff_one · tag: echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    To achieve real “world models” through video generation, the next frontier lies in intrinsic faithfulness to ensure that generated videos adhere to physical laws, commonsense reasoning, anatomical correctness, and compositional integrity.

What do these tags mean?
matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 26 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. PhysInOne: Visual Physics Learning and Reasoning in One Suite

    cs.CV 2026-04 unverdicted novelty 8.0

    PhysInOne is a new dataset of 2 million videos across 153,810 dynamic 3D scenes covering 71 physical phenomena, shown to improve AI performance on physics-aware video generation, prediction, property estimation, and m...

  2. EntityBench: Towards Entity-Consistent Long-Range Multi-Shot Video Generation

    cs.CV 2026-05 conditional novelty 7.0

    EntityBench is a new benchmark with detailed per-shot entity schedules from real media, and the EntityMem baseline using persistent per-entity memory achieves the highest character fidelity with Cohen's d of +2.33.

  3. KVPO: ODE-Native GRPO for Autoregressive Video Alignment via KV Semantic Exploration

    cs.CV 2026-05 unverdicted novelty 7.0

    KVPO aligns streaming autoregressive video generators with human preferences via ODE-native GRPO, using KV cache for semantic exploration and TVE for velocity-based policy modeling, yielding gains in quality and alignment.

  4. PhyGround: Benchmarking Physical Reasoning in Generative World Models

    cs.CV 2026-05 accept novelty 7.0

    PhyGround is a new benchmark with curated prompts, a 13-law taxonomy, large-scale human annotations, and an open physics-specialized VLM judge for evaluating physical reasoning in generative video models.

  5. WorldJen: An End-to-End Multi-Dimensional Benchmark for Generative Video Models

    cs.CV 2026-05 unverdicted novelty 7.0

    WorldJen is a new benchmark for generative video models that uses VLM-judged multi-dimensional Likert questionnaires validated against human preferences to achieve perfect tier agreement.

  6. Hallo-Live: Real-Time Streaming Joint Audio-Video Avatar Generation with Asynchronous Dual-Stream and Human-Centric Preference Distillation

    cs.CV 2026-04 unverdicted novelty 7.0

    Hallo-Live achieves 20.38 FPS real-time text-to-audio-video avatar generation with 0.94s latency using asynchronous dual-stream diffusion and HP-DMD preference distillation, matching teacher model quality at 16x highe...

  7. RoboWM-Bench: A Benchmark for Evaluating World Models in Robotic Manipulation

    cs.RO 2026-04 unverdicted novelty 7.0

    RoboWM-Bench evaluates video world models by converting their outputs into executable robot actions and running them on manipulation tasks, showing that physical inconsistencies remain common.

  8. RoboWM-Bench: A Benchmark for Evaluating World Models in Robotic Manipulation

    cs.RO 2026-04 unverdicted novelty 7.0

    RoboWM-Bench evaluates video world models by converting their manipulation video predictions into executable actions validated in simulation, showing that visual plausibility does not guarantee physical executability.

  9. AnimationBench: Are Video Models Good at Character-Centric Animation?

    cs.CV 2026-04 unverdicted novelty 7.0

    AnimationBench is the first benchmark that operationalizes the twelve basic principles of animation and IP preservation into scalable, VLM-assisted metrics for animation-style I2V generation.

  10. Grounded Forcing: Bridging Time-Independent Semantics and Proximal Dynamics in Autoregressive Video Synthesis

    cs.CV 2026-04 unverdicted novelty 7.0

    Grounded Forcing introduces dual memory caching, reference-based positional embeddings, and proximity-weighted recaching to bridge stable semantics with local dynamics, improving long-range consistency in autoregressi...

  11. Delta Forcing: Trust Region Steering for Interactive Autoregressive Video Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    Delta Forcing uses latent trajectory deltas to adaptively limit unreliable teacher guidance while enforcing monotonic continuity, improving temporal consistency in interactive autoregressive video generation.

  12. PhyMotion: Structured 3D Motion Reward for Physics-Grounded Human Video Generation

    cs.CV 2026-05 conditional novelty 6.0

    PhyMotion scores generated human videos by grounding recovered 3D poses in a physics simulator across kinematic, contact, and dynamic axes, yielding stronger human correlation and larger RL post-training gains than pr...

  13. WorldReasonBench: Human-Aligned Stress Testing of Video Generators as Future World-State Predictors

    cs.CV 2026-05 unverdicted novelty 6.0

    The paper presents WorldReasonBench, a benchmark that tests video generators on maintaining physical, social, logical, and informational consistency when predicting future states from initial conditions and actions.

  14. SARA: Semantically Adaptive Relational Alignment for Video Diffusion Models

    cs.CV 2026-05 unverdicted novelty 6.0

    SARA improves text alignment and motion quality in video diffusion models by routing token-relation distillation supervision to semantically salient pairs using a Stage-1 aligner trained with SAM masks and InfoNCE.

  15. WorldJen: An End-to-End Multi-Dimensional Benchmark for Generative Video Models

    cs.CV 2026-05 conditional novelty 6.0

    WorldJen is a multi-dimensional video generation benchmark using VLM-graded Likert questionnaires on joint prompts, validated to match human three-tier rankings.

  16. HuM-Eval: A Coarse-to-Fine Framework for Human-Centric Video Evaluation

    cs.CV 2026-04 unverdicted novelty 6.0

    HuM-Eval evaluates human motion videos with a coarse-to-fine approach using VLM global checks plus 2D pose and 3D motion analysis, reaching 58.2% average correlation with human judgments and introducing a 1000-prompt ...

  17. Seeing Fast and Slow: Learning the Flow of Time in Videos

    cs.CV 2026-04 unverdicted novelty 6.0

    Self-supervised models learn to perceive and manipulate the flow of time in videos, supporting speed detection, large-scale slow-motion data curation, and temporally controllable video synthesis.

  18. Long-CODE: Isolating Pure Long-Context as an Orthogonal Dimension in Video Evaluation

    cs.CV 2026-04 unverdicted novelty 6.0

    Long-CODE isolates long-context video evaluation with a new benchmark dataset and shot-dynamics metric that correlates better with human judgments on narrative richness and global consistency than short-video metrics.

  19. VGA-Bench: A Unified Benchmark and Multi-Model Framework for Video Aesthetics and Generation Quality Evaluation

    cs.CV 2026-04 unverdicted novelty 6.0

    VGA-Bench creates a three-tier taxonomy, 1,016-prompt dataset of 60k+ videos, and three multi-task neural models (VAQA-Net, VTag-Net, VGQA-Net) that align with human judgments for video aesthetics and generation quality.

  20. ImVideoEdit: Image-learning Video Editing via 2D Spatial Difference Attention Blocks

    cs.CV 2026-04 unverdicted novelty 6.0

    ImVideoEdit learns video editing from 13K image pairs by decoupling spatial modifications from frozen temporal dynamics in pretrained models, matching larger video-trained systems in fidelity and consistency.

  21. Diffusion-APO: Trajectory-Aware Direct Preference Alignment for Video Diffusion Transformers

    cs.CV 2026-05 unverdicted novelty 5.0

    Diffusion-APO synchronizes training noise with inference trajectories in video diffusion models to improve preference alignment and visual quality.

  22. LoViF 2026 The First Challenge on Holistic Quality Assessment for 4D World Model (PhyScore)

    cs.CV 2026-05 conditional novelty 5.0

    The PhyScore challenge creates the first benchmark requiring metrics to jointly score video quality, physical realism, condition alignment, and temporal consistency while localizing physical anomalies in 1554 videos f...

  23. Phantom: Physics-Infused Video Generation via Joint Modeling of Visual and Latent Physical Dynamics

    cs.CV 2026-04 unverdicted novelty 5.0

    Phantom generates visually realistic and physically consistent videos by jointly modeling visual content and latent physical dynamics via an abstract physics-aware representation.

  24. World Action Models: The Next Frontier in Embodied AI

    cs.RO 2026-05 unverdicted novelty 4.0

    The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.

  25. Mamoda2.5: Enhancing Unified Multimodal Model with DiT-MoE

    cs.CV 2026-05 unverdicted novelty 4.0

    Mamoda2.5 is a 25B-parameter DiT-MoE unified AR-Diffusion model that reaches top video generation and editing benchmarks with 4-step inference up to 95.9x faster than baselines.

  26. Evolution of Video Generative Foundations

    cs.CV 2026-04 unverdicted novelty 2.0

    This survey traces video generation technology from GANs to diffusion models and then to autoregressive and multimodal approaches while analyzing principles, strengths, and future trends.

Reference graph

Works this paper leans on

96 extracted references · 96 canonical work pages · cited by 24 Pith papers · 17 internal anchors


  47. [47]

    Taming transformers for high- resolution image synthesis,

    P. Esser, R. Rombach, and B. Ommer, “Taming transformers for high- resolution image synthesis,” in CVPR, 2021

  48. [48]

    SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

    D. Podell, Z. English, K. Lacey, A. Blattmann, T. Dockhorn, J. M ¨uller, J. Penna, and R. Rombach, “Sdxl: Improving latent diffusion models for high-resolution image synthesis,” arXiv preprint arXiv:2307.01952 , 2023

  49. [49]

    Magvit: Masked generative video transformer,

    L. Yu, Y . Cheng, K. Sohn, J. Lezama, H. Zhang, H. Chang, A. G. Hauptmann, M.-H. Yang, Y . Hao, I. Essa et al. , “Magvit: Masked generative video transformer,” in CVPR, 2023

  50. [50]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” arXiv preprint arXiv:2010.11929 , 2020

  51. [51]

    Scalable Diffusion Models with Transformers

    W. Peebles and S. Xie, “Scalable diffusion models with transformers,” arXiv preprint arXiv:2212.09748 , 2022

  52. [52]

    Videofusion: Decomposed diffusion models for high-quality video generation,

    Z. Luo, D. Chen, Y . Zhang, Y . Huang, L. Wang, Y . Shen, D. Zhao, J. Zhou, and T. Tan, “Videofusion: Decomposed diffusion models for high-quality video generation,” in CVPR, 2023. 13

  53. [53]

    Latent Video Diffusion Models for High-Fidelity Long Video Generation

    Y . He, T. Yang, Y . Zhang, Y . Shan, and Q. Chen, “Latent video diffusion models for high-fidelity video generation with arbitrary lengths,” arXiv preprint arXiv:2211.13221, 2022

  54. [54]

    Magicvideo: Efficient video generation with latent diffusion models

    D. Zhou, W. Wang, H. Yan, W. Lv, Y . Zhu, and J. Feng, “Magicvideo: Efficient video generation with latent diffusion models,” arXiv preprint arXiv:2211.11018, 2023

  55. [55]

    Show-1: Marrying pixel and latent diffusion models for text-to-video generation

    D. J. Zhang, J. Z. Wu, J.-W. Liu, R. Zhao, L. Ran, Y . Gu, D. Gao, and M. Z. Shou, “Show-1: Marrying pixel and latent diffusion models for text-to-video generation,” arXiv preprint arXiv:2309.15818 , 2023

  56. [56]

    Preserve your own correlation: A noise prior for video diffusion models,

    S. Ge, S. Nah, G. Liu, T. Poon, A. Tao, B. Catanzaro, D. Jacobs, J.- B. Huang, M.-Y . Liu, and Y . Balaji, “Preserve your own correlation: A noise prior for video diffusion models,” in ICCV, 2023

  57. [57]

    Align your latents: High-resolution video synthesis with latent diffusion models,

    A. Blattmann, R. Rombach, H. Ling, T. Dockhorn, S. W. Kim, S. Fidler, and K. Kreis, “Align your latents: High-resolution video synthesis with latent diffusion models,” in CVPR, 2023

  58. [58]

    Text2video-zero: Text-to- image diffusion models are zero-shot video generators.arXiv preprint arXiv:2303.13439, 2023

    L. Khachatryan, A. Movsisyan, V . Tadevosyan, R. Henschel, Z. Wang, S. Navasardyan, and H. Shi, “Text2video-zero: Text-to-image dif- fusion models are zero-shot video generators,” arXiv preprint arXiv:2303.13439, 2023

  59. [59]

    CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

    Z. Yang, J. Teng, W. Zheng, M. Ding, S. Huang, J. Xu, Y . Yang, W. Hong, X. Zhang, G. Feng et al., “Cogvideox: Text-to-video diffusion models with an expert transformer,” arXiv preprint arXiv:2408.06072 , 2024

  60. [60]

    Movie Gen: A Cast of Media Foundation Models

    A. Polyak, A. Zohar, A. Brown, A. Tjandra, A. Sinha, A. Lee, A. Vyas, B. Shi, C.-Y . Ma, C.-Y . Chuang, D. Yan, D. Choudhary, D. Wang, G. Sethi, G. Pang, H. Ma, I. Misra, J. Hou, J. Wang, K. Jagadeesh, K. Li, L. Zhang, M. Singh, M. Williamson, M. Le, M. Yu, M. K. Singh, P. Zhang, P. Vajda, Q. Duval, R. Girdhar, R. Sumbaly, S. S. Rambhatla, S. Tsai, S. Aza...

  61. [61]

    Wan: Open and advanced large-scale video generative models,

    W. Team, “Wan: Open and advanced large-scale video generative models,” 2025

  62. [62]
  63. [63]

    Team, “Minmax,” Accessed August 31, 2024 [Online] https: //hailuoai.com/, 2023

    M. Team, “Minmax,” Accessed August 31, 2024 [Online] https: //hailuoai.com/, 2023. [Online]. Available: https://hailuoai.com/

  64. [64]

    Vchitect-2.0: Parallel transformer for scaling up video diffusion models,

    W. Fan, C. Si, J. Song, Z. Yang, Y . He, L. Zhuo, Z. Huang, Z. Dong, J. He, D. Pan et al. , “Vchitect-2.0: Parallel transformer for scaling up video diffusion models,” arXiv preprint arXiv:2501.08453 , 2025

  65. [65]

    Repvideo: Rethink- ing cross-layer representation for video generation,

    C. Si, W. Fan, Z. Lv, Z. Huang, Y . Qiao, and Z. Liu, “Repvideo: Rethink- ing cross-layer representation for video generation,” arXiv 2501.08994, 2025

  66. [66]

    Open-sora 2.0: Training a commercial-level video generation model in $200 k

    X. Peng, Z. Zheng, C. Shen, T. Young, X. Guo, B. Wang, H. Xu, H. Liu, M. Jiang, W. Li, Y . Wang, A. Ye, G. Ren, Q. Ma, W. Liang, X. Lian, X. Wu, Y . Zhong, Z. Li, C. Gong, G. Lei, L. Cheng, L. Zhang, M. Li, R. Zhang, S. Hu, S. Huang, X. Wang, Y . Zhao, Y . Wang, Z. Wei, and Y . You, “Open-sora 2.0: Training a commercial-level video generation model in $20...

  67. [67]

    GANs trained by a two time-scale update rule converge to a local nash equilibrium,

    M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter, “GANs trained by a two time-scale update rule converge to a local nash equilibrium,” in NeurIPS, 2017

  68. [68]

    Improved techniques for training gans,

    T. Salimans, I. Goodfellow, W. Zaremba, V . Cheung, A. Radford, X. Chen, and X. Chen, “Improved techniques for training gans,” in NeurIPS, 2016

  69. [69]

    FVD: A new metric for video generation,

    T. Unterthiner, S. van Steenkiste, K. Kurach, R. Marinier, M. Michalski, and S. Gelly, “FVD: A new metric for video generation,” in ICLRW, 2019

  70. [70]

    Fetv: A benchmark for fine-grained evaluation of open-domain text-to-video generation,

    Y . Liu, L. Li, S. Ren, R. Gao, S. Li, S. Chen, X. Sun, and L. Hou, “Fetv: A benchmark for fine-grained evaluation of open-domain text-to-video generation,” in NeurIPS, 2023

  71. [71]

    Evaluation agent: Efficient and promptable evaluation framework for visual generative models,

    F. Zhang, S. Tian, Z. Huang, Y . Qiao, and Z. Liu, “Evaluation agent: Efficient and promptable evaluation framework for visual generative models,” arXiv preprint arXiv:2412.09645 , 2024

  72. [72]

    Towards world simulator: Crafting physical commonsense-based benchmark for video generation.arXiv preprint arXiv:2410.05363,

    F. Meng, J. Liao, X. Tan, W. Shao, Q. Lu, K. Zhang, Y . Cheng, D. Li, Y . Qiao, and P. Luo, “Towards world simulator: Crafting physical commonsense-based benchmark for video generation,” arXiv preprint arXiv:2410.05363, 2024

  73. [73]

    T2V- CompBench: A comprehensive benchmark for compositional text-to-video generation.arXiv preprint arXiv:2407.14505,

    K. Sun, K. Huang, X. Liu, Y . Wu, Z. Xu, Z. Li, and X. Liu, “T2v- compbench: A comprehensive benchmark for compositional text-to- video generation,” arXiv preprint arXiv:2407.14505 , 2024

  74. [74]

    Is your world simulator a good story presenter? a consecutive events-based benchmark for future long video generation,

    Y . Wang, X. He, K. Wang, L. Ma, J. Yang, S. Wang, S. S. Du, and Y . Shen, “Is your world simulator a good story presenter? a consecutive events-based benchmark for future long video generation,” arXiv preprint arXiv:2412.16211 , 2024

  75. [75]

    LLaVA-Video: Video Instruction Tuning With Synthetic Data

    Y . Zhang, J. Wu, W. Li, B. Li, Z. Ma, Z. Liu, and C. Li, “Video instruction tuning with synthetic data,”arXiv preprint arXiv:2410.02713, 2024

  76. [76]

    Qwen2.5 Technical Report

    A. Yang, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Li, D. Liu, F. Huang, H. Wei et al. , “Qwen2. 5 technical report,” arXiv preprint arXiv:2412.15115, 2024

  77. [77]

    Simmim: A simple framework for masked image modeling,

    Z. Xie, Z. Zhang, Y . Cao, Y . Lin, J. Bao, Z. Yao, Q. Dai, and H. Hu, “Simmim: A simple framework for masked image modeling,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 9653–9663

  78. [78]

    Yolo- world: Real-time open-vocabulary object detection,

    T. Cheng, L. Song, Y . Ge, W. Liu, X. Wang, and Y . Shan, “Yolo- world: Real-time open-vocabulary object detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 16 901–16 911

  79. [79]

    Humanrefiner: Benchmarking abnormal human generation and refining with coarse-to-fine pose-reversible guidance,

    G. Fang, W. Yan, Y . Guo, J. Han, Z. Jiang, H. Xu, S. Liao, and X. Liang, “Humanrefiner: Benchmarking abnormal human generation and refining with coarse-to-fine pose-reversible guidance,” in European Conference on Computer Vision . Springer, 2024, pp. 201–217

  80. [80]

    Arcface: Additive angular margin loss for deep face recognition,

    J. Deng, J. Guo, N. Xue, and S. Zafeiriou, “Arcface: Additive angular margin loss for deep face recognition,” in CVPR, 2019
