Recognition: 3 theorem links
VBench-2.0: Advancing Video Generation Benchmark Suite for Intrinsic Faithfulness
Pith reviewed 2026-05-14 18:37 UTC · model grok-4.3
The pith
VBench-2.0 introduces a benchmark that tests video generation models for intrinsic faithfulness to physical laws, human anatomy, and commonsense.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
VBench-2.0 is a benchmark suite designed to automatically evaluate video generative models for intrinsic faithfulness, meaning adherence to physical laws, commonsense reasoning, anatomical correctness, and compositional integrity. It organizes evaluation into five dimensions—Human Fidelity, Controllability, Creativity, Physics, and Commonsense—each subdivided into targeted capabilities. The framework applies tailored combinations of state-of-the-art vision-language models, large language models, and specialist anomaly detection techniques, all cross-checked against extensive human annotations.
What carries the argument
The VBench-2.0 framework integrates generalist VLMs and LLMs with video-specific anomaly detection methods to score fine-grained capabilities within each faithfulness dimension.
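The excerpt does not spell out how the per-judge outputs are combined into a dimension score. As a minimal illustration only, the sketch below shows one plausible way such a pipeline could weight and aggregate scores from a VLM, an LLM, and an anomaly detector; every name, weight, and capability label here is a hypothetical placeholder, not the paper's actual method.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class EvaluatorOutput:
    """Score in [0, 1] from one automated judge for one generated video."""
    source: str    # hypothetical labels such as "vlm", "llm", "anomaly_detector"
    score: float


def capability_score(outputs, weights):
    """Weighted combination of per-judge scores for one fine-grained capability.

    The weights are illustrative placeholders; VBench-2.0 tailors the evaluator
    mix per dimension, and its actual combination rule is not reproduced here.
    """
    norm = sum(weights.get(o.source, 0.0) for o in outputs)
    if norm == 0:
        return 0.0
    return sum(weights.get(o.source, 0.0) * o.score for o in outputs) / norm


def dimension_score(capability_scores):
    """Average the fine-grained capability scores into one dimension score."""
    return mean(capability_scores.values())


# Hypothetical usage for a single video on the Physics dimension.
physics = dimension_score({
    "object_trajectory": capability_score(
        [EvaluatorOutput("vlm", 0.8), EvaluatorOutput("anomaly_detector", 0.6)],
        weights={"vlm": 0.5, "anomaly_detector": 0.5},
    ),
    "causal_order": capability_score(
        [EvaluatorOutput("llm", 0.9)],
        weights={"llm": 1.0},
    ),
})
```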
If this is right
- Models achieving high VBench-2.0 scores should support more reliable AI-assisted filmmaking and simulated world modeling.
- The five-dimension breakdown enables targeted model improvements in areas such as physics adherence or human anatomical accuracy.
- Automatic metrics validated by human annotations can serve as scalable proxies for ongoing model development.
- Progress on intrinsic faithfulness metrics marks a shift from visually coherent outputs to fundamentally realistic video generation.
Where Pith is reading between the lines
- Widespread adoption could steer video model training toward explicit rule-violation penalties rather than only aesthetic rewards.
- The same dimension structure might transfer to benchmarks for other generative domains such as 3D scene synthesis or interactive simulation.
- Integration with embodied AI systems could use VBench-2.0 scores to predict how well generated videos translate into accurate planning or control signals.
Load-bearing premise
The combination of current top vision-language models, language models, and anomaly detectors can detect violations of physics and commonsense rules without missing subtle failures or introducing new evaluation biases.
What would settle it
A test set of generated videos where human raters consistently flag clear physics or commonsense violations that the automated VBench-2.0 scores rate as acceptable, or where the benchmark flags problems that humans accept.
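A minimal sketch of how that settling test could be tallied, assuming each video carries a human violation flag and an automated score in [0, 1]; the threshold and field names are assumptions for illustration, not part of VBench-2.0.

```python
# Hypothetical acceptance cut-off for the automated score; not from the paper.
ACCEPT_THRESHOLD = 0.7


def disagreement_cases(videos):
    """Split videos into the two disagreement patterns described above.

    Each video is a dict with a boolean 'human_flags_violation' and a float
    'auto_score' in [0, 1]; both field names are assumptions for this sketch.
    """
    missed = []    # humans flag a violation, the benchmark rates it acceptable
    spurious = []  # the benchmark flags a problem, humans accept the video
    for video in videos:
        auto_accepts = video["auto_score"] >= ACCEPT_THRESHOLD
        if video["human_flags_violation"] and auto_accepts:
            missed.append(video)
        elif not video["human_flags_violation"] and not auto_accepts:
            spurious.append(video)
    return missed, spurious


missed, spurious = disagreement_cases([
    {"human_flags_violation": True, "auto_score": 0.85},   # counted as missed
    {"human_flags_violation": False, "auto_score": 0.40},  # counted as spurious
    {"human_flags_violation": True, "auto_score": 0.30},   # agreement, ignored
])
```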
Original abstract
Video generation has advanced significantly, evolving from producing unrealistic outputs to generating videos that appear visually convincing and temporally coherent. To evaluate these video generative models, benchmarks such as VBench have been developed to assess their faithfulness, measuring factors like per-frame aesthetics, temporal consistency, and basic prompt adherence. However, these aspects mainly represent superficial faithfulness, which focus on whether the video appears visually convincing rather than whether it adheres to real-world principles. While recent models perform increasingly well on these metrics, they still struggle to generate videos that are not just visually plausible but fundamentally realistic. To achieve real "world models" through video generation, the next frontier lies in intrinsic faithfulness to ensure that generated videos adhere to physical laws, commonsense reasoning, anatomical correctness, and compositional integrity. Achieving this level of realism is essential for applications such as AI-assisted filmmaking and simulated world modeling. To bridge this gap, we introduce VBench-2.0, a next-generation benchmark designed to automatically evaluate video generative models for their intrinsic faithfulness. VBench-2.0 assesses five key dimensions: Human Fidelity, Controllability, Creativity, Physics, and Commonsense, each further broken down into fine-grained capabilities. Tailored to individual dimensions, our evaluation framework integrates generalists such as SOTA VLMs and LLMs, and specialists, including anomaly detection methods proposed for video generation. We conduct extensive human annotations to ensure evaluation alignment with human judgment. By pushing beyond superficial faithfulness toward intrinsic faithfulness, VBench-2.0 aims to set a new standard for the next generation of video generative models in pursuit of intrinsic faithfulness.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces VBench-2.0, a next-generation benchmark for video generative models that shifts evaluation from superficial faithfulness (aesthetics, temporal consistency, prompt adherence) to intrinsic faithfulness. It defines five dimensions—Human Fidelity, Controllability, Creativity, Physics, and Commonsense—each decomposed into fine-grained capabilities, and proposes an automated evaluation pipeline that combines SOTA VLMs, LLMs, and anomaly detectors, with alignment ensured via extensive human annotations.
Significance. If the automated evaluators prove reliable, VBench-2.0 would provide a scalable, reproducible standard that pushes video generation research toward genuine world-modeling capabilities rather than visually convincing but physically implausible outputs. The explicit human-validation step and dimension-specific specialist modules are concrete strengths that could accelerate progress in AI-assisted filmmaking and simulated environments.
Major comments (2)
- §4.3 (Physics dimension evaluation): the claim that the anomaly-detection + VLM pipeline reliably identifies subtle violations (e.g., incorrect object trajectories under gravity or implausible causal sequences) rests on correlation with human raters on a validation subset; this does not demonstrate coverage of edge cases where the same VLMs exhibit known reasoning failures, leaving the central reliability claim under-supported.
- Table 5 (human-automated alignment): the reported Pearson correlations for the Commonsense and Physics dimensions are computed on prompts that appear coarse-grained; without an explicit stress-test on fine-grained violation prompts, it is unclear whether high alignment on the validation set generalizes to the full benchmark distribution.
Minor comments (3)
- §2.1: the distinction between “superficial” and “intrinsic” faithfulness is introduced without a formal definition or reference to prior literature on physical commonsense in video; a short clarifying paragraph would improve precision.
- Figure 3: axis labels on the radar plots for the five dimensions are too small to read in print; increasing font size or adding a legend table would aid clarity.
- §5.2: several citations to the original VBench paper are given only by name without year or arXiv identifier; adding full references would help readers locate the baseline metrics.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and the recommendation for minor revision. We address each major comment below, explaining our response and the revisions we will incorporate.
Point-by-point responses
-
Referee: §4.3 (Physics dimension evaluation): the claim that the anomaly-detection + VLM pipeline reliably identifies subtle violations (e.g., incorrect object trajectories under gravity or implausible causal sequences) rests on correlation with human raters on a validation subset; this does not demonstrate coverage of edge cases where the same VLMs exhibit known reasoning failures, leaving the central reliability claim under-supported.
Authors: We appreciate the referee's point that correlation on a validation subset does not automatically guarantee coverage of all VLM reasoning failure modes in edge cases. Our human annotation protocol was designed to include diverse physical violation scenarios, and the specialist anomaly detectors were introduced precisely to compensate for known VLM limitations in causal and trajectory reasoning. In the revision we will add an explicit limitations paragraph in §4.3, together with qualitative examples of edge cases where the combined pipeline succeeds or fails, thereby making the reliability argument more transparent. This is a partial revision because we build on the existing human-validated data rather than collecting new annotations. Revision: partial.
-
Referee: Table 5 (human-automated alignment): the reported Pearson correlations for the Commonsense and Physics dimensions are computed on prompts that appear coarse-grained; without an explicit stress-test on fine-grained violation prompts, it is unclear whether high alignment on the validation set generalizes to the full benchmark distribution.
Authors: We acknowledge that the prompts used for the reported correlations may appear predominantly coarse-grained. To demonstrate generalization, we will add a new appendix section containing a stress-test on a curated set of fine-grained violation prompts (drawn from the same human-annotation pool) and will report the updated Pearson correlations for both dimensions. This addition directly addresses the concern and will be included in the revised manuscript. Revision: yes. (An illustrative alignment computation is sketched below.)
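As context for the alignment numbers discussed in this exchange, the following sketch shows how a per-dimension Pearson correlation between automated and human scores could be computed and compared across coarse- and fine-grained prompt subsets; the data values and the split itself are made-up placeholders, not figures from the paper.

```python
import numpy as np


def pearson_alignment(auto_scores, human_scores):
    """Pearson correlation between automated and human scores for one dimension.

    Equivalent to np.corrcoef(auto, human)[0, 1]; written out so the formula
    is explicit.
    """
    a = np.asarray(auto_scores, dtype=float)
    h = np.asarray(human_scores, dtype=float)
    a_c, h_c = a - a.mean(), h - h.mean()
    return float((a_c @ h_c) / (np.linalg.norm(a_c) * np.linalg.norm(h_c)))


# Hypothetical stress-test: compute alignment separately on coarse-grained and
# fine-grained violation prompts; all scores below are invented for illustration.
coarse_r = pearson_alignment([0.9, 0.7, 0.8, 0.6], [0.85, 0.75, 0.80, 0.55])
fine_r = pearson_alignment([0.6, 0.9, 0.4, 0.7], [0.30, 0.80, 0.50, 0.45])
```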
Circularity Check
No significant circularity; benchmark relies on external VLMs/LLMs with human validation
Full rationale
The paper introduces VBench-2.0 as an evaluation framework for video generation models across five dimensions (Human Fidelity, Controllability, Creativity, Physics, Commonsense). It integrates pre-existing SOTA VLMs, LLMs, and anomaly detectors, then aligns them via human annotations. No mathematical derivations, fitted parameters, or predictions appear in the provided text. The reference to prior 'VBench' work is a minor contextual citation and not load-bearing for the central claims, which consist of new dimension definitions and an external evaluation pipeline. No self-definitional loops, fitted-input predictions, or ansatz smuggling are present. The construction is grounded in external benchmarks and human validation rather than in its own outputs.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
LawOfExistence.defect_zero_iff_one (echoes)
echoes: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
To achieve real “world models” through video generation, the next frontier lies in intrinsic faithfulness to ensure that generated videos adhere to physical laws, commonsense reasoning, anatomical correctness, and compositional integrity.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 26 Pith papers
-
PhysInOne: Visual Physics Learning and Reasoning in One Suite
PhysInOne is a new dataset of 2 million videos across 153,810 dynamic 3D scenes covering 71 physical phenomena, shown to improve AI performance on physics-aware video generation, prediction, property estimation, and m...
-
EntityBench: Towards Entity-Consistent Long-Range Multi-Shot Video Generation
EntityBench is a new benchmark with detailed per-shot entity schedules from real media, and the EntityMem baseline using persistent per-entity memory achieves the highest character fidelity with Cohen's d of +2.33.
-
KVPO: ODE-Native GRPO for Autoregressive Video Alignment via KV Semantic Exploration
KVPO aligns streaming autoregressive video generators with human preferences via ODE-native GRPO, using KV cache for semantic exploration and TVE for velocity-based policy modeling, yielding gains in quality and alignment.
-
PhyGround: Benchmarking Physical Reasoning in Generative World Models
PhyGround is a new benchmark with curated prompts, a 13-law taxonomy, large-scale human annotations, and an open physics-specialized VLM judge for evaluating physical reasoning in generative video models.
-
WorldJen: An End-to-End Multi-Dimensional Benchmark for Generative Video Models
WorldJen is a new benchmark for generative video models that uses VLM-judged multi-dimensional Likert questionnaires validated against human preferences to achieve perfect tier agreement.
-
Hallo-Live: Real-Time Streaming Joint Audio-Video Avatar Generation with Asynchronous Dual-Stream and Human-Centric Preference Distillation
Hallo-Live achieves 20.38 FPS real-time text-to-audio-video avatar generation with 0.94s latency using asynchronous dual-stream diffusion and HP-DMD preference distillation, matching teacher model quality at 16x highe...
-
RoboWM-Bench: A Benchmark for Evaluating World Models in Robotic Manipulation
RoboWM-Bench evaluates video world models by converting their outputs into executable robot actions and running them on manipulation tasks, showing that physical inconsistencies remain common.
-
RoboWM-Bench: A Benchmark for Evaluating World Models in Robotic Manipulation
RoboWM-Bench evaluates video world models by converting their manipulation video predictions into executable actions validated in simulation, showing that visual plausibility does not guarantee physical executability.
-
AnimationBench: Are Video Models Good at Character-Centric Animation?
AnimationBench is the first benchmark that operationalizes the twelve basic principles of animation and IP preservation into scalable, VLM-assisted metrics for animation-style I2V generation.
-
Grounded Forcing: Bridging Time-Independent Semantics and Proximal Dynamics in Autoregressive Video Synthesis
Grounded Forcing introduces dual memory caching, reference-based positional embeddings, and proximity-weighted recaching to bridge stable semantics with local dynamics, improving long-range consistency in autoregressi...
-
Delta Forcing: Trust Region Steering for Interactive Autoregressive Video Generation
Delta Forcing uses latent trajectory deltas to adaptively limit unreliable teacher guidance while enforcing monotonic continuity, improving temporal consistency in interactive autoregressive video generation.
-
PhyMotion: Structured 3D Motion Reward for Physics-Grounded Human Video Generation
PhyMotion scores generated human videos by grounding recovered 3D poses in a physics simulator across kinematic, contact, and dynamic axes, yielding stronger human correlation and larger RL post-training gains than pr...
-
WorldReasonBench: Human-Aligned Stress Testing of Video Generators as Future World-State Predictors
The paper presents WorldReasonBench, a benchmark that tests video generators on maintaining physical, social, logical, and informational consistency when predicting future states from initial conditions and actions.
-
SARA: Semantically Adaptive Relational Alignment for Video Diffusion Models
SARA improves text alignment and motion quality in video diffusion models by routing token-relation distillation supervision to semantically salient pairs using a Stage-1 aligner trained with SAM masks and InfoNCE.
-
WorldJen: An End-to-End Multi-Dimensional Benchmark for Generative Video Models
WorldJen is a multi-dimensional video generation benchmark using VLM-graded Likert questionnaires on joint prompts, validated to match human three-tier rankings.
-
HuM-Eval: A Coarse-to-Fine Framework for Human-Centric Video Evaluation
HuM-Eval evaluates human motion videos with a coarse-to-fine approach using VLM global checks plus 2D pose and 3D motion analysis, reaching 58.2% average correlation with human judgments and introducing a 1000-prompt ...
-
Seeing Fast and Slow: Learning the Flow of Time in Videos
Self-supervised models learn to perceive and manipulate the flow of time in videos, supporting speed detection, large-scale slow-motion data curation, and temporally controllable video synthesis.
-
Long-CODE: Isolating Pure Long-Context as an Orthogonal Dimension in Video Evaluation
Long-CODE isolates long-context video evaluation with a new benchmark dataset and shot-dynamics metric that correlates better with human judgments on narrative richness and global consistency than short-video metrics.
-
VGA-Bench: A Unified Benchmark and Multi-Model Framework for Video Aesthetics and Generation Quality Evaluation
VGA-Bench creates a three-tier taxonomy, 1,016-prompt dataset of 60k+ videos, and three multi-task neural models (VAQA-Net, VTag-Net, VGQA-Net) that align with human judgments for video aesthetics and generation quality.
-
ImVideoEdit: Image-learning Video Editing via 2D Spatial Difference Attention Blocks
ImVideoEdit learns video editing from 13K image pairs by decoupling spatial modifications from frozen temporal dynamics in pretrained models, matching larger video-trained systems in fidelity and consistency.
-
Diffusion-APO: Trajectory-Aware Direct Preference Alignment for Video Diffusion Transformers
Diffusion-APO synchronizes training noise with inference trajectories in video diffusion models to improve preference alignment and visual quality.
-
LoViF 2026 The First Challenge on Holistic Quality Assessment for 4D World Model (PhyScore)
The PhyScore challenge creates the first benchmark requiring metrics to jointly score video quality, physical realism, condition alignment, and temporal consistency while localizing physical anomalies in 1554 videos f...
-
Phantom: Physics-Infused Video Generation via Joint Modeling of Visual and Latent Physical Dynamics
Phantom generates visually realistic and physically consistent videos by jointly modeling visual content and latent physical dynamics via an abstract physics-aware representation.
-
World Action Models: The Next Frontier in Embodied AI
The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.
-
Mamoda2.5: Enhancing Unified Multimodal Model with DiT-MoE
Mamoda2.5 is a 25B-parameter DiT-MoE unified AR-Diffusion model that reaches top video generation and editing benchmarks with 4-step inference up to 95.9x faster than baselines.
-
Evolution of Video Generative Foundations
This survey traces video generation technology from GANs to diffusion models and then to autoregressive and multimodal approaches while analyzing principles, strengths, and future trends.
Reference graph
Works this paper leans on
- [1] J. H. Liew, H. Yan, J. Zhang, Z. Xu, and J. Feng, "Magicedit: High-fidelity and temporally coherent video editing," arXiv preprint arXiv:2308.14749, 2023.
- [2] W. Chai, X. Guo, G. Wang, and Y. Lu, "Stablevideo: Text-driven consistency-aware diffusion video editing," arXiv preprint arXiv:2308.09592, 2023.
- [3] M. Geyer, O. Bar-Tal, S. Bagon, and T. Dekel, "Tokenflow: Consistent diffusion features for consistent video editing," arXiv preprint arXiv:2307.10373, 2023.
- [4] J. Huang, L. Sigal, K. M. Yi, O. Wang, and J.-Y. Lee, "Inve: Interactive neural video editing," arXiv preprint arXiv:2307.07663, 2023.
- [5] P. Couairon, C. Rambour, J.-E. Haugeard, and N. Thome, "Videdit: Zero-shot and spatially aware text-driven video editing," arXiv preprint arXiv:2306.08707, 2023.
- [6] S. Liu, Y. Zhang, W. Li, Z. Lin, and J. Jia, "Video-p2p: Video editing with cross-attention control," arXiv preprint arXiv:2303.04761, 2023.
- [7] Z. Zhang, B. Li, X. Nie, C. Han, T. Guo, and L. Liu, "Towards consistent video editing with text-to-image diffusion models," arXiv preprint arXiv:2305.17431, 2023.
- [8] M. Zhao, R. Wang, F. Bao, C. Li, and J. Zhu, "Controlvideo: Adding conditional control for one shot text-to-video editing," arXiv preprint arXiv:2305.17098, 2023.
- [9] W. Wang, K. Xie, Z. Liu, H. Chen, Y. Cao, X. Wang, and C. Shen, "Zero-shot video editing using off-the-shelf image diffusion models," arXiv preprint arXiv:2303.17599, 2023.
- [10] D. Ceylan, C.-H. P. Huang, and N. J. Mitra, "Pix2video: Video editing using image diffusion," in ICCV, 2023.
- [11] C. Qi, X. Cun, Y. Zhang, C. Lei, X. Wang, Y. Shan, and Q. Chen, "Fatezero: Fusing attentions for zero-shot text-based video editing," arXiv preprint arXiv:2303.09535, 2023.
- [12] Y.-C. Lee, J.-Z. G. J. Jang, Y.-T. Chen, E. Qiu, and J.-B. Huang, "Shape-aware text-driven layered video editing demo," arXiv preprint arXiv:2301.13173, 2023.
- [13] Y. Zhao, E. Xie, L. Hong, Z. Li, and G. H. Lee, "Make-a-protagonist: Generic video editing with an ensemble of experts," arXiv preprint arXiv:2305.08850, 2023.
- [14] X. Yang, L. Zhu, H. Fan, and Y. Yang, "Videograin: Modulating space-time attention for multi-grained video editing," in ICLR, 2025.
- [15] N. Kumari, B. Zhang, R. Zhang, E. Shechtman, and J.-Y. Zhu, "Multi-concept customization of text-to-image diffusion," arXiv preprint arXiv:2212.04488, 2022.
- [16] J. Karras, A. Holynski, T.-C. Wang, and I. Kemelmacher-Shlizerman, "Dreampose: Fashion image-to-video synthesis via stable diffusion," arXiv preprint arXiv:2304.06025, 2023.
- [17] Y. He, M. Xia, H. Chen, X. Cun, Y. Gong, J. Xing, Y. Zhang, X. Wang, C. Weng, Y. Shan et al., "Animate-a-story: Storytelling with retrieval-augmented video generation," arXiv preprint arXiv:2307.06940, 2023.
- [18] Y. Guo, C. Yang, A. Rao, Y. Wang, Y. Qiao, D. Lin, and B. Dai, "Animatediff: Animate your personalized text-to-image diffusion models without specific tuning," in ICLR, 2024.
- [19] J. Xing, M. Xia, Y. Zhang, H. Chen, X. Wang, T.-T. Wong, and Y. Shan, "Dynamicrafter: Animating open-domain images with video diffusion priors," arXiv preprint arXiv:2310.12190, 2023.
- [20] N. Agarwal, A. Ali, M. Bala, Y. Balaji, E. Barker, T. Cai, P. Chattopadhyay, Y. Chen, Y. Cui, Y. Ding et al., "Cosmos world foundation model platform for physical AI," arXiv preprint arXiv:2501.03575, 2025.
- [21] Y. Wang, X. Chen, X. Ma, S. Zhou, Z. Huang, Y. Wang, C. Yang, Y. He, J. Yu, P. Yang et al., "Lavie: High-quality video generation with cascaded latent diffusion models," arXiv preprint arXiv:2309.15103, 2023.
- [22] J. Wang, H. Yuan, D. Chen, Y. Zhang, X. Wang, and S. Zhang, "Modelscope text-to-video technical report," arXiv preprint arXiv:2308.06571, 2023.
- [23] H. Chen, M. Xia, Y. He, Y. Zhang, X. Cun, S. Yang, J. Xing, Y. Liu, Q. Chen, X. Wang, C. Weng, and Y. Shan, "Videocrafter1: Open diffusion models for high-quality video generation," arXiv preprint arXiv:2310.19512, 2023.
- [24] W. Hong, M. Ding, W. Zheng, X. Liu, and J. Tang, "CogVideo: Large-scale pretraining for text-to-video generation via transformers," arXiv preprint arXiv:2205.15868, 2022.
- [25] Z. Huang, Y. He, J. Yu, F. Zhang, C. Si, Y. Jiang, Y. Zhang, T. Wu, Q. Jin, N. Chanpaisit et al., "Vbench: Comprehensive benchmark suite for video generative models," in CVPR, 2024.
- [26] Z. Huang, F. Zhang, X. Xu, Y. He, J. Yu, Z. Dong, Q. Ma, N. Chanpaisit, C. Si, Y. Jiang, Y. Wang, X. Chen, Y.-C. Chen, L. Wang, D. Lin, Y. Qiao, and Z. Liu, "Vbench++: Comprehensive and versatile benchmark suite for video generative models," arXiv preprint arXiv:2411.13503, 2024.
- [27] Y. Liu, X. Cun, X. Liu, X. Wang, Y. Zhang, H. Chen, Y. Liu, T. Zeng, R. Chan, and Y. Shan, "Evalcrafter: Benchmarking and evaluating large video generation models," in CVPR, 2024.
- [28]
- [29] K. Team, "Kling," 2024. [Online]. Available: https://klingai.kuaishou.com/ (accessed December 9, 2024).
- [30] Runway, "Gen-3," 2024. [Online]. Available: https://runwayml.com/research/introducing-gen-3-alpha (accessed June 17, 2024).
- [31] T. Team, "Hunyuanvideo: A systematic framework for large video generative models," 2024.
- [32] G. Team, "Veo2," 2025. [Online]. Available: https://deepmind.google/technologies/veo/veo-2/ (accessed December 18, 2024).
- [33] J. Sohl-Dickstein, E. Weiss, N. Maheswaranathan, and S. Ganguli, "Deep unsupervised learning using nonequilibrium thermodynamics," in ICML, 2015.
- [34] Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole, "Score-based generative modeling through stochastic differential equations," in ICLR, 2021.
- [35] J. Ho, A. Jain, and P. Abbeel, "Denoising diffusion probabilistic models," in NeurIPS, 2020.
- [36] J. Song, C. Meng, and S. Ermon, "Denoising diffusion implicit models," in ICLR, 2021.
- [37] L. Zhang and M. Agrawala, "Adding conditional control to text-to-image diffusion models," arXiv preprint arXiv:2302.05543, 2023.
- [38] A. Blattmann, T. Dockhorn, S. Kulal, D. Mendelevitch, M. Kilian, D. Lorenz, Y. Levi, Z. English, V. Voleti, A. Letts et al., "Stable video diffusion: Scaling latent video diffusion models to large datasets," arXiv preprint arXiv:2311.15127, 2023.
- [39] P. Esser, S. Kulal, A. Blattmann, R. Entezari, J. Müller, H. Saini, Y. Levi, D. Lorenz, A. Sauer, F. Boesel et al., "Scaling rectified flow transformers for high-resolution image synthesis," in ICML, 2024.
- [40] C. Mou, X. Wang, L. Xie, J. Zhang, Z. Qi, Y. Shan, and X. Qie, "T2i-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models," arXiv preprint arXiv:2302.08453, 2023.
- [41] Z. Huang, K. C. Chan, Y. Jiang, and Z. Liu, "Collaborative diffusion for multi-modal face generation and editing," in CVPR, 2023.
- [42] M. Ding, Z. Yang, W. Hong, W. Zheng, C. Zhou, D. Yin, J. Lin, X. Zou, Z. Shao, H. Yang et al., "CogView: Mastering text-to-image generation via transformers," in NeurIPS, 2021.
- [43] M. Ding, W. Zheng, W. Hong, and J. Tang, "Cogview2: Faster and better text-to-image generation via hierarchical transformers," in NeurIPS, 2022.
- [44] J. Ho, W. Chan, C. Saharia, J. Whang, R. Gao, A. Gritsenko, D. P. Kingma, B. Poole, M. Norouzi, D. J. Fleet et al., "Imagen video: High definition video generation with diffusion models," arXiv preprint arXiv:2210.02303, 2022.
- [45] D. P. Kingma and M. Welling, "Auto-encoding variational bayes," arXiv preprint arXiv:1312.6114, 2013.
- [46] A. Van Den Oord, O. Vinyals et al., "Neural discrete representation learning," in NeurIPS, 2017.
- [47] P. Esser, R. Rombach, and B. Ommer, "Taming transformers for high-resolution image synthesis," in CVPR, 2021.
- [48] D. Podell, Z. English, K. Lacey, A. Blattmann, T. Dockhorn, J. Müller, J. Penna, and R. Rombach, "Sdxl: Improving latent diffusion models for high-resolution image synthesis," arXiv preprint arXiv:2307.01952, 2023.
- [49] L. Yu, Y. Cheng, K. Sohn, J. Lezama, H. Zhang, H. Chang, A. G. Hauptmann, M.-H. Yang, Y. Hao, I. Essa et al., "Magvit: Masked generative video transformer," in CVPR, 2023.
- [50] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly et al., "An image is worth 16x16 words: Transformers for image recognition at scale," arXiv preprint arXiv:2010.11929, 2020.
- [51] W. Peebles and S. Xie, "Scalable diffusion models with transformers," arXiv preprint arXiv:2212.09748, 2022.
- [52] Z. Luo, D. Chen, Y. Zhang, Y. Huang, L. Wang, Y. Shen, D. Zhao, J. Zhou, and T. Tan, "Videofusion: Decomposed diffusion models for high-quality video generation," in CVPR, 2023.
- [53] Y. He, T. Yang, Y. Zhang, Y. Shan, and Q. Chen, "Latent video diffusion models for high-fidelity video generation with arbitrary lengths," arXiv preprint arXiv:2211.13221, 2022.
- [54] D. Zhou, W. Wang, H. Yan, W. Lv, Y. Zhu, and J. Feng, "Magicvideo: Efficient video generation with latent diffusion models," arXiv preprint arXiv:2211.11018, 2023.
- [55] D. J. Zhang, J. Z. Wu, J.-W. Liu, R. Zhao, L. Ran, Y. Gu, D. Gao, and M. Z. Shou, "Show-1: Marrying pixel and latent diffusion models for text-to-video generation," arXiv preprint arXiv:2309.15818, 2023.
- [56] S. Ge, S. Nah, G. Liu, T. Poon, A. Tao, B. Catanzaro, D. Jacobs, J.-B. Huang, M.-Y. Liu, and Y. Balaji, "Preserve your own correlation: A noise prior for video diffusion models," in ICCV, 2023.
- [57] A. Blattmann, R. Rombach, H. Ling, T. Dockhorn, S. W. Kim, S. Fidler, and K. Kreis, "Align your latents: High-resolution video synthesis with latent diffusion models," in CVPR, 2023.
- [58] L. Khachatryan, A. Movsisyan, V. Tadevosyan, R. Henschel, Z. Wang, S. Navasardyan, and H. Shi, "Text2video-zero: Text-to-image diffusion models are zero-shot video generators," arXiv preprint arXiv:2303.13439, 2023.
- [59] Z. Yang, J. Teng, W. Zheng, M. Ding, S. Huang, J. Xu, Y. Yang, W. Hong, X. Zhang, G. Feng et al., "Cogvideox: Text-to-video diffusion models with an expert transformer," arXiv preprint arXiv:2408.06072, 2024.
- [60] A. Polyak, A. Zohar, A. Brown, A. Tjandra, A. Sinha, A. Lee, A. Vyas, B. Shi, C.-Y. Ma, C.-Y. Chuang, D. Yan, D. Choudhary, D. Wang, G. Sethi, G. Pang, H. Ma, I. Misra, J. Hou, J. Wang, K. Jagadeesh, K. Li, L. Zhang, M. Singh, M. Williamson, M. Le, M. Yu, M. K. Singh, P. Zhang, P. Vajda, Q. Duval, R. Girdhar, R. Sumbaly, S. S. Rambhatla, S. Tsai, S. Aza... et al., "Movie Gen: A Cast of Media Foundation Models."
- [61] W. Team, "Wan: Open and advanced large-scale video generative models," 2025.
- [62] S. Team, 2025. [Online]. Available: https://arxiv.org/abs/2502.10248
- [63] M. Team, "Minmax," 2023. [Online]. Available: https://hailuoai.com/ (accessed August 31, 2024).
- [64] W. Fan, C. Si, J. Song, Z. Yang, Y. He, L. Zhuo, Z. Huang, Z. Dong, J. He, D. Pan et al., "Vchitect-2.0: Parallel transformer for scaling up video diffusion models," arXiv preprint arXiv:2501.08453, 2025.
- [65] C. Si, W. Fan, Z. Lv, Z. Huang, Y. Qiao, and Z. Liu, "Repvideo: Rethinking cross-layer representation for video generation," arXiv preprint arXiv:2501.08994, 2025.
- [66] X. Peng, Z. Zheng, C. Shen, T. Young, X. Guo, B. Wang, H. Xu, H. Liu, M. Jiang, W. Li, Y. Wang, A. Ye, G. Ren, Q. Ma, W. Liang, X. Lian, X. Wu, Y. Zhong, Z. Li, C. Gong, G. Lei, L. Cheng, L. Zhang, M. Li, R. Zhang, S. Hu, S. Huang, X. Wang, Y. Zhao, Y. Wang, Z. Wei, and Y. You, "Open-sora 2.0: Training a commercial-level video generation model in $200k," ...
- [67] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter, "GANs trained by a two time-scale update rule converge to a local nash equilibrium," in NeurIPS, 2017.
- [68] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, X. Chen, and X. Chen, "Improved techniques for training gans," in NeurIPS, 2016.
- [69] T. Unterthiner, S. van Steenkiste, K. Kurach, R. Marinier, M. Michalski, and S. Gelly, "FVD: A new metric for video generation," in ICLRW, 2019.
- [70] Y. Liu, L. Li, S. Ren, R. Gao, S. Li, S. Chen, X. Sun, and L. Hou, "Fetv: A benchmark for fine-grained evaluation of open-domain text-to-video generation," in NeurIPS, 2023.
- [71] F. Zhang, S. Tian, Z. Huang, Y. Qiao, and Z. Liu, "Evaluation agent: Efficient and promptable evaluation framework for visual generative models," arXiv preprint arXiv:2412.09645, 2024.
- [72] F. Meng, J. Liao, X. Tan, W. Shao, Q. Lu, K. Zhang, Y. Cheng, D. Li, Y. Qiao, and P. Luo, "Towards world simulator: Crafting physical commonsense-based benchmark for video generation," arXiv preprint arXiv:2410.05363, 2024.
- [73] K. Sun, K. Huang, X. Liu, Y. Wu, Z. Xu, Z. Li, and X. Liu, "T2v-compbench: A comprehensive benchmark for compositional text-to-video generation," arXiv preprint arXiv:2407.14505, 2024.
- [74] Y. Wang, X. He, K. Wang, L. Ma, J. Yang, S. Wang, S. S. Du, and Y. Shen, "Is your world simulator a good story presenter? A consecutive events-based benchmark for future long video generation," arXiv preprint arXiv:2412.16211, 2024.
- [75] Y. Zhang, J. Wu, W. Li, B. Li, Z. Ma, Z. Liu, and C. Li, "LLaVA-Video: Video instruction tuning with synthetic data," arXiv preprint arXiv:2410.02713, 2024.
- [76] A. Yang, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Li, D. Liu, F. Huang, H. Wei et al., "Qwen2.5 technical report," arXiv preprint arXiv:2412.15115, 2024.
- [77] Z. Xie, Z. Zhang, Y. Cao, Y. Lin, J. Bao, Z. Yao, Q. Dai, and H. Hu, "Simmim: A simple framework for masked image modeling," in CVPR, 2022, pp. 9653–9663.
- [78] T. Cheng, L. Song, Y. Ge, W. Liu, X. Wang, and Y. Shan, "Yolo-world: Real-time open-vocabulary object detection," in CVPR, 2024, pp. 16901–16911.
- [79] G. Fang, W. Yan, Y. Guo, J. Han, Z. Jiang, H. Xu, S. Liao, and X. Liang, "Humanrefiner: Benchmarking abnormal human generation and refining with coarse-to-fine pose-reversible guidance," in ECCV. Springer, 2024, pp. 201–217.
- [80] J. Deng, J. Guo, N. Xue, and S. Zafeiriou, "Arcface: Additive angular margin loss for deep face recognition," in CVPR, 2019.