
arxiv: 2410.05363 · v1 · pith:IWZ5ZRQF · submitted 2024-10-07 · 💻 cs.CV

Towards World Simulator: Crafting Physical Commonsense-Based Benchmark for Video Generation

Pith reviewed 2026-05-18 14:35 UTC · model grok-4.3

classification 💻 cs.CV
keywords text-to-video · physical commonsense · benchmark · intuitive physics · evaluation framework · video generation · world simulator

The pith

Text-to-video models fail to generate videos that follow basic physical laws, and scaling or prompt tweaks do not fix the gaps.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces PhyGenBench, a benchmark of 160 prompts that test 27 distinct physical laws across four domains, to check whether text-to-video models understand intuitive physics. It pairs the benchmark with PhyGenEval, a hierarchical system that chains vision-language models and large language models to score how well generated videos obey physical rules. Large-scale tests on current models show frequent violations, especially in scenes with motion or change. The results indicate that simply making models larger or rewriting prompts leaves these failures intact. The work frames physical commonsense as a necessary step toward building reliable world simulators.

Core claim

The paper shows that existing text-to-video models struggle to produce videos consistent with physical commonsense. PhyGenBench supplies 160 prompts spanning 27 laws in four domains, while PhyGenEval uses a staged pipeline of off-the-shelf vision-language and language models to produce scores that match human judgments of correctness. The evaluation finds clear shortcomings in dynamic cases that persist even after model scaling or prompt engineering.

What carries the argument

PhyGenBench supplies the 160 prompts across 27 physical laws, and PhyGenEval supplies the hierarchical VLM-LLM pipeline that scores physical correctness in generated videos.
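
A staged evaluation of this kind can be pictured as a list of judges run in order, each returning a score that later stages may consult. The sketch below is illustrative only: the stage names, the judge signature, the averaging rule, the example prompt, and the file name are all assumptions made here, not details of PhyGenEval, and the judges are fixed-value placeholders where a real system would call a vision-language or language model.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict, List

# Hypothetical judge signature: (prompt, video_path, scores_from_earlier_stages)
# -> score in [0, 1]. A real judge would wrap a VLM or LLM call.
Judge = Callable[[str, str, Dict[str, float]], float]


@dataclass
class Stage:
    name: str
    judge: Judge


def run_pipeline(prompt: str, video_path: str, stages: List[Stage],
                 combine: Callable[[Dict[str, float]], float]) -> float:
    """Run the stages in order, then combine their scores into one number."""
    scores: Dict[str, float] = {}
    for stage in stages:
        # Each stage receives a copy of the earlier scores so it can use them.
        score = stage.judge(prompt, video_path, dict(scores))
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"stage {stage.name!r} returned {score}, expected [0, 1]")
        scores[stage.name] = score
    return combine(scores)


# Placeholder judges with fixed outputs (assumed names, not from the paper).
def frame_level_judge(prompt: str, video_path: str, earlier: Dict[str, float]) -> float:
    return 0.5


def video_level_judge(prompt: str, video_path: str, earlier: Dict[str, float]) -> float:
    return 1.0


stages = [Stage("frame_level", frame_level_judge),
          Stage("video_level", video_level_judge)]
final_score = run_pipeline("A glass of water is tipped over.", "video.mp4", stages,
                           combine=lambda s: mean(list(s.values())))
```

Keeping the combination rule as a separate argument makes it easy to compare, say, a plain average against a rule that weights later stages more heavily, without touching the judges themselves.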

If this is right

  • Text-to-video models need explicit mechanisms to learn physical rules rather than relying on scale alone.
  • Automated evaluation frameworks allow repeated testing across many models without repeated human review.
  • Progress toward world simulators requires benchmarks that isolate physical commonsense from general visual quality.
  • Dynamic scenarios remain the hardest category for current models to handle correctly.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Success on this benchmark could transfer to better performance in downstream tasks such as robotic simulation or planning.
  • The same prompt-and-evaluation structure might be reused to test physical understanding in image or 3D generation models.
  • If models close the gaps, generated videos could serve as more trustworthy training data for physics-aware AI systems.

Load-bearing premise

The off-the-shelf vision-language and language models in the evaluation pipeline judge physical correctness in videos the same way humans would.

What would settle it

A side-by-side human rating of the same set of generated videos that shows low agreement with the automated PhyGenEval scores would show the evaluation framework does not track human judgment.
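
One common way to quantify such agreement is a rank correlation between human ratings and automated scores for the same videos. The following is a minimal sketch using only the Python standard library; the rating scale and the example numbers are made up for illustration and are not data from the paper.

```python
from statistics import mean
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    if len(x) != len(y):
        raise ValueError("inputs must have the same length")
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical data: one human rating and one automated score per video.
human_ratings = [3, 1, 2, 0, 3, 1]
automated_scores = [0.9, 0.4, 0.5, 0.1, 0.8, 0.6]
rho = spearman(human_ratings, automated_scores)
```

A value of `rho` near 1 would indicate that the automated scores order the videos much as the human raters do; a value near 0 would indicate that they do not.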

Original abstract

Text-to-video (T2V) models like Sora have made significant strides in visualizing complex prompts, which is increasingly viewed as a promising path towards constructing the universal world simulator. Cognitive psychologists believe that the foundation for achieving this goal is the ability to understand intuitive physics. However, the capacity of these models to accurately represent intuitive physics remains largely unexplored. To bridge this gap, we introduce PhyGenBench, a comprehensive Physics Generation Benchmark designed to evaluate physical commonsense correctness in T2V generation. PhyGenBench comprises 160 carefully crafted prompts across 27 distinct physical laws, spanning four fundamental domains, which can comprehensively assess models' understanding of physical commonsense. Alongside PhyGenBench, we propose a novel evaluation framework called PhyGenEval. This framework employs a hierarchical evaluation structure utilizing appropriate advanced vision-language models and large language models to assess physical commonsense. Through PhyGenBench and PhyGenEval, we can conduct large-scale automated assessments of T2V models' understanding of physical commonsense, which align closely with human feedback. Our evaluation results and in-depth analysis demonstrate that current models struggle to generate videos that comply with physical commonsense. Moreover, simply scaling up models or employing prompt engineering techniques is insufficient to fully address the challenges presented by PhyGenBench (e.g., dynamic scenarios). We hope this study will inspire the community to prioritize the learning of physical commonsense in these models beyond entertainment applications. We will release the data and code at https://github.com/OpenGVLab/PhyGenBench

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper presents PhyGenBench, a benchmark consisting of 160 human-crafted prompts spanning 27 physical laws across four domains, together with PhyGenEval, a hierarchical pipeline that uses off-the-shelf VLMs and LLMs to score generated videos for compliance with physical commonsense. Using this framework the authors evaluate current text-to-video models, report that they fail to respect intuitive physics (especially in dynamic scenarios), and conclude that neither model scaling nor prompt engineering suffices to close the gap.

Significance. If the automated PhyGenEval scores can be shown to track human physical-correctness judgments, the benchmark would provide a concrete, reproducible testbed that highlights a genuine limitation in current T2V systems and could steer research toward explicit physical modeling rather than pure scaling. The planned public release of prompts, code, and evaluation pipeline is a clear strength that supports future work.

major comments (1)
  1. [Abstract and Evaluation Framework] Abstract and Evaluation Framework section: the central claim that current models 'struggle' and that 'scaling up models or employing prompt engineering techniques is insufficient' rests entirely on PhyGenEval scores. The manuscript asserts these scores 'align closely with human feedback' yet reports no correlation coefficients, inter-rater agreement statistics (e.g., Cohen’s κ or Krippendorff’s α), or protocol details for the human study. Without these numbers it is impossible to assess whether the automated pipeline systematically over- or under-penalizes dynamic scenarios, weakening the evidence for the headline conclusions.
minor comments (2)
  1. [Benchmark Construction] The description of the 27 laws and their assignment to the four domains would benefit from an explicit table or appendix listing each law with one representative prompt and the precise physical principle being tested.
  2. [Results] Figure captions and axis labels in the results section should explicitly state the number of videos evaluated per model and whether error bars reflect prompt-level or video-level variance.
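
The distinction in the second minor comment matters because the two choices can give error bars of quite different sizes. The sketch below, under the assumption that each prompt has several generated videos with one score each, computes the standard error of the mean both ways; the data layout and function names are hypothetical.

```python
from statistics import mean, stdev
from typing import Dict, List


def video_level_sem(scores_by_prompt: Dict[str, List[float]]) -> float:
    """Standard error treating every video as an independent sample."""
    all_scores = [s for scores in scores_by_prompt.values() for s in scores]
    return stdev(all_scores) / len(all_scores) ** 0.5


def prompt_level_sem(scores_by_prompt: Dict[str, List[float]]) -> float:
    """Standard error treating each prompt's mean score as one sample."""
    prompt_means = [mean(scores) for scores in scores_by_prompt.values()]
    return stdev(prompt_means) / len(prompt_means) ** 0.5


# Hypothetical scores: two prompts, two videos each.
scores = {"prompt_a": [1.0, 1.0], "prompt_b": [3.0, 3.0]}
video_sem = video_level_sem(scores)
prompt_sem = prompt_level_sem(scores)
```

In this toy case all the variation is between prompts, so pooling videos understates the uncertainty relative to the prompt-level figure, which is why a caption should say which one is plotted.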

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback and for recognizing the potential value of PhyGenBench and PhyGenEval as a testbed for physical commonsense in text-to-video models. We agree that quantitative validation of the automated scores against human judgments is essential to support the central claims and will strengthen the manuscript accordingly.

point-by-point responses
  1. Referee: [Abstract and Evaluation Framework] Abstract and Evaluation Framework section: the central claim that current models 'struggle' and that 'scaling up models or employing prompt engineering techniques is insufficient' rests entirely on PhyGenEval scores. The manuscript asserts these scores 'align closely with human feedback' yet reports no correlation coefficients, inter-rater agreement statistics (e.g., Cohen’s κ or Krippendorff’s α), or protocol details for the human study. Without these numbers it is impossible to assess whether the automated pipeline systematically over- or under-penalizes dynamic scenarios, weakening the evidence for the headline conclusions.

    Authors: We appreciate this observation and agree that the absence of quantitative validation metrics limits the strength of the evidence. The manuscript currently states that PhyGenEval scores align closely with human feedback but does not report correlation coefficients, inter-rater agreement, or full protocol details. In the revised version we will expand the Evaluation Framework section with a dedicated subsection on human validation. This will include: (1) the full study protocol (number of raters, their background, instructions provided, rating scale, and video presentation method); (2) inter-rater agreement statistics such as Cohen’s κ and Krippendorff’s α; and (3) correlation coefficients (Pearson and Spearman) between PhyGenEval scores and human ratings, with separate analysis for dynamic scenarios. We will add corresponding tables and discussion of any observed discrepancies. These revisions will directly address the concern and better support the claims about model limitations.
    revision: yes
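
For readers unfamiliar with the inter-rater statistic named above, Cohen's κ compares the observed agreement between two raters with the agreement expected by chance from each rater's label frequencies: κ = (p_o − p_e) / (1 − p_e). The following is a minimal standard-library sketch for two raters and categorical labels; the example labels are invented and do not come from any human study in the paper.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: fraction of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' label frequencies, summed over labels.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / n ** 2
    if expected == 1.0:
        raise ValueError("kappa is undefined when both raters use one identical label")
    return (observed - expected) / (1 - expected)


# Hypothetical labels: 1 = physically plausible, 0 = implausible.
rater_1 = [1, 1, 0, 0, 1, 0]
rater_2 = [1, 0, 0, 0, 1, 1]
kappa = cohens_kappa(rater_1, rater_2)
```

A κ of 1 means perfect agreement, 0 means agreement no better than chance, and negative values mean the raters disagree more often than chance would predict.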

Circularity Check

0 steps flagged

No significant circularity; benchmark and evaluation are externally constructed

full rationale

The paper defines PhyGenBench as 160 human-crafted prompts spanning 27 physical laws and introduces PhyGenEval as a hierarchical pipeline that applies off-the-shelf VLMs and LLMs to score generated videos. The central claim that T2V models fail physical commonsense is measured by applying this external pipeline to the prompts; neither the prompts nor the scoring rules are derived from model outputs or fitted parameters. No equations, self-citations, or uniqueness theorems are invoked in a load-bearing manner, and the evaluation framework remains independent of the tested models. The asserted alignment with human feedback, while lacking detailed quantitative metrics in the provided text, does not reduce any derivation to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The benchmark rests on the assumption that the selected 27 laws are representative of intuitive physics and that LLM-based scoring faithfully measures compliance; no free parameters or invented physical entities are introduced.

axioms (2)
  • domain assumption The 27 physical laws chosen across four domains comprehensively represent the intuitive physics needed for world simulation.
    Stated in the abstract as the scope of PhyGenBench; no justification or coverage proof is supplied in the abstract.
  • domain assumption Advanced vision-language models and large language models can be chained to produce physical-correctness scores that align with human feedback.
    Central to PhyGenEval; alignment is asserted but not demonstrated with quantitative human correlation numbers in the abstract.

pith-pipeline@v0.9.0 · 5848 in / 1403 out tokens · 37187 ms · 2026-05-18T14:35:12.714770+00:00 · methodology


Forward citations

Cited by 17 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. PhyGround: Benchmarking Physical Reasoning in Generative World Models

    cs.CV 2026-05 accept novelty 7.0

    PhyGround is a new benchmark with curated prompts, a 13-law taxonomy, large-scale human annotations, and an open physics-specialized VLM judge for evaluating physical reasoning in generative video models.

  2. Do Joint Audio-Video Generation Models Understand Physics?

    cs.SD 2026-05 unverdicted novelty 7.0

    Current joint audio-video generation models lack robust physical commonsense, especially during transitions and when prompted for impossible behaviors.

  3. AnimationBench: Are Video Models Good at Character-Centric Animation?

    cs.CV 2026-04 unverdicted novelty 7.0

    AnimationBench is the first benchmark that operationalizes the twelve basic principles of animation and IP preservation into scalable, VLM-assisted metrics for animation-style I2V generation.

  4. Do-Undo Bench: Reversibility for Action Understanding in Image Generation

    cs.CV 2025-12 unverdicted novelty 7.0

    Do-Undo Bench is a new evaluation task and dataset that forces models to simulate forward action effects and then undo them to measure genuine action understanding in image generation.

  5. VideoASMR-Bench: Can AI-Generated ASMR Videos Fool VLMs and Humans?

    cs.CV 2025-12 unverdicted novelty 7.0

    VideoASMR-Bench shows state-of-the-art VLMs fail to reliably detect AI-generated ASMR videos from real ones, though humans can still identify the fakes relatively easily.

  6. Quantitative Video World Model Evaluation for Geometric-Consistency

    cs.CV 2026-05 unverdicted novelty 6.0

    PDI-Bench computes 3D projective residuals from segmented and tracked points to quantify geometric inconsistency in AI-generated videos.

  7. PhyMotion: Structured 3D Motion Reward for Physics-Grounded Human Video Generation

    cs.CV 2026-05 conditional novelty 6.0

    PhyMotion scores generated human videos by grounding recovered 3D poses in a physics simulator across kinematic, contact, and dynamic axes, yielding stronger human correlation and larger RL post-training gains than pr...

  8. WorldReasonBench: Human-Aligned Stress Testing of Video Generators as Future World-State Predictors

    cs.CV 2026-05 unverdicted novelty 6.0

    The paper presents WorldReasonBench, a benchmark that tests video generators on maintaining physical, social, logical, and informational consistency when predicting future states from initial conditions and actions.

  9. How Far Are Video Models from True Multimodal Reasoning?

    cs.CV 2026-04 unverdicted novelty 6.0

    Current video models succeed on basic understanding but achieve under 25% success on logically grounded generation and near 0% on interactive generation, exposing gaps in multimodal reasoning.

  10. SceneScribe-1M: A Large-Scale Video Dataset with Comprehensive Geometric and Semantic Annotations

    cs.CV 2026-04 unverdicted novelty 6.0

    SceneScribe-1M is a new dataset of 1 million videos with semantic text, camera parameters, dense depth, and consistent 3D point tracks to support monocular depth estimation, scene reconstruction, point tracking, and t...

  11. CellFluxRL: Biologically-Constrained Virtual Cell Modeling via Reinforcement Learning

    cs.LG 2026-03 unverdicted novelty 6.0

    CellFluxRL post-trains the CellFlux generative model with reinforcement learning driven by biologically meaningful reward functions, yielding virtual cell images that better satisfy physical and biological constraints...

  12. RAPO++: Cross-Stage Prompt Optimization for Text-to-Video Generation via Data Alignment and Test-Time Scaling

    cs.CV 2025-10 unverdicted novelty 6.0

    RAPO++ is a three-stage prompt optimization framework combining retrieval-augmented refinement, closed-loop test-time scaling, and LLM fine-tuning to enhance text-to-video generation quality.

  13. Enhancing Physical Plausibility in Video Generation by Reasoning the Implausibility

    cs.CV 2025-09 unverdicted novelty 6.0

    A training-free framework uses physics-violating counterfactual prompts and Synchronized Decoupled Guidance to suppress implausible motions in diffusion-based video generation while preserving photorealism.

  14. Genie Envisioner: A Unified World Foundation Platform for Robotic Manipulation

    cs.RO 2025-08 unverdicted novelty 6.0

    Genie Envisioner unifies robotic policy learning, simulation, and evaluation inside one instruction-conditioned video diffusion framework using GE-Base, GE-Act, and GE-Sim.

  15. MAGI-1: Autoregressive Video Generation at Scale

    cs.CV 2025-05 unverdicted novelty 6.0

    MAGI-1 is a 24B-parameter autoregressive video world model that predicts denoised frame chunks sequentially with increasing noise to enable causal, scalable, streaming generation up to 4M token contexts.

  16. VBench-2.0: Advancing Video Generation Benchmark Suite for Intrinsic Faithfulness

    cs.CV 2025-03 accept novelty 6.0

    VBench-2.0 is a benchmark suite that automatically evaluates video generative models on five dimensions of intrinsic faithfulness: Human Fidelity, Controllability, Creativity, Physics, and Commonsense using VLMs, LLMs...

  17. World Action Models: The Next Frontier in Embodied AI

    cs.RO 2026-05 unverdicted novelty 4.0

    The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.
