Towards World Simulator: Crafting Physical Commonsense-Based Benchmark for Video Generation

arXiv: 2410.05363 · v1 · pith:IWZ5ZRQF · submitted 2024-10-07 · 💻 cs.CV

Fanqing Meng, Jiaqi Liao, Xinyu Tan, Wenqi Shao, Quanfeng Lu, Kaipeng Zhang, Yu Cheng, Dianqi Li, Yu Qiao, Ping Luo

Pith reviewed 2026-05-18 14:35 UTC · model grok-4.3

classification 💻 cs.CV

keywords text-to-video · physical commonsense · benchmark · intuitive physics · evaluation framework · video generation · world simulator

The pith

Text-to-video models fail to generate videos that follow basic physical laws, and scaling or prompt tweaks do not fix the gaps.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces PhyGenBench, a benchmark of 160 prompts that test 27 distinct physical laws across four domains, to check whether text-to-video models understand intuitive physics. It pairs the benchmark with PhyGenEval, a hierarchical system that chains vision-language models and large language models to score how well generated videos obey physical rules. Large-scale tests on current models show frequent violations, especially in scenes with motion or change. The results indicate that simply making models larger or rewriting prompts leaves these failures intact. The work frames physical commonsense as a necessary step toward building reliable world simulators.

Core claim

The paper shows that existing text-to-video models struggle to produce videos consistent with physical commonsense. PhyGenBench supplies 160 prompts spanning 27 laws in four domains, while PhyGenEval uses a staged pipeline of off-the-shelf vision-language and language models to produce scores that match human judgments of correctness. The evaluation finds clear shortcomings in dynamic cases that persist even after model scaling or prompt engineering.

What carries the argument

PhyGenBench supplies the 160 prompts across 27 physical laws, and PhyGenEval supplies the hierarchical VLM-LLM pipeline that scores physical correctness in generated videos.
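
As a rough illustration of what a staged scorer of this kind can look like, the sketch below chains two stub judges and stops early when a stage fails. Everything in it is hypothetical: the `Stage` record, the stage names, the 0-to-1 scale, the gate thresholds, and the averaging rule are assumptions made for illustration and are not taken from PhyGenEval, whose actual stages are not described on this page. In practice each judge would wrap a call to a vision-language or language model.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

# A judge maps (prompt, video_id) to a score in [0, 1].
# Hypothetical interface: a real judge would call a VLM or LLM.
Judge = Callable[[str, str], float]


@dataclass
class Stage:
    name: str
    judge: Judge
    gate: float  # minimum score needed to continue to the next stage


def hierarchical_score(prompt: str, video_id: str, stages: Sequence[Stage]) -> float:
    """Run stages in order and stop when a stage scores below its gate.

    The result is the mean over all stages, with stages that were never
    reached counted as 0 (an assumed aggregation rule).
    """
    scores = []
    for stage in stages:
        score = stage.judge(prompt, video_id)
        scores.append(score)
        if score < stage.gate:
            break
    return sum(scores) / len(stages)


# Stub judges standing in for model calls (illustrative values only).
def stub_scene_check(prompt: str, video_id: str) -> float:
    return 0.2 if video_id == "bad_scene" else 1.0


def stub_physics_check(prompt: str, video_id: str) -> float:
    return 0.5


STAGES = [
    Stage("scene_matches_prompt", stub_scene_check, gate=0.5),
    Stage("physics_plausible", stub_physics_check, gate=0.0),
]

print(hierarchical_score("A ball is dropped onto sand.", "ok", STAGES))
print(hierarchical_score("A ball is dropped onto sand.", "bad_scene", STAGES))
```

The early exit reflects one possible design choice: if the video does not even depict the prompted scene, asking a second model about its physics adds cost without adding signal.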

If this is right

Text-to-video models need explicit mechanisms to learn physical rules rather than relying on scale alone.
Automated evaluation frameworks allow repeated testing across many models without repeated human review.
Progress toward world simulators requires benchmarks that isolate physical commonsense from general visual quality.
Dynamic scenarios remain the hardest category for current models to handle correctly.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Success on this benchmark could transfer to better performance in downstream tasks such as robotic simulation or planning.
The same prompt-and-evaluation structure might be reused to test physical understanding in image or 3D generation models.
If models close the gaps, generated videos could serve as more trustworthy training data for physics-aware AI systems.

Load-bearing premise

The off-the-shelf vision-language and language models in the evaluation pipeline judge physical correctness in videos the same way humans would.

What would settle it

A side-by-side human rating of the same set of generated videos that shows low agreement with the automated PhyGenEval scores would show the evaluation framework does not track human judgment.
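
One way to run such a comparison is a rank correlation between human ratings and automated scores for the same videos. The following is a minimal sketch in plain Python; the function names and the sample numbers are made up for illustration and are not data from the paper.

```python
from math import sqrt
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical ratings for five videos (not results from the paper).
human_ratings = [3, 1, 2, 0, 2]
automated_scores = [0.9, 0.3, 0.6, 0.1, 0.5]
print(round(spearman(human_ratings, automated_scores), 3))
```

A value near 1 would support the alignment claim, while a value near 0 would indicate that the automated scores do not track human judgment. Computing the statistic separately for dynamic and static prompts would show whether agreement breaks down where the headline failures are reported.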

Original abstract

Text-to-video (T2V) models like Sora have made significant strides in visualizing complex prompts, which is increasingly viewed as a promising path towards constructing the universal world simulator. Cognitive psychologists believe that the foundation for achieving this goal is the ability to understand intuitive physics. However, the capacity of these models to accurately represent intuitive physics remains largely unexplored. To bridge this gap, we introduce PhyGenBench, a comprehensive Physics Generation Benchmark designed to evaluate physical commonsense correctness in T2V generation. PhyGenBench comprises 160 carefully crafted prompts across 27 distinct physical laws, spanning four fundamental domains, which can comprehensively assess models' understanding of physical commonsense. Alongside PhyGenBench, we propose a novel evaluation framework called PhyGenEval. This framework employs a hierarchical evaluation structure utilizing appropriate advanced vision-language models and large language models to assess physical commonsense. Through PhyGenBench and PhyGenEval, we can conduct large-scale automated assessments of T2V models' understanding of physical commonsense, which align closely with human feedback. Our evaluation results and in-depth analysis demonstrate that current models struggle to generate videos that comply with physical commonsense. Moreover, simply scaling up models or employing prompt engineering techniques is insufficient to fully address the challenges presented by PhyGenBench (e.g., dynamic scenarios). We hope this study will inspire the community to prioritize the learning of physical commonsense in these models beyond entertainment applications. We will release the data and code at https://github.com/OpenGVLab/PhyGenBench

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

PhyGenBench gives a concrete new prompt set and scoring pipeline for physical laws in video generation, but the human alignment of that pipeline lacks the numbers needed to back the strong claims about model failures.

Hi, the main point here is that the paper puts forward PhyGenBench with 160 prompts across 27 physical laws and a hierarchical PhyGenEval setup that uses VLMs and LLMs to score generated videos. The results indicate current text-to-video models fall short on these tests, and that simply making models larger or tweaking prompts does not close the gap, especially for dynamic cases. If the scores are reliable, this points to a real limitation in how these systems acquire intuitive physics for world simulation tasks.

They do a decent job making the benchmark explicit and committing to release the prompts and code, which lets others apply it directly without starting from scratch. The hierarchical breakdown also adds some structure compared to flatter evaluation approaches in related image or 3D work.

The softer spot is exactly the one the stress-test note flags. The abstract claims the automated scores align closely with human feedback, but there are no correlation figures, inter-rater numbers, or protocol details shown. Without those, it is hard to know whether the pipeline systematically misjudges certain laws or dynamic scenarios, which undercuts how firmly we can conclude that scaling is insufficient.

This is aimed at groups building or testing generative models for robotics, planning, or other applications that need physical accuracy rather than just visual appeal. Someone looking for a ready-made test suite for intuitive physics violations could use the prompts productively even before the scoring details are tightened.

I would send it to peer review. The benchmark construction is solid and specific enough to deserve referee attention, particularly on the human validation side.

Referee Report

1 major / 2 minor

Summary. The paper presents PhyGenBench, a benchmark consisting of 160 human-crafted prompts spanning 27 physical laws across four domains, together with PhyGenEval, a hierarchical pipeline that uses off-the-shelf VLMs and LLMs to score generated videos for compliance with physical commonsense. Using this framework the authors evaluate current text-to-video models, report that they fail to respect intuitive physics (especially in dynamic scenarios), and conclude that neither model scaling nor prompt engineering suffices to close the gap.

Significance. If the automated PhyGenEval scores can be shown to track human physical-correctness judgments, the benchmark would provide a concrete, reproducible testbed that highlights a genuine limitation in current T2V systems and could steer research toward explicit physical modeling rather than pure scaling. The planned public release of prompts, code, and evaluation pipeline is a clear strength that supports future work.

major comments (1)

[Abstract and Evaluation Framework] Abstract and Evaluation Framework section: the central claim that current models 'struggle' and that 'scaling up models or employing prompt engineering techniques is insufficient' rests entirely on PhyGenEval scores. The manuscript asserts these scores 'align closely with human feedback' yet reports no correlation coefficients, inter-rater agreement statistics (e.g., Cohen’s κ or Krippendorff’s α), or protocol details for the human study. Without these numbers it is impossible to assess whether the automated pipeline systematically over- or under-penalizes dynamic scenarios, weakening the evidence for the headline conclusions.
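
For reference, the first of the agreement statistics named in this comment can be computed in a few lines. Cohen's κ for two raters is (p_o - p_e) / (1 - p_e), where p_o is the observed agreement rate and p_e is the agreement expected by chance from each rater's label frequencies. The sketch below is a minimal plain-Python version; the function name and the sample labels are illustrative assumptions, not data from the paper.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)

    # Observed agreement: fraction of items given the same label.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: product of the two raters' label frequencies,
    # summed over every label either rater used.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = counts_a.keys() | counts_b.keys()
    p_chance = sum(counts_a[label] * counts_b[label] for label in labels) / (n * n)

    if p_chance == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (p_observed - p_chance) / (1 - p_chance)


# Hypothetical pass/fail judgments on four videos (1 = physically correct).
rater_1 = [1, 1, 1, 0]
rater_2 = [1, 1, 0, 0]
print(cohens_kappa(rater_1, rater_2))
```

Reporting κ alongside raw agreement matters here because most generated videos may fail the physics check, and high raw agreement can then arise from both raters simply labelling nearly everything as incorrect.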

minor comments (2)

[Benchmark Construction] The description of the 27 laws and their assignment to the four domains would benefit from an explicit table or appendix listing each law with one representative prompt and the precise physical principle being tested.
[Results] Figure captions and axis labels in the results section should explicitly state the number of videos evaluated per model and whether error bars reflect prompt-level or video-level variance.
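
The distinction raised in the [Results] comment can change the size of an error bar considerably. The sketch below contrasts the two choices on made-up scores: video-level spread pools every generated video, while prompt-level spread first averages the videos for each prompt. The function names, prompt identifiers, and scores are hypothetical.

```python
from statistics import mean, stdev
from typing import Dict, List, Tuple


def video_level_spread(scores_by_prompt: Dict[str, List[float]]) -> Tuple[float, float]:
    """Mean and sample standard deviation over all videos pooled together."""
    all_scores = [s for scores in scores_by_prompt.values() for s in scores]
    return mean(all_scores), stdev(all_scores)


def prompt_level_spread(scores_by_prompt: Dict[str, List[float]]) -> Tuple[float, float]:
    """Mean and sample standard deviation over per-prompt mean scores."""
    prompt_means = [mean(scores) for scores in scores_by_prompt.values()]
    return mean(prompt_means), stdev(prompt_means)


# Hypothetical scores: two prompts, two generated videos each.
scores = {
    "prompt_a": [0.0, 2.0],  # videos disagree strongly
    "prompt_b": [1.0, 1.0],  # videos agree
}
print(video_level_spread(scores))
print(prompt_level_spread(scores))
```

In this toy case both prompts have the same mean, so the prompt-level spread is zero even though individual videos vary widely. A caption that does not say which variance is plotted leaves the reader unable to tell these situations apart.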

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback and for recognizing the potential value of PhyGenBench and PhyGenEval as a testbed for physical commonsense in text-to-video models. We agree that quantitative validation of the automated scores against human judgments is essential to support the central claims and will strengthen the manuscript accordingly.

Referee: [Abstract and Evaluation Framework] Abstract and Evaluation Framework section: the central claim that current models 'struggle' and that 'scaling up models or employing prompt engineering techniques is insufficient' rests entirely on PhyGenEval scores. The manuscript asserts these scores 'align closely with human feedback' yet reports no correlation coefficients, inter-rater agreement statistics (e.g., Cohen’s κ or Krippendorff’s α), or protocol details for the human study. Without these numbers it is impossible to assess whether the automated pipeline systematically over- or under-penalizes dynamic scenarios, weakening the evidence for the headline conclusions.

Authors: We appreciate this observation and agree that the absence of quantitative validation metrics limits the strength of the evidence. The manuscript currently states that PhyGenEval scores align closely with human feedback but does not report correlation coefficients, inter-rater agreement, or full protocol details. In the revised version we will expand the Evaluation Framework section with a dedicated subsection on human validation. This will include: (1) the full study protocol (number of raters, their background, instructions provided, rating scale, and video presentation method); (2) inter-rater agreement statistics such as Cohen’s κ and Krippendorff’s α; and (3) correlation coefficients (Pearson and Spearman) between PhyGenEval scores and human ratings, with separate analysis for dynamic scenarios. We will add corresponding tables and discussion of any observed discrepancies. These revisions will directly address the concern and better support the claims about model limitations.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity; benchmark and evaluation are externally constructed

The paper defines PhyGenBench as 160 human-crafted prompts spanning 27 physical laws and introduces PhyGenEval as a hierarchical pipeline that applies off-the-shelf VLMs and LLMs to score generated videos. The central claim that T2V models fail physical commonsense is measured by applying this external pipeline to the prompts; neither the prompts nor the scoring rules are derived from model outputs or fitted parameters. No equations, self-citations, or uniqueness theorems are invoked in a load-bearing manner, and the evaluation framework remains independent of the tested models. The asserted alignment with human feedback, while lacking detailed quantitative metrics in the provided text, does not reduce any derivation to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The benchmark rests on the assumption that the selected 27 laws are representative of intuitive physics and that LLM-based scoring faithfully measures compliance; no free parameters or invented physical entities are introduced.

axioms (2)

Domain assumption: The 27 physical laws chosen across four domains comprehensively represent the intuitive physics needed for world simulation.
Stated in the abstract as the scope of PhyGenBench; no justification or coverage proof is supplied in the abstract.
Domain assumption: Advanced vision-language models and large language models can be chained to produce physical-correctness scores that align with human feedback.
Central to PhyGenEval; alignment is asserted but not demonstrated with quantitative human correlation numbers in the abstract.

Forward citations

Cited by 17 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

PhyGround: Benchmarking Physical Reasoning in Generative World Models
cs.CV 2026-05 accept novelty 7.0

PhyGround is a new benchmark with curated prompts, a 13-law taxonomy, large-scale human annotations, and an open physics-specialized VLM judge for evaluating physical reasoning in generative video models.
Do Joint Audio-Video Generation Models Understand Physics?
cs.SD 2026-05 unverdicted novelty 7.0

Current joint audio-video generation models lack robust physical commonsense, especially during transitions and when prompted for impossible behaviors.
AnimationBench: Are Video Models Good at Character-Centric Animation?
cs.CV 2026-04 unverdicted novelty 7.0

AnimationBench is the first benchmark that operationalizes the twelve basic principles of animation and IP preservation into scalable, VLM-assisted metrics for animation-style I2V generation.
Do-Undo Bench: Reversibility for Action Understanding in Image Generation
cs.CV 2025-12 unverdicted novelty 7.0

Do-Undo Bench is a new evaluation task and dataset that forces models to simulate forward action effects and then undo them to measure genuine action understanding in image generation.
VideoASMR-Bench: Can AI-Generated ASMR Videos Fool VLMs and Humans?
cs.CV 2025-12 unverdicted novelty 7.0

VideoASMR-Bench shows state-of-the-art VLMs fail to reliably detect AI-generated ASMR videos from real ones, though humans can still identify the fakes relatively easily.
Quantitative Video World Model Evaluation for Geometric-Consistency
cs.CV 2026-05 unverdicted novelty 6.0

PDI-Bench computes 3D projective residuals from segmented and tracked points to quantify geometric inconsistency in AI-generated videos.
PhyMotion: Structured 3D Motion Reward for Physics-Grounded Human Video Generation
cs.CV 2026-05 conditional novelty 6.0

PhyMotion scores generated human videos by grounding recovered 3D poses in a physics simulator across kinematic, contact, and dynamic axes, yielding stronger human correlation and larger RL post-training gains than pr...
WorldReasonBench: Human-Aligned Stress Testing of Video Generators as Future World-State Predictors
cs.CV 2026-05 unverdicted novelty 6.0

The paper presents WorldReasonBench, a benchmark that tests video generators on maintaining physical, social, logical, and informational consistency when predicting future states from initial conditions and actions.
How Far Are Video Models from True Multimodal Reasoning?
cs.CV 2026-04 unverdicted novelty 6.0

Current video models succeed on basic understanding but achieve under 25% success on logically grounded generation and near 0% on interactive generation, exposing gaps in multimodal reasoning.
SceneScribe-1M: A Large-Scale Video Dataset with Comprehensive Geometric and Semantic Annotations
cs.CV 2026-04 unverdicted novelty 6.0

SceneScribe-1M is a new dataset of 1 million videos with semantic text, camera parameters, dense depth, and consistent 3D point tracks to support monocular depth estimation, scene reconstruction, point tracking, and t...
CellFluxRL: Biologically-Constrained Virtual Cell Modeling via Reinforcement Learning
cs.LG 2026-03 unverdicted novelty 6.0

CellFluxRL post-trains the CellFlux generative model with reinforcement learning driven by biologically meaningful reward functions, yielding virtual cell images that better satisfy physical and biological constraints...
RAPO++: Cross-Stage Prompt Optimization for Text-to-Video Generation via Data Alignment and Test-Time Scaling
cs.CV 2025-10 unverdicted novelty 6.0

RAPO++ is a three-stage prompt optimization framework combining retrieval-augmented refinement, closed-loop test-time scaling, and LLM fine-tuning to enhance text-to-video generation quality.
Enhancing Physical Plausibility in Video Generation by Reasoning the Implausibility
cs.CV 2025-09 unverdicted novelty 6.0

A training-free framework uses physics-violating counterfactual prompts and Synchronized Decoupled Guidance to suppress implausible motions in diffusion-based video generation while preserving photorealism.
Genie Envisioner: A Unified World Foundation Platform for Robotic Manipulation
cs.RO 2025-08 unverdicted novelty 6.0

Genie Envisioner unifies robotic policy learning, simulation, and evaluation inside one instruction-conditioned video diffusion framework using GE-Base, GE-Act, and GE-Sim.
MAGI-1: Autoregressive Video Generation at Scale
cs.CV 2025-05 unverdicted novelty 6.0

MAGI-1 is a 24B-parameter autoregressive video world model that predicts denoised frame chunks sequentially with increasing noise to enable causal, scalable, streaming generation up to 4M token contexts.
VBench-2.0: Advancing Video Generation Benchmark Suite for Intrinsic Faithfulness
cs.CV 2025-03 accept novelty 6.0

VBench-2.0 is a benchmark suite that automatically evaluates video generative models on five dimensions of intrinsic faithfulness: Human Fidelity, Controllability, Creativity, Physics, and Commonsense using VLMs, LLMs...
World Action Models: The Next Frontier in Embodied AI
cs.RO 2026-05 unverdicted novelty 4.0

The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.


[1] [1]

URL https://www.pika.art/

Pika, 2023. URL https://www.pika.art/

work page 2023

[2]

Gen-3, 2024. URL https://runwayml.com/blog/introducing-gen-3-alpha/

[3]

Kling, 2024. URL https://kling.kuaishou.com/

[4]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023

[5]

Videophy: Evaluating physical commonsense for video generation

Hritik Bansal, Zongyu Lin, Tianyi Xie, Zeshun Zong, Michal Yarom, Yonatan Bitton, Chenfanfu Jiang, Yizhou Sun, Kai-Wei Chang, and Aditya Grover. Videophy: Evaluating physical commonsense for video generation. arXiv preprint arXiv:2406.03520, 2024

[6]

Simulation as an engine of physical scene understanding

Peter W Battaglia, Jessica B Hamrick, and Joshua B Tenenbaum. Simulation as an engine of physical scene understanding. Proceedings of the National Academy of Sciences, 110(45):18327–18332, 2013

[7]

Generating long videos of dynamic scenes

Tim Brooks, Janne Hellsten, Miika Aittala, Ting-Chun Wang, Timo Aila, Jaakko Lehtinen, Ming-Yu Liu, Alexei Efros, and Tero Karras. Generating long videos of dynamic scenes. Advances in Neural Information Processing Systems, 35:31769–31781, 2022

[8]

Commonsense-t2i challenge: Can text-to-image generation models understand commonsense?

Xingyu Fu, Muyu He, Yujie Lu, William Yang Wang, and Dan Roth. Commonsense-t2i challenge: Can text-to-image generation models understand commonsense? arXiv preprint arXiv:2406.07546, 2024

[9]

Vista: A generalizable driving world model with high fidelity and versatile controllability

Shenyuan Gao, Jiazhi Yang, Li Chen, Kashyap Chitta, Yihang Qiu, Andreas Geiger, Jun Zhang, and Hongyang Li. Vista: A generalizable driving world model with high fidelity and versatile controllability. arXiv preprint arXiv:2405.17398, 2024

[10]

Fundamentals of physics

David Halliday, Robert Resnick, and Jearl Walker. Fundamentals of physics. John Wiley & Sons, 2013

[11]

An interactive e-book for physics to improve students' conceptual mastery

Ahmad Harjono, Gunawan Gunawan, Rabiatul Adawiyah, and Lovy Herayanti. An interactive e-book for physics to improve students' conceptual mastery. International Journal of Emerging Technologies in Learning (iJET), 15(5):40–49, 2020

[12]

Venhancer: Generative space-time enhancement for video generation

Jingwen He, Tianfan Xue, Dongyang Liu, Xinqi Lin, Peng Gao, Dahua Lin, Yu Qiao, Wanli Ouyang, and Ziwei Liu. Venhancer: Generative space-time enhancement for video generation. arXiv preprint arXiv:2407.07667, 2024a

[13]

Mantisscore: Building automatic metrics to simulate fine-grained human feedback for video generation

Xuan He, Dongfu Jiang, Ge Zhang, Max Ku, Achint Soni, Sherman Siu, Haonan Chen, Abhranil Chandra, Ziyan Jiang, Aaran Arulraj, et al. Mantisscore: Building automatic metrics to simulate fine-grained human feedback for video generation. arXiv preprint arXiv:2406.15252, 2024b

[14]

CLIPScore: A Reference-free Evaluation Metric for Image Captioning

Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. CLIPScore: A reference-free evaluation metric for image captioning. arXiv preprint arXiv:2104.08718, 2021

[15]

Vbench: Comprehensive benchmark suite for video generative models

Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, et al. Vbench: Comprehensive benchmark suite for video generative models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 21807–21818, 2024

[16]

Grasp: A novel benchmark for evaluating language grounding and situated physics understanding in multimodal language models

Serwan Jassim, Mario Holubar, Annika Richter, Cornelius Wolff, Xenia Ohmer, and Elia Bruni. Grasp: A novel benchmark for evaluating language grounding and situated physics understanding in multimodal language models. arXiv preprint arXiv:2311.09048, 2023

[17]

Scaling Laws for Neural Language Models

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020

[18]

The Kinetics Human Action Video Dataset

Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, et al. The Kinetics human action video dataset. arXiv preprint arXiv:1705.06950, 2017

[19]

LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models

Feng Li, Renrui Zhang, Hao Zhang, Yuanhan Zhang, Bo Li, Wei Li, Zejun Ma, and Chunyuan Li. LLaVA-NeXT-Interleave: Tackling multi-image, video, and 3D in large multimodal models. arXiv preprint arXiv:2407.07895, 2024

[20]

Evaluation of text-to-video generation models: A dynamics perspective

Mingxiang Liao, Hannan Lu, Xinyu Zhang, Fang Wan, Tianyu Wang, Yuzhong Zhao, Wangmeng Zuo, Qixiang Ye, and Jingdong Wang. Evaluation of text-to-video generation models: A dynamics perspective. arXiv preprint arXiv:2407.01094, 2024

[21]

Evaluating text-to-visual generation with image-to-text generation

Zhiqiu Lin, Deepak Pathak, Baiqi Li, Jiayao Li, Xide Xia, Graham Neubig, Pengchuan Zhang, and Deva Ramanan. Evaluating text-to-visual generation with image-to-text generation. arXiv preprint arXiv:2404.01291, 2024

[22]

Visual instruction tuning

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in Neural Information Processing Systems, 36, 2024a

[23]

Physgen: Rigid-body physics-grounded image-to-video generation

Shaowei Liu, Zhongzheng Ren, Saurabh Gupta, and Shenlong Wang. Physgen: Rigid-body physics-grounded image-to-video generation. In European Conference on Computer Vision (ECCV), 2024b

[24]

Evalcrafter: Benchmarking and evaluating large video generation models

Yaofang Liu, Xiaodong Cun, Xuebo Liu, Xintao Wang, Yong Zhang, Haoxin Chen, Yang Liu, Tieyong Zeng, Raymond Chan, and Ying Shan. Evalcrafter: Benchmarking and evaluating large video generation models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 22139–22149, 2024c

[25]

Multimodal foundation world models for generalist embodied agents

Pietro Mazzaglia, Tim Verbelen, Bart Dhoedt, Aaron Courville, and Sai Rajeswar. Multimodal foundation world models for generalist embodied agents. arXiv preprint arXiv:2406.18043, 2024

[26]

Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

Machel Reid, Nikolay Savinov, Denis Teplyashin, Dmitry Lepikhin, Timothy Lillicrap, Jean-baptiste Alayrac, Radu Soricut, Angeliki Lazaridou, Orhan Firat, Julian Schrittwieser, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530, 2024

[27]

Improved techniques for training GANs

Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. Advances in Neural Information Processing Systems, 29, 2016

[28]

UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild

K Soomro. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012

[29]

T2v-compbench: A comprehensive benchmark for compositional text-to-video generation

Kaiyue Sun, Kaiyi Huang, Xian Liu, Yue Wu, Zihan Xu, Zhenguo Li, and Xihui Liu. T2v-compbench: A comprehensive benchmark for compositional text-to-video generation. arXiv preprint arXiv:2407.14505, 2024

[30]

The concept of physical law

Norman Swartz. The concept of physical law. Cambridge University Press, 1985

[31]

Vidgen-1m: A large-scale dataset for text-to-video generation

Zhiyu Tan, Xiaomeng Yang, Luozheng Qin, and Hao Li. Vidgen-1m: A large-scale dataset for text-to-video generation. arXiv preprint arXiv:2408.02629, 2024

[32]

Towards Accurate Generative Models of Video: A New Metric & Challenges

Thomas Unterthiner, Sjoerd Van Steenkiste, Karol Kurach, Raphael Marinier, Marcin Michalski, and Sylvain Gelly. Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:1812.01717, 2018

[33]

Lavie: High-quality video generation with cascaded latent diffusion models

Yaohui Wang, Xinyuan Chen, Xin Ma, Shangchen Zhou, Ziqi Huang, Yi Wang, Ceyuan Yang, Yinan He, Jiashuo Yu, Peiqing Yang, et al. Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:2309.15103, 2023

[34]

Internvideo2: Scaling video foundation models for multimodal video understanding

Yi Wang, Kunchang Li, Xinhao Li, Jiashuo Yu, Yinan He, Guo Chen, Baoqi Pei, Rongkun Zheng, Jilan Xu, Zun Wang, et al. Internvideo2: Scaling video foundation models for multimodal video understanding. arXiv preprint arXiv:2403.15377, 2024

[35]

Object permanence in newborn chicks is robust against opposing evidence

Justin N Wood, Tomer D Ullman, Brian W Wood, Elizabeth S Spelke, and Samantha MW Wood. Object permanence in newborn chicks is robust against opposing evidence. arXiv preprint arXiv:2402.14641, 2024

[36]

Pandora: Towards general world model with natural language actions and video states

Jiannan Xiang, Guangyi Liu, Yi Gu, Qiyue Gao, Yuting Ning, Yuheng Zha, Zeyu Feng, Tianhua Tao, Shibo Hao, Yemin Shi, et al. Pandora: Towards general world model with natural language actions and video states. arXiv preprint arXiv:2406.09455, 2024

[37]

CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. CogVideoX: Text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072, 2024

[38]

Chronomagic-bench: A benchmark for metamorphic evaluation of text-to-time-lapse video generation

Shenghai Yuan, Jinfa Huang, Yongqi Xu, Yaoyang Liu, Shaofeng Zhang, Yujun Shi, Ruijie Zhu, Xinhua Cheng, Jiebo Luo, and Li Yuan. Chronomagic-bench: A benchmark for metamorphic evaluation of text-to-time-lapse video generation. arXiv preprint arXiv:2406.18522, 2024

[39]

Open-Sora: Democratizing efficient video production for all

Zangwei Zheng, Xiangyu Peng, Tianji Yang, Chenhui Shen, Shenggui Li, Hongxin Liu, Yukun Zhou, Tianyi Li, and Yang You. Open-Sora: Democratizing efficient video production for all, March 2024. URL https://github.com/hpcaitech/Open-Sora

[40]

Is Sora a world simulator? A comprehensive survey on general world models and beyond

Zheng Zhu, Xiaofeng Wang, Wangbo Zhao, Chen Min, Nianchen Deng, Min Dou, Yuqi Wang, Botian Shi, Kai Wang, Chi Zhang, et al. Is Sora a world simulator? A comprehensive survey on general world models and beyond. arXiv preprint arXiv:2405.03520, 2024