Towards World Simulator: Crafting Physical Commonsense-Based Benchmark for Video Generation
Pith reviewed 2026-05-18 14:35 UTC · model grok-4.3
The pith
Text-to-video models fail to generate videos that follow basic physical laws, and scaling or prompt tweaks do not fix the gaps.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper shows that existing text-to-video models struggle to produce videos consistent with physical commonsense. PhyGenBench supplies 160 prompts spanning 27 laws in four domains, while PhyGenEval uses a staged pipeline of off-the-shelf vision-language and language models to produce scores that match human judgments of correctness. The evaluation finds clear shortcomings in dynamic cases that persist even after model scaling or prompt engineering.
What carries the argument
PhyGenBench supplies the 160 prompts across 27 physical laws, and PhyGenEval supplies the hierarchical VLM-LLM pipeline that scores physical correctness in generated videos.
If this is right
- Text-to-video models need explicit mechanisms to learn physical rules rather than relying on scale alone.
- Automated evaluation frameworks allow repeated testing across many models without repeated human review.
- Progress toward world simulators requires benchmarks that isolate physical commonsense from general visual quality.
- Dynamic scenarios remain the hardest category for current models to handle correctly.
Where Pith is reading between the lines
- Success on this benchmark could transfer to better performance in downstream tasks such as robotic simulation or planning.
- The same prompt-and-evaluation structure might be reused to test physical understanding in image or 3D generation models.
- If models close the gaps, generated videos could serve as more trustworthy training data for physics-aware AI systems.
Load-bearing premise
The off-the-shelf vision-language and language models in the evaluation pipeline judge physical correctness in videos the same way humans would.
What would settle it
A side-by-side human rating of the same set of generated videos that shows low agreement with the automated PhyGenEval scores would show the evaluation framework does not track human judgment.
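One way to run such a check is a rank correlation between per-video human ratings and automated scores. The sketch below is illustrative only: the rating scale, the score range, and the sample values are assumptions and are not taken from the paper, and the function names are hypothetical.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the tie-averaged ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical data: one human rating and one automated score per video.
human_ratings = [3, 2, 0, 1, 3, 0]
auto_scores = [0.9, 0.6, 0.1, 0.4, 0.8, 0.3]
rho = spearman(human_ratings, auto_scores)
print(f"Spearman rho = {rho:.3f}")
```

A value of rho near zero, or a value that drops sharply when only dynamic-scenario videos are included, would be the kind of evidence this paragraph describes.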
Original abstract
Text-to-video (T2V) models like Sora have made significant strides in visualizing complex prompts, which is increasingly viewed as a promising path towards constructing the universal world simulator. Cognitive psychologists believe that the foundation for achieving this goal is the ability to understand intuitive physics. However, the capacity of these models to accurately represent intuitive physics remains largely unexplored. To bridge this gap, we introduce PhyGenBench, a comprehensive Physics Generation Benchmark designed to evaluate physical commonsense correctness in T2V generation. PhyGenBench comprises 160 carefully crafted prompts across 27 distinct physical laws, spanning four fundamental domains, which can comprehensively assess models' understanding of physical commonsense. Alongside PhyGenBench, we propose a novel evaluation framework called PhyGenEval. This framework employs a hierarchical evaluation structure utilizing appropriate advanced vision-language models and large language models to assess physical commonsense. Through PhyGenBench and PhyGenEval, we can conduct large-scale automated assessments of T2V models' understanding of physical commonsense, which align closely with human feedback. Our evaluation results and in-depth analysis demonstrate that current models struggle to generate videos that comply with physical commonsense. Moreover, simply scaling up models or employing prompt engineering techniques is insufficient to fully address the challenges presented by PhyGenBench (e.g., dynamic scenarios). We hope this study will inspire the community to prioritize the learning of physical commonsense in these models beyond entertainment applications. We will release the data and code at https://github.com/OpenGVLab/PhyGenBench
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents PhyGenBench, a benchmark consisting of 160 human-crafted prompts spanning 27 physical laws across four domains, together with PhyGenEval, a hierarchical pipeline that uses off-the-shelf VLMs and LLMs to score generated videos for compliance with physical commonsense. Using this framework the authors evaluate current text-to-video models, report that they fail to respect intuitive physics (especially in dynamic scenarios), and conclude that neither model scaling nor prompt engineering suffices to close the gap.
Significance. If the automated PhyGenEval scores can be shown to track human physical-correctness judgments, the benchmark would provide a concrete, reproducible testbed that highlights a genuine limitation in current T2V systems and could steer research toward explicit physical modeling rather than pure scaling. The planned public release of prompts, code, and evaluation pipeline is a clear strength that supports future work.
major comments (1)
- [Abstract and Evaluation Framework] Abstract and Evaluation Framework section: the central claim that current models 'struggle' and that 'scaling up models or employing prompt engineering techniques is insufficient' rests entirely on PhyGenEval scores. The manuscript asserts these scores 'align closely with human feedback' yet reports no correlation coefficients, inter-rater agreement statistics (e.g., Cohen’s κ or Krippendorff’s α), or protocol details for the human study. Without these numbers it is impossible to assess whether the automated pipeline systematically over- or under-penalizes dynamic scenarios, weakening the evidence for the headline conclusions.
minor comments (2)
- [Benchmark Construction] The description of the 27 laws and their assignment to the four domains would benefit from an explicit table or appendix listing each law with one representative prompt and the precise physical principle being tested.
- [Results] Figure captions and axis labels in the results section should explicitly state the number of videos evaluated per model and whether error bars reflect prompt-level or video-level variance.
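The distinction between prompt-level and video-level variance in the last comment can change the size of an error bar considerably. The sketch below is a minimal illustration that assumes several videos are generated per prompt and that each video receives one numeric score; the data layout and function names are hypothetical and are not taken from the paper.

```python
from statistics import mean, stdev


def video_level_sem(scores_by_prompt):
    """Standard error treating every video as an independent sample."""
    all_scores = [s for scores in scores_by_prompt.values() for s in scores]
    return stdev(all_scores) / len(all_scores) ** 0.5


def prompt_level_sem(scores_by_prompt):
    """Standard error treating each prompt's mean score as one sample."""
    prompt_means = [mean(scores) for scores in scores_by_prompt.values()]
    return stdev(prompt_means) / len(prompt_means) ** 0.5


# Hypothetical scores: two prompts, two generated videos each.
scores = {
    "prompt_a": [0.0, 0.0],
    "prompt_b": [1.0, 1.0],
}
print(f"video-level SEM:  {video_level_sem(scores):.3f}")
print(f"prompt-level SEM: {prompt_level_sem(scores):.3f}")
```

When videos from the same prompt score alike, as in this toy data, the video-level figure understates the uncertainty, which is why a caption needs to say which one is plotted.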
Simulated Author's Rebuttal
We thank the referee for their constructive feedback and for recognizing the potential value of PhyGenBench and PhyGenEval as a testbed for physical commonsense in text-to-video models. We agree that quantitative validation of the automated scores against human judgments is essential to support the central claims and will strengthen the manuscript accordingly.
point-by-point responses
- Referee: [Abstract and Evaluation Framework] Abstract and Evaluation Framework section: the central claim that current models 'struggle' and that 'scaling up models or employing prompt engineering techniques is insufficient' rests entirely on PhyGenEval scores. The manuscript asserts these scores 'align closely with human feedback' yet reports no correlation coefficients, inter-rater agreement statistics (e.g., Cohen’s κ or Krippendorff’s α), or protocol details for the human study. Without these numbers it is impossible to assess whether the automated pipeline systematically over- or under-penalizes dynamic scenarios, weakening the evidence for the headline conclusions.
Authors: We appreciate this observation and agree that the absence of quantitative validation metrics limits the strength of the evidence. The manuscript currently states that PhyGenEval scores align closely with human feedback but does not report correlation coefficients, inter-rater agreement, or full protocol details. In the revised version we will expand the Evaluation Framework section with a dedicated subsection on human validation. This will include: (1) the full study protocol (number of raters, their background, instructions provided, rating scale, and video presentation method); (2) inter-rater agreement statistics such as Cohen’s κ and Krippendorff’s α; and (3) correlation coefficients (Pearson and Spearman) between PhyGenEval scores and human ratings, with separate analysis for dynamic scenarios. We will add corresponding tables and discussion of any observed discrepancies. These revisions will directly address the concern and better support the claims about model limitations.
revision: yes
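For reference, Cohen's κ, one of the inter-rater agreement statistics named in this exchange, compares the observed agreement between two raters with the agreement expected by chance. The sketch below is a minimal illustration for two raters and categorical labels; the labels and sample data are hypothetical and do not come from the paper's human study.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who labelled the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two equally long, non-empty label lists")
    n = len(rater_a)
    # Fraction of items on which the two raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label; kappa is undefined.
        return float("nan")
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels: does each video obey the prompted physical law?
rater_1 = ["yes", "yes", "no", "no", "yes", "no"]
rater_2 = ["yes", "no", "no", "no", "yes", "yes"]
print(f"Cohen's kappa = {cohens_kappa(rater_1, rater_2):.3f}")
```

Krippendorff's α, also named above, generalizes this idea to more than two raters and to missing ratings; it is not implemented here.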
Circularity Check
No significant circularity; benchmark and evaluation are externally constructed
full rationale
The paper defines PhyGenBench as 160 human-crafted prompts spanning 27 physical laws and introduces PhyGenEval as a hierarchical pipeline that applies off-the-shelf VLMs and LLMs to score generated videos. The central claim that T2V models fail physical commonsense is measured by applying this external pipeline to the prompts; neither the prompts nor the scoring rules are derived from model outputs or fitted parameters. No equations, self-citations, or uniqueness theorems are invoked in a load-bearing manner, and the evaluation framework remains independent of the tested models. The asserted alignment with human feedback, while lacking detailed quantitative metrics in the provided text, does not reduce any derivation to its own inputs by construction.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: The 27 physical laws chosen across four domains comprehensively represent the intuitive physics needed for world simulation.
- domain assumption: Advanced vision-language models and large language models can be chained to produce physical-correctness scores that align with human feedback.
Forward citations
Cited by 17 Pith papers
- PhyGround: Benchmarking Physical Reasoning in Generative World Models
  PhyGround is a new benchmark with curated prompts, a 13-law taxonomy, large-scale human annotations, and an open physics-specialized VLM judge for evaluating physical reasoning in generative video models.
- Do Joint Audio-Video Generation Models Understand Physics?
  Current joint audio-video generation models lack robust physical commonsense, especially during transitions and when prompted for impossible behaviors.
- AnimationBench: Are Video Models Good at Character-Centric Animation?
  AnimationBench is the first benchmark that operationalizes the twelve basic principles of animation and IP preservation into scalable, VLM-assisted metrics for animation-style I2V generation.
- Do-Undo Bench: Reversibility for Action Understanding in Image Generation
  Do-Undo Bench is a new evaluation task and dataset that forces models to simulate forward action effects and then undo them to measure genuine action understanding in image generation.
- VideoASMR-Bench: Can AI-Generated ASMR Videos Fool VLMs and Humans?
  VideoASMR-Bench shows state-of-the-art VLMs fail to reliably detect AI-generated ASMR videos from real ones, though humans can still identify the fakes relatively easily.
- Quantitative Video World Model Evaluation for Geometric-Consistency
  PDI-Bench computes 3D projective residuals from segmented and tracked points to quantify geometric inconsistency in AI-generated videos.
- PhyMotion: Structured 3D Motion Reward for Physics-Grounded Human Video Generation
  PhyMotion scores generated human videos by grounding recovered 3D poses in a physics simulator across kinematic, contact, and dynamic axes, yielding stronger human correlation and larger RL post-training gains than pr...
- WorldReasonBench: Human-Aligned Stress Testing of Video Generators as Future World-State Predictors
  The paper presents WorldReasonBench, a benchmark that tests video generators on maintaining physical, social, logical, and informational consistency when predicting future states from initial conditions and actions.
- How Far Are Video Models from True Multimodal Reasoning?
  Current video models succeed on basic understanding but achieve under 25% success on logically grounded generation and near 0% on interactive generation, exposing gaps in multimodal reasoning.
- SceneScribe-1M: A Large-Scale Video Dataset with Comprehensive Geometric and Semantic Annotations
  SceneScribe-1M is a new dataset of 1 million videos with semantic text, camera parameters, dense depth, and consistent 3D point tracks to support monocular depth estimation, scene reconstruction, point tracking, and t...
- CellFluxRL: Biologically-Constrained Virtual Cell Modeling via Reinforcement Learning
  CellFluxRL post-trains the CellFlux generative model with reinforcement learning driven by biologically meaningful reward functions, yielding virtual cell images that better satisfy physical and biological constraints...
- RAPO++: Cross-Stage Prompt Optimization for Text-to-Video Generation via Data Alignment and Test-Time Scaling
  RAPO++ is a three-stage prompt optimization framework combining retrieval-augmented refinement, closed-loop test-time scaling, and LLM fine-tuning to enhance text-to-video generation quality.
- Enhancing Physical Plausibility in Video Generation by Reasoning the Implausibility
  A training-free framework uses physics-violating counterfactual prompts and Synchronized Decoupled Guidance to suppress implausible motions in diffusion-based video generation while preserving photorealism.
- Genie Envisioner: A Unified World Foundation Platform for Robotic Manipulation
  Genie Envisioner unifies robotic policy learning, simulation, and evaluation inside one instruction-conditioned video diffusion framework using GE-Base, GE-Act, and GE-Sim.
- MAGI-1: Autoregressive Video Generation at Scale
  MAGI-1 is a 24B-parameter autoregressive video world model that predicts denoised frame chunks sequentially with increasing noise to enable causal, scalable, streaming generation up to 4M token contexts.
- VBench-2.0: Advancing Video Generation Benchmark Suite for Intrinsic Faithfulness
  VBench-2.0 is a benchmark suite that automatically evaluates video generative models on five dimensions of intrinsic faithfulness: Human Fidelity, Controllability, Creativity, Physics, and Commonsense using VLMs, LLMs...
- World Action Models: The Next Frontier in Embodied AI
  The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.