VideoPhy: Evaluating Physical Commonsense for Video Generation
Pith reviewed 2026-05-20 11:30 UTC · model grok-4.3
The pith
Text-to-video models generate videos that follow both captions and physical laws in fewer than 40 percent of cases.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
VideoPhy's human evaluation shows that existing text-to-video generative models frequently fail to follow the given text prompts and to respect physical commonsense: the best model, CogVideoX-5B, produces videos that satisfy both the caption and physical laws on only 39.6 percent of instances.
What carries the argument
The VideoPhy benchmark, which supplies diverse prompts involving material-type interactions and measures success via human judgment of caption adherence plus physical-law compliance.
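The headline metric combines two human verdicts per video. A minimal sketch of how such a joint rate could be computed is given here, assuming each video receives one binary caption-adherence label and one binary physical-commonsense label. The `Judgment` schema, the function name, and the toy data are hypothetical and are not taken from the paper's released code.

```python
from dataclasses import dataclass


@dataclass
class Judgment:
    """One human verdict on one generated video (hypothetical schema)."""
    follows_caption: bool  # does the video match the text prompt?
    follows_physics: bool  # does the video respect physical commonsense?


def joint_success_rate(judgments):
    """Fraction of videos judged to satisfy both criteria at once."""
    if not judgments:
        raise ValueError("need at least one judgment")
    passed = sum(1 for j in judgments if j.follows_caption and j.follows_physics)
    return passed / len(judgments)


# Toy labels for illustration only; these are not results from the paper.
toy = [
    Judgment(True, True),
    Judgment(True, False),
    Judgment(False, True),
    Judgment(False, False),
    Judgment(True, True),
]
print(joint_success_rate(toy))  # 2 of the 5 toy videos pass both checks
```

Because a video must pass both checks, the joint rate can never exceed the lower of the two individual rates, which is one reason a joint figure looks harsher than either criterion alone.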
If this is right
- Video generative models remain far from accurately simulating the physical world.
- Progress on future models can be tracked with the released VideoPhy prompts and protocol.
- The automated VideoCon-Physics evaluator can be applied to newly released models without repeated human studies.
Where Pith is reading between the lines
- Improved performance on VideoPhy could make generated videos more usable for planning tasks that require realistic motion.
- Weak results on fluid-solid interactions may indicate specific gaps that targeted training data or loss terms could address.
- The same curation approach could be extended to create benchmarks for other forms of commonsense such as object permanence or causal chains.
Load-bearing premise
Human evaluators can reliably and consistently judge whether a generated video follows physical commonsense for the curated prompts.
What would settle it
A new model that produces videos judged by humans to follow both the prompt and physical laws on more than 70 percent of VideoPhy instances would weaken the claim that current generators lack physical commonsense.
Original abstract
Recent advances in internet-scale video data pretraining have led to the development of text-to-video generative models that can create high-quality videos across a broad range of visual concepts, synthesize realistic motions and render complex objects. Hence, these generative models have the potential to become general-purpose simulators of the physical world. However, it is unclear how far we are from this goal with the existing text-to-video generative models. To this end, we present VideoPhy, a benchmark designed to assess whether the generated videos follow physical commonsense for real-world activities (e.g., marbles will roll down when placed on a slanted surface). Specifically, we curate diverse prompts that involve interactions between various material types in the physical world (e.g., solid-solid, solid-fluid, fluid-fluid). We then generate videos conditioned on these captions from diverse state-of-the-art text-to-video generative models, including open models (e.g., CogVideoX) and closed models (e.g., Lumiere, Dream Machine). Our human evaluation reveals that the existing models severely lack the ability to generate videos adhering to the given text prompts, while also lacking physical commonsense. Specifically, the best performing model, CogVideoX-5B, generates videos that adhere to the caption and physical laws for 39.6% of the instances. VideoPhy thus highlights that the video generative models are far from accurately simulating the physical world. Finally, we propose an auto-evaluator, VideoCon-Physics, to assess the performance reliably for the newly released models.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces VideoPhy, a benchmark for assessing physical commonsense in text-to-video generative models. It curates prompts involving material interactions (solid-solid, solid-fluid, fluid-fluid), generates videos from open and closed SOTA models (e.g., CogVideoX-5B, Lumiere), and reports human evaluation results showing that even the best model adheres to both the caption and physical laws in only 39.6% of cases. The work also proposes an automatic evaluator, VideoCon-Physics, for scalable assessment of future models.
Significance. If the human evaluation results hold, this benchmark provides concrete evidence of a substantial gap in current video generation models' ability to simulate real-world physics, which is important for their potential use as general-purpose simulators. The direct use of human judgments on curated physical interactions supplies falsifiable, model-agnostic evidence rather than relying on self-referential metrics. The proposal of VideoCon-Physics is a constructive addition for reproducibility and future work. The evaluation across both open and closed models and the focus on diverse material-type interactions are particular strengths.
major comments (2)
- [Human Evaluation] Human evaluation protocol: the central 39.6% figure for CogVideoX-5B (and all other reported percentages) is presented without inter-annotator agreement statistics (e.g., Fleiss' kappa or pairwise agreement) or error bars on the physical-laws label. Because the claim that models 'severely lack' physical commonsense rests directly on these human judgments, the absence of agreement data makes it difficult to separate model failure from annotator variance.
- [Benchmark Construction] Prompt curation and validation: the description of how prompts were selected and verified to test genuine physical commonsense (rather than ambiguous or underspecified cases) remains high-level. More detail on the curation process, including any expert review or pilot testing for physical accuracy, would be needed to establish that the benchmark instances are load-bearing tests of the claimed capability gap.
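The first major comment asks for inter-annotator agreement such as Fleiss' kappa. As a rough illustration of what that statistic measures, the sketch below implements the standard Fleiss' kappa formula for a table of per-item category counts. The function name, the number of raters, and the toy tables are assumptions for illustration; the source does not state how many annotators labeled each video.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for counts[i][k] = raters who put item i in category k.

    Every item must be rated by the same number of raters (at least two).
    """
    if not counts:
        raise ValueError("need at least one item")
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2:
        raise ValueError("need at least two raters per item")
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item needs the same number of raters")
    n_cats = len(counts[0])

    # Share of all ratings that fell into each category.
    p_k = [sum(row[k] for row in counts) / (n_items * n_raters)
           for k in range(n_cats)]
    # Observed agreement per item, then averaged over items.
    p_i = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
           for row in counts]
    p_bar = sum(p_i) / n_items
    # Agreement expected by chance from the category shares.
    p_e = sum(p * p for p in p_k)
    if p_e == 1.0:
        return 1.0  # degenerate case: every rating is in a single category
    return (p_bar - p_e) / (1.0 - p_e)


# Toy tables with columns [violates physics, follows physics], 3 raters each.
unanimous = [[3, 0], [0, 3], [3, 0], [0, 3]]
split = [[2, 1], [1, 2]]
print(fleiss_kappa(unanimous), fleiss_kappa(split))
```

A kappa near 1 would indicate that the physical-commonsense label is well defined for annotators, while a value near 0 would suggest that much of the reported gap could reflect rater disagreement rather than model failure.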
minor comments (3)
- [Abstract] The abstract and results sections use the phrase 'severely lack' for the 39.6% figure; a more precise statement of the quantitative gap would improve tone and clarity.
- [Results] A table or figure presenting a per-category breakdown (solid-solid vs. fluid-fluid, etc.) would help readers assess whether failures are uniform or concentrated in particular interaction types.
- [Auto-Evaluator] The auto-evaluator VideoCon-Physics is introduced but its correlation with human judgments and any ablation on its training data are not detailed enough for independent reproduction.
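The last minor comment concerns how well the auto-evaluator tracks human judgments. One common way to quantify that, shown here as a sketch under assumptions, is ROC-AUC between the evaluator's continuous scores and binary human labels. The source does not confirm which agreement metric the paper reports, and the function name and toy values below are hypothetical.

```python
def roc_auc(scores, labels):
    """ROC-AUC via pairwise comparison (Mann-Whitney form), ties count half.

    scores: evaluator outputs, higher meaning "more likely physically valid".
    labels: human verdicts, 1 for valid and 0 for invalid.
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy evaluator scores and human labels; not data from the paper.
toy_scores = [0.9, 0.2, 0.8, 0.1]
toy_labels = [1, 1, 0, 0]
print(roc_auc(toy_scores, toy_labels))
```

The quadratic pairwise loop is chosen for readability; for large annotation sets a rank-based implementation from an established library would be the more practical choice.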
Simulated Author's Rebuttal
Thank you for the constructive feedback on our paper. We address the major comments below and plan to incorporate revisions to improve the clarity and rigor of our human evaluation and benchmark construction sections.
- Referee: [Human Evaluation] Human evaluation protocol: the central 39.6% figure for CogVideoX-5B (and all other reported percentages) is presented without inter-annotator agreement statistics (e.g., Fleiss' kappa or pairwise agreement) or error bars on the physical-laws label. Because the claim that models 'severely lack' physical commonsense rests directly on these human judgments, the absence of agreement data makes it difficult to separate model failure from annotator variance.
  Authors: We agree that reporting inter-annotator agreement is important for validating the reliability of our human evaluation results. In the revised manuscript, we will include Fleiss' kappa scores for the annotations on physical adherence and caption adherence. Additionally, we will provide error bars or confidence intervals for the reported percentages to better quantify the variability in the human judgments. This will help demonstrate that the observed low performance is indeed due to model limitations rather than annotator disagreement.
  revision: yes
- Referee: [Benchmark Construction] Prompt curation and validation: the description of how prompts were selected and verified to test genuine physical commonsense (rather than ambiguous or underspecified cases) remains high-level. More detail on the curation process, including any expert review or pilot testing for physical accuracy, would be needed to establish that the benchmark instances are load-bearing tests of the claimed capability gap.
  Authors: We thank the referee for this suggestion. In the original manuscript, we provided a high-level overview of the prompt curation to maintain focus on the evaluation results. However, we acknowledge that additional details would enhance the reproducibility and credibility of the benchmark. In the revised version, we will expand the section on benchmark construction to include more specifics on the prompt selection criteria, the process of verifying physical accuracy through pilot studies, and any expert consultations or reviews conducted to ensure the prompts test genuine physical commonsense without ambiguity.
  revision: yes
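The rebuttal promises error bars or confidence intervals on the reported percentages. A minimal sketch of one standard option, the Wilson score interval for a binomial proportion, is shown below. The choice of the Wilson method, the function name, and the toy counts are assumptions; the source does not say which interval the authors would use or how many instances underlie the 39.6% figure.

```python
import math


def wilson_interval(successes, total, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%)."""
    if total <= 0 or not 0 <= successes <= total:
        raise ValueError("need 0 <= successes <= total and total > 0")
    p = successes / total
    denom = 1.0 + z * z / total
    center = (p + z * z / (2.0 * total)) / denom
    half = z * math.sqrt(p * (1.0 - p) / total
                         + z * z / (4.0 * total * total)) / denom
    return center - half, center + half


# Hypothetical counts for illustration; not the paper's sample size.
low, high = wilson_interval(successes=50, total=100)
print(f"{low:.3f} to {high:.3f}")
```

The Wilson interval is used in this sketch because it stays inside the range 0 to 1 and behaves sensibly for small samples and for proportions near the extremes, where the simpler normal approximation can misbehave.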
Circularity Check
No circularity in benchmark evaluation or auto-evaluator proposal
The paper curates a set of text prompts involving physical interactions across material types and evaluates outputs from existing text-to-video models via human judgment on adherence to both captions and physical laws. No equations, parameter fitting, or first-principles derivations are claimed; the 39.6% figure for CogVideoX-5B is a direct empirical count from external model generations and annotator labels. The proposed VideoCon-Physics auto-evaluator is introduced as a new tool without reducing to any self-citation chain or redefinition of inputs. All load-bearing steps rely on independent human evaluation protocols and publicly available generative models rather than internal consistency loops.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Human raters can accurately detect violations of physical commonsense in short video clips.
Forward citations
Cited by 20 Pith papers
- PhyGround: Benchmarking Physical Reasoning in Generative World Models
  PhyGround is a new benchmark with curated prompts, a 13-law taxonomy, large-scale human annotations, and an open physics-specialized VLM judge for evaluating physical reasoning in generative video models.
- Do Joint Audio-Video Generation Models Understand Physics?
  Current joint audio-video generation models lack robust physical commonsense, especially during transitions and when prompted for impossible behaviors.
- CMTA: Leveraging Cross-Modal Temporal Artifacts for Generalizable AI-Generated Video Detection
  CMTA detects AI-generated videos by capturing unnatural temporal stability in visual-textual semantic alignment via joint embeddings and multi-grained temporal modeling, outperforming prior methods in cross-generator tests.
- MoRight: Motion Control Done Right
  MoRight disentangles object and camera motion via canonical-view specification and temporal cross-view attention, while decomposing motion into active user-driven and passive consequence components to learn and apply ...
- PlayWorld: Learning Robot World Models from Autonomous Play
  PlayWorld learns high-fidelity robot world models from unsupervised self-play, producing physically consistent video predictions that outperform models trained on human data and enabling 65% better real-world policy p...
- VideoASMR-Bench: Can AI-Generated ASMR Videos Fool VLMs and Humans?
  VideoASMR-Bench shows state-of-the-art VLMs fail to reliably detect AI-generated ASMR videos from real ones, though humans can still identify the fakes relatively easily.
- DreamGen: Unlocking Generalization in Robot Learning through Video World Models
  DreamGen trains robot policies on synthetic trajectories from adapted video world models, enabling a humanoid robot to perform 22 new behaviors in seen and unseen environments from a single pick-and-place teleoperatio...
- NEWTON: Agentic Planning for Physically Grounded Video Generation
  NEWTON improves physical accuracy in video generation by deploying a trainable planner that coordinates physics-aware tools and a verifier, raising joint accuracy on VideoPhy-2 without altering the base generators.
- Enhancing Train-Free Infinite-Frame Generation for Consistent Long Videos
  MIGA introduces two-stage alignment to close train-inference gaps and dual consistency enhancement via self-reflection and long-range guidance to achieve SOTA temporal consistency in infinite-frame video generation on...
- PanoWorld: Geometry-Consistent Panoramic Video World Modeling
  PanoWorld adds depth consistency and trajectory consistency losses plus spherical adaptations to a pre-trained video model, plus a new PanoGeo dataset, to produce geometry-consistent 360 video.
- Quantitative Video World Model Evaluation for Geometric-Consistency
  PDI-Bench computes 3D projective residuals from segmented and tracked points to quantify geometric inconsistency in AI-generated videos.
- How Far Are Video Models from True Multimodal Reasoning?
  Current video models succeed on basic understanding but achieve under 25% success on logically grounded generation and near 0% on interactive generation, exposing gaps in multimodal reasoning.
- ATSS: Detecting AI-Generated Videos via Anomalous Temporal Self-Similarity
  ATSS detects AI-generated videos by measuring unnatural repetitive temporal correlations in triple similarity matrices derived from frame visuals and semantic descriptions.
- RAPO++: Cross-Stage Prompt Optimization for Text-to-Video Generation via Data Alignment and Test-Time Scaling
  RAPO++ is a three-stage prompt optimization framework combining retrieval-augmented refinement, closed-loop test-time scaling, and LLM fine-tuning to enhance text-to-video generation quality.
- Robotic Manipulation by Imitating Generated Videos Without Physical Demonstrations
  RIGVid shows that filtered AI-generated videos can serve as effective supervision for complex robotic manipulation tasks without any real demonstrations.
- MAGI-1: Autoregressive Video Generation at Scale
  MAGI-1 is a 24B-parameter autoregressive video world model that predicts denoised frame chunks sequentially with increasing noise to enable causal, scalable, streaming generation up to 4M token contexts.
- Towards World Simulator: Crafting Physical Commonsense-Based Benchmark for Video Generation
  PhyGenBench supplies 160 prompts across 27 physical laws and an automated LLM/VLM evaluation pipeline to measure physical commonsense compliance in current text-to-video models.
- Actionable World Representation
  WorldString is a fully differentiable neural model for representing actionable object states learned from 3D sensor data.
- World Action Models: The Next Frontier in Embodied AI
  The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.
- World Simulation with Video Foundation Models for Physical AI
  Cosmos-Predict2.5 unifies text-to-world, image-to-world, and video-to-world generation in one model trained on 200M clips with RL post-training, delivering improved quality and control for physical AI.
Reference graph
Works this paper leans on
-
[1]
Luma Dream Machine | AI Video Generator — lumalabs.ai
Luma AI. Luma Dream Machine | AI Video Generator — lumalabs.ai. https://lumalabs. ai/dream-machine, 2024
work page 2024
-
[2]
Frozen in time: A joint video and image encoder for end-to-end retrieval
Max Bain, Arsha Nagrani, Gül Varol, and Andrew Zisserman. Frozen in time: A joint video and image encoder for end-to-end retrieval. In IEEE International Conference on Computer Vision, 2021
work page 2021
-
[3]
Videocon: Robust video-language alignment via contrast captions
Hritik Bansal, Yonatan Bitton, Idan Szpektor, Kai-Wei Chang, and Aditya Grover. Videocon: Robust video-language alignment via contrast captions. arXiv preprint arXiv:2311.10111, 2023
-
[4]
Talc: Time-aligned captions for multi-scene text-to-video generation
Hritik Bansal, Yonatan Bitton, Michal Yarom, Idan Szpektor, Aditya Grover, and Kai-Wei Chang. Talc: Time-aligned captions for multi-scene text-to-video generation. arXiv preprint arXiv:2405.04682, 2024
-
[5]
Hritik Bansal, Ashima Suvarna, Gantavya Bhatt, Nanyun Peng, Kai-Wei Chang, and Aditya Grover. Comparing bad apples to good oranges: Aligning large language models via joint preference optimization. arXiv preprint arXiv:2404.00530, 2024
-
[6]
Hritik Bansal, Da Yin, Masoud Monajatipoor, and Kai-Wei Chang. How well can text-to- image generative models understand ethical natural language interventions? arXiv preprint arXiv:2210.15230, 2022
-
[7]
Lumiere: A space-time diffusion model for video generation
Omer Bar-Tal, Hila Chefer, Omer Tov, Charles Herrmann, Roni Paiss, Shiran Zada, Ariel Ephrat, Junhwa Hur, Yuanzhen Li, Tomer Michaeli, et al. Lumiere: A space-time diffusion model for video generation. arXiv preprint arXiv:2401.12945, 2024
-
[8]
David Baraff. An introduction to physically based modeling: rigid body simulation i—unconstrained rigid body dynamics. SIGGRAPH course notes, 82, 1997
work page 1997
-
[9]
A fast variational framework for accurate solid-fluid coupling
Christopher Batty, Florence Bertails, and Robert Bridson. A fast variational framework for accurate solid-fluid coupling. ACM Transactions on Graphics (TOG), 26(3):100–es, 2007
work page 2007
-
[10]
Piqa: Reasoning about physical commonsense in natural language
Yonatan Bisk, Rowan Zellers, Jianfeng Gao, Yejin Choi, et al. Piqa: Reasoning about physical commonsense in natural language. In Proceedings of the AAAI conference on artificial intelligence, volume 34, pages 7432–7439, 2020
work page 2020
-
[11]
Visit-bench: A benchmark for vision-language instruction following inspired by real-world use
Yonatan Bitton, Hritik Bansal, Jack Hessel, Rulin Shao, Wanrong Zhu, Anas Awadalla, Josh Gardner, Rohan Taori, and Ludwig Schmidt. Visit-bench: A benchmark for vision-language instruction following inspired by real-world use. arXiv preprint arXiv:2308.06595, 2023
-
[13]
Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets
Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Do- minik Lorenz, Yam Levi, Zion English, Vikram V oleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[14]
Align your latents: High-resolution video synthesis with latent diffusion models
Andreas Blattmann, Robin Rombach, Huan Ling, Tim Dockhorn, Seung Wook Kim, Sanja Fidler, and Karsten Kreis. Align your latents: High-resolution video synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22563–22575, 2023
work page 2023
-
[15]
Fluid simulation for computer graphics
Robert Bridson. Fluid simulation for computer graphics. AK Peters/CRC Press, 2015
work page 2015
-
[16]
Generating long videos of dynamic scenes
Tim Brooks, Janne Hellsten, Miika Aittala, Ting-Chun Wang, Timo Aila, Jaakko Lehtinen, Ming-Yu Liu, Alexei Efros, and Tero Karras. Generating long videos of dynamic scenes. Advances in Neural Information Processing Systems, 35:31769–31781, 2022. 12
work page 2022
-
[17]
Video generation models as world simulators
Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, Clarence Ng, Ricky Wang, and Aditya Ramesh. Video generation models as world simulators. 2024
work page 2024
-
[18]
Genie: Generative interactive environments
Jake Bruce, Michael Dennis, Ashley Edwards, Jack Parker-Holder, Yuge Shi, Edward Hughes, Matthew Lai, Aditi Mavalankar, Richie Steigerwald, Chris Apps, et al. Genie: Generative interactive environments. arXiv preprint arXiv:2402.15391, 2024
-
[19]
Storybench: A multifaceted benchmark for continuous story visualization
Emanuele Bugliarello, H Hernan Moraldo, Ruben Villegas, Mohammad Babaeizadeh, Moham- mad Taghi Saffar, Han Zhang, Dumitru Erhan, Vittorio Ferrari, Pieter-Jan Kindermans, and Paul V oigtlaender. Storybench: A multifaceted benchmark for continuous story visualization. Advances in Neural Information Processing Systems, 36, 2024
work page 2024
-
[20]
cerspense/zeroscope_v2_576w · Hugging Face — huggingface.co
cerspense. cerspense/zeroscope_v2_576w · Hugging Face — huggingface.co. https:// huggingface.co/cerspense/zeroscope_v2_576w, 2023
work page 2023
-
[21]
Videocrafter2: Overcoming data limitations for high-quality video diffusion models
Haoxin Chen, Yong Zhang, Xiaodong Cun, Menghan Xia, Xintao Wang, Chao Weng, and Ying Shan. Videocrafter2: Overcoming data limitations for high-quality video diffusion models. arXiv preprint arXiv:2401.09047, 2024
-
[22]
Physical simulation of environmentally induced thin shell deformation
Hsiao-Yu Chen, Arnav Sastry, Wim M van Rees, and Etienne V ouga. Physical simulation of environmentally induced thin shell deformation. ACM Transactions on Graphics (TOG), 37(4):1–13, 2018
work page 2018
-
[23]
Panda-70m: Captioning 70m videos with multiple cross-modality teachers
Tsai-Shien Chen, Aliaksandr Siarohin, Willi Menapace, Ekaterina Deyneka, Hsiang-wei Chao, Byung Eun Jeon, Yuwei Fang, Hsin-Ying Lee, Jian Ren, Ming-Hsuan Yang, et al. Panda-70m: Captioning 70m videos with multiple cross-modality teachers. arXiv preprint arXiv:2402.19479, 2024
-
[24]
Yunuo Chen, Tianyi Xie, Cem Yuksel, Danny Kaufman, Yin Yang, Chenfanfu Jiang, and Minchen Li. Multi-layer thick shells. In ACM SIGGRAPH 2023 Conference Proceedings, pages 1–9, 2023
work page 2023
-
[25]
Learning universal policies via text-guided video generation
Yilun Du, Sherry Yang, Bo Dai, Hanjun Dai, Ofir Nachum, Josh Tenenbaum, Dale Schuurmans, and Pieter Abbeel. Learning universal policies via text-guided video generation. Advances in Neural Information Processing Systems, 36, 2024
work page 2024
-
[26]
A survey on machine learning approaches for modelling intuitive physics
Jiafei Duan, Arijit Dasgupta, Jason Fischer, and Cheston Tan. A survey on machine learning approaches for modelling intuitive physics. arXiv preprint arXiv:2202.06481, 2022
-
[27]
Structure and content-guided video synthesis with diffusion models
Patrick Esser, Johnathan Chiu, Parmida Atighehchian, Jonathan Granskog, and Anastasis Ger- manidis. Structure and content-guided video synthesis with diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7346–7356, 2023
work page 2023
-
[28]
Scaling rectified flow transform- ers for high-resolution image synthesis
Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transform- ers for high-resolution image synthesis. In Forty-first International Conference on Machine Learning, 2024
work page 2024
-
[29]
Taming transformers for high-resolution image synthesis
Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers for high-resolution image synthesis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12873–12883, 2021
work page 2021
-
[30]
Yu Fang, Ziyin Qu, Minchen Li, Xinxin Zhang, Yixin Zhu, Mridul Aanjaneya, and Chenfanfu Jiang. Iq-mpm: an interface quadrature material point method for non-sticky strongly two-way coupled nonlinear solids and fluids. ACM Transactions on Graphics (TOG), 39(4):51–1, 2020
work page 2020
-
[31]
Datacomp: In search of the next generation of multimodal datasets
Samir Yitzhak Gadre, Gabriel Ilharco, Alex Fang, Jonathan Hayase, Georgios Smyrnis, Thao Nguyen, Ryan Marten, Mitchell Wortsman, Dhruba Ghosh, Jieyu Zhang, et al. Datacomp: In search of the next generation of multimodal datasets. Advances in Neural Information Processing Systems, 36, 2024
work page 2024
-
[32]
genmo. Genmo. Create videos and images with AI. — genmo.ai. https://www.genmo.ai/. 13
-
[33]
Maniskill2: A unified benchmark for generalizable manipulation skills
Jiayuan Gu, Fanbo Xiang, Xuanlin Li, Zhan Ling, Xiqiang Liu, Tongzhou Mu, Yihe Tang, Stone Tao, Xinyue Wei, Yunchao Yao, et al. Maniskill2: A unified benchmark for generalizable manipulation skills. arXiv preprint arXiv:2302.04659, 2023
-
[34]
A convex formulation of frictional contact between rigid and deformable bodies
Xuchen Han, Joseph Masterjohn, and Alejandro Castro. A convex formulation of frictional contact between rigid and deformable bodies. IEEE Robotics and Automation Letters, 2023
work page 2023
-
[35]
Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020
work page 2020
-
[36]
CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers
Wenyi Hong, Ming Ding, Wendi Zheng, Xinghan Liu, and Jie Tang. Cogvideo: Large-scale pretraining for text-to-video generation via transformers. arXiv preprint arXiv:2205.15868, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[37]
LoRA: Low-Rank Adaptation of Large Language Models
Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[38]
Dreamphysics: Learning physical properties of dynamic 3d gaussians with video diffusion priors
Tianyu Huang, Yihan Zeng, Hui Li, Wangmeng Zuo, and Rynson WH Lau. Dreamphysics: Learning physical properties of dynamic 3d gaussians with video diffusion priors. arXiv preprint arXiv:2406.01476, 2024
-
[39]
Plasticinelab: A soft-body manipulation benchmark with differentiable physics
Zhiao Huang, Yuanming Hu, Tao Du, Siyuan Zhou, Hao Su, Joshua B Tenenbaum, and Chuang Gan. Plasticinelab: A soft-body manipulation benchmark with differentiable physics. arXiv preprint arXiv:2104.03311, 2021
-
[40]
Vbench: Comprehensive benchmark suite for video generative models
Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, et al. Vbench: Comprehensive benchmark suite for video generative models. arXiv preprint arXiv:2311.17982, 2023
-
[41]
EulerDiscreteScheduler — huggingface.co
huggingfaceEulerDiscreteScheduler. EulerDiscreteScheduler — huggingface.co. https: //huggingface.co/docs/diffusers/en/api/schedulers/euler
-
[42]
Text2video-zero: Text-to-image diffusion mod- els are zero-shot video generators
Levon Khachatryan, Andranik Movsisyan, Vahram Tadevosyan, Roberto Henschel, Zhangyang Wang, Shant Navasardyan, and Humphrey Shi. Text2video-zero: Text-to-image diffusion mod- els are zero-shot video generators. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15954–15964, 2023
work page 2023
-
[43]
Adam: A Method for Stochastic Optimization
Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014
work page internal anchor Pith review Pith/arXiv arXiv 2014
-
[44]
Drucker-prager elastoplasticity for sand animation
Gergely Klár, Theodore Gast, Andre Pradhana, Chuyuan Fu, Craig Schroeder, Chenfanfu Jiang, and Joseph Teran. Drucker-prager elastoplasticity for sand animation. ACM Transactions on Graphics (TOG), 35(4):1–12, 2016
work page 2016
-
[45]
KlingAI. KLING AI — klingai.com. https://www.klingai.com/, 2024
work page 2024
-
[46]
VideoPoet: A Large Language Model for Zero-Shot Video Generation
Dan Kondratyuk, Lijun Yu, Xiuye Gu, José Lezama, Jonathan Huang, Rachel Hornung, Hartwig Adam, Hassan Akbari, Yair Alon, Vighnesh Birodkar, et al. Videopoet: A large language model for zero-shot video generation. arXiv preprint arXiv:2312.14125, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[47]
Smoothed particle hydrodynamics techniques for the physics based simulation of fluids and solids
Dan Koschier, Jan Bender, Barbara Solenthaler, and Matthias Teschner. Smoothed particle hydrodynamics techniques for the physics based simulation of fluids and solids. arXiv preprint arXiv:2009.06944, 2020
-
[48]
Subjective-aligned dateset and metric for text-to-video quality assessment
Tengchuan Kou, Xiaohong Liu, Zicheng Zhang, Chunyi Li, Haoning Wu, Xiongkuo Min, Guangtao Zhai, and Ning Liu. Subjective-aligned dateset and metric for text-to-video quality assessment. arXiv preprint arXiv:2403.11956, 2024
-
[49]
Viescore: Towards explain- able metrics for conditional image synthesis evaluation
Max Ku, Dongfu Jiang, Cong Wei, Xiang Yue, and Wenhu Chen. Viescore: Towards explain- able metrics for conditional image synthesis evaluation. arXiv preprint arXiv:2312.14867, 2023. 14
-
[50]
LaionAI. GitHub - LAION-AI/aesthetic-predictor: A linear estimator on top of clip to predict the aesthetic quality of pictures — github.com. https://github.com/LAION-AI/ aesthetic-predictor, 2022
work page 2022
-
[51]
Variational stokes: a unified pressure- viscosity solver for accurate viscous liquids
Egor Larionov, Christopher Batty, and Robert Bridson. Variational stokes: a unified pressure- viscosity solver for accurate viscous liquids. ACM Transactions on Graphics (TOG), 36(4):1– 11, 2017
work page 2017
-
[52]
Aligning Text-to-Image Models using Human Feedback
Kimin Lee, Hao Liu, Moonkyung Ryu, Olivia Watkins, Yuqing Du, Craig Boutilier, Pieter Abbeel, Mohammad Ghavamzadeh, and Shixiang Shane Gu. Aligning text-to-image models using human feedback. arXiv preprint arXiv:2302.12192, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[53]
James R Lewis and O˘guzhan Erdinç. User experience rating scales with 7, 11, or 101 points: does it matter? Journal of Usability Studies, 12(2), 2017
work page 2017
-
[54]
Incremental potential contact: intersection- and inversion-free, large-deformation dynamics
Minchen Li, Zachary Ferguson, Teseo Schneider, Timothy R Langlois, Denis Zorin, Daniele Panozzo, Chenfanfu Jiang, and Danny M Kaufman. Incremental potential contact: intersection- and inversion-free, large-deformation dynamics. ACM Trans. Graph., 39(4):49, 2020
work page 2020
-
[55]
Codimensional incremental potential contact
Minchen Li, Danny M Kaufman, and Chenfanfu Jiang. Codimensional incremental potential contact. arXiv preprint arXiv:2012.04457, 2020
-
[56]
Aligning diffusion models by optimizing human utility
Shufan Li, Konstantinos Kallidromitis, Akash Gokul, Yusuke Kato, and Kazuki Kozuka. Aligning diffusion models by optimizing human utility. arXiv preprint arXiv:2404.04465, 2024
-
[57]
Energetically consistent inelasticity for optimiza- tion time integration
Xuan Li, Minchen Li, and Chenfanfu Jiang. Energetically consistent inelasticity for optimiza- tion time integration. ACM Transactions on Graphics (TOG), 41(4):1–16, 2022
work page 2022
-
[58]
Gpu-accelerated robotic simulation for distributed reinforcement learning
Jacky Liang, Viktor Makoviychuk, Ankur Handa, Nuttapong Chentanez, Miles Macklin, and Dieter Fox. Gpu-accelerated robotic simulation for distributed reinforcement learning. In Conference on Robot Learning, pages 270–282. PMLR, 2018
work page 2018
-
[59]
Evaluating text-to-visual generation with image-to-text generation
Zhiqiu Lin, Deepak Pathak, Baiqi Li, Jiayao Li, Xide Xia, Graham Neubig, Pengchuan Zhang, and Deva Ramanan. Evaluating text-to-visual generation with image-to-text generation. arXiv preprint arXiv:2404.01291, 2024
-
[60]
Physics3d: Learning physical properties of 3d gaussians via video diffusion
Fangfu Liu, Hanyang Wang, Shunyu Yao, Shengjun Zhang, Jie Zhou, and Yueqi Duan. Physics3d: Learning physical properties of 3d gaussians via video diffusion. arXiv preprint arXiv:2406.04338, 2024
-
[61]
Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in neural information processing systems, 36, 2024
work page 2024
-
[62]
Physgen: Rigid-body physics-grounded image-to-video generation
Shaowei Liu, Zhongzheng Ren, Saurabh Gupta, and Shenlong Wang. Physgen: Rigid-body physics-grounded image-to-video generation
-
[63]
Evalcrafter: Benchmarking and evaluating large video generation models
Yaofang Liu, Xiaodong Cun, Xuebo Liu, Xintao Wang, Yong Zhang, Haoxin Chen, Yang Liu, Tieyong Zeng, Raymond Chan, and Ying Shan. Evalcrafter: Benchmarking and evaluating large video generation models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22139–22149, 2024
-
[64]
Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models
Yixin Liu, Kai Zhang, Yuan Li, Zhiling Yan, Chujie Gao, Ruoxi Chen, Zhengqing Yuan, Yue Huang, Hanchi Sun, Jianfeng Gao, et al. Sora: A review on background, technology, limitations, and opportunities of large vision models. arXiv preprint arXiv:2402.17177, 2024
-
[65]
Effect of the number of response categories on the reliability and validity of rating scales
Luis M Lozano, Eduardo García-Cueto, and José Muñiz. Effect of the number of response categories on the reliability and validity of rating scales. Methodology, 4(2):73–79, 2008
-
[66]
DPM-Solver++: Fast Solver for Guided Sampling of Diffusion Probabilistic Models
Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. DPM-Solver++: Fast solver for guided sampling of diffusion probabilistic models. arXiv preprint arXiv:2211.01095, 2022
-
[67]
Physically-aware generative network for 3d shape modeling
Mariem Mezghanni, Malika Boulkenafed, Andre Lieutier, and Maks Ovsjanikov. Physically-aware generative network for 3d shape modeling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9330–9341, 2021
-
[68]
mplugowl. mplug-owl-video. https://github.com/X-PLUG/mPLUG-Owl/tree/main/mPLUG-Owl/mplug_owl_video
-
[69]
Particle-based fluid-fluid interaction
Matthias Müller, Barbara Solenthaler, Richard Keiser, and Markus Gross. Particle-based fluid-fluid interaction. In Proceedings of the 2005 ACM SIGGRAPH/Eurographics symposium on Computer animation, pages 237–244, 2005
-
[70]
Phyrecon: Physically plausible neural scene reconstruction
Junfeng Ni, Yixin Chen, Bohan Jing, Nan Jiang, Bin Wang, Bo Dai, Yixin Zhu, Song-Chun Zhu, and Siyuan Huang. Phyrecon: Physically plausible neural scene reconstruction. arXiv preprint arXiv:2404.16666, 2024
-
[71]
Improved denoising diffusion probabilistic models
Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. In International conference on machine learning, pages 8162–8171. PMLR, 2021
-
[72]
Graphical modeling and animation of ductile fracture
James F O'Brien, Adam W Bargteil, and Jessica K Hodgins. Graphical modeling and animation of ductile fracture. In Proceedings of the 29th annual conference on Computer graphics and interactive techniques, pages 291–294, 2002
-
[73]
OpenAI. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023
-
[74]
Gpt-4v(ision) system card
OpenAI. Gpt-4v(ision) system card. https://openai.com/research/gpt-4v-system-card, 2023
-
[75]
OpenSora. Open-Sora: Democratizing Efficient Video Production for All. https://github.com/hpcaitech/Open-Sora, 2024
-
[76]
Vibe-eval: A hard evaluation suite for measuring progress of multimodal language models
Piotr Padlewski, Max Bain, Matthew Henderson, Zhongkai Zhu, Nishant Relan, Hai Pham, Donovan Ong, Kaloyan Aleksiev, Aitor Ormazabal, Samuel Phua, et al. Vibe-eval: A hard evaluation suite for measuring progress of multimodal language models. arXiv preprint arXiv:2405.02287, 2024
-
[77]
Scalable diffusion models with transformers
William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4195–4205, 2023
- [78]
-
[79]
Intuitive physics learning in a deep-learning model inspired by developmental psychology
Luis S Piloto, Ari Weinstein, Peter Battaglia, and Matthew Botvinick. Intuitive physics learning in a deep-learning model inspired by developmental psychology. Nature human behaviour, 6(9):1257–1267, 2022
-
[80]
SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis
Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023
-
[81]
Power plastics: A hybrid lagrangian/eulerian solver for mesoscale inelastic flows
Ziyin Qu, Minchen Li, Yin Yang, Chenfanfu Jiang, and Fernando De Goes. Power plastics: A hybrid lagrangian/eulerian solver for mesoscale inelastic flows. ACM Transactions on Graphics (TOG), 42(6):1–11, 2023