VideoPhy: Evaluating Physical Commonsense for Video Generation

Aditya Grover; Chenfanfu Jiang; Hritik Bansal; Kai-Wei Chang; Michal Yarom; Tianyi Xie; Yizhou Sun; Yonatan Bitton; Zeshun Zong; Zongyu Lin

arXiv: 2406.03520 · v2 · pith:QZHGDGY5new · submitted 2024-06-05 · 💻 cs.CV · cs.AI · cs.LG

Hritik Bansal, Zongyu Lin, Tianyi Xie, Zeshun Zong, Michal Yarom, Yonatan Bitton, Chenfanfu Jiang, Yizhou Sun, Kai-Wei Chang, Aditya Grover

Pith reviewed 2026-05-20 11:30 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.LG

keywords text-to-video generation · physical commonsense · benchmark · video evaluation · generative models · human evaluation

The pith

Text-to-video models generate videos that follow both captions and physical laws in fewer than 40 percent of cases.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces VideoPhy, a benchmark of text prompts that describe everyday physical interactions between solids, fluids, and other materials. Researchers generate videos from multiple current models and ask human evaluators to check whether each video matches the prompt and also obeys real-world physics such as rolling, pouring, or breaking. Even the strongest model tested, CogVideoX-5B, succeeds on both criteria only 39.6 percent of the time. The results indicate that present generators are still far from functioning as accurate simulators of the physical world. The authors also release an automated scorer, VideoCon-Physics, to evaluate future models more quickly.

Core claim

VideoPhy shows that existing text-to-video generative models frequently fail to follow the given text prompts and also lack physical commonsense: the best model satisfies both criteria on only 39.6 percent of instances.

What carries the argument

The VideoPhy benchmark, which supplies diverse prompts involving material-type interactions and measures success via human judgment of caption adherence plus physical-law compliance.
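
To make the joint criterion concrete, here is a minimal Python sketch that counts a video as a success only when both human judgments are positive. The `Rating` schema, its field names, and the toy labels are assumptions made for illustration; they are not the paper's data format or results.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Rating:
    """Aggregated human judgment for one generated video (hypothetical schema)."""
    follows_caption: bool   # does the video match the text prompt?
    follows_physics: bool   # does the video obey physical commonsense?


def joint_success_rate(ratings: Sequence[Rating]) -> float:
    """Fraction of videos judged positive on both criteria."""
    if not ratings:
        raise ValueError("ratings must be non-empty")
    hits = sum(1 for r in ratings if r.follows_caption and r.follows_physics)
    return hits / len(ratings)


# Toy labels, invented for this example only.
toy_ratings = [
    Rating(True, True),
    Rating(True, False),
    Rating(False, True),
    Rating(False, False),
    Rating(True, True),
]
print(joint_success_rate(toy_ratings))  # 2 of 5 videos pass both checks
```

Because both conditions must hold, the joint rate can never exceed either single-criterion rate, which is why a headline number of this kind is a strict test.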

If this is right

Video generative models remain far from accurately simulating the physical world.
Progress on future models can be tracked with the released VideoPhy prompts and protocol.
The automated VideoCon-Physics evaluator can be applied to newly released models without repeated human studies.
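
Before an automated scorer replaces human raters, its scores are usually checked against human labels. The sketch below computes ROC-AUC by pairwise comparison, which is one common choice for that check. The function name and the toy scores are hypothetical, and this page does not state which agreement metric the authors report for VideoCon-Physics.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random positive outscores a random negative.

    labels: 1 if humans judged the video acceptable, else 0.
    scores: the automated scorer's output for the same videos.
    Ties between a positive and a negative count as half a win.
    """
    if len(labels) != len(scores):
        raise ValueError("labels and scores must have the same length")
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy human labels and scorer outputs, invented for this example only.
human_labels = [1, 0, 1, 0, 1, 0]
auto_scores = [0.9, 0.2, 0.7, 0.6, 0.4, 0.4]
print(round(roc_auc(human_labels, auto_scores), 3))
```

A value near 0.5 would mean the scorer ranks videos no better than chance, while a value near 1.0 would mean it orders them much as the human raters do.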

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Improved performance on VideoPhy could make generated videos more usable for planning tasks that require realistic motion.
Weak results on fluid-solid interactions may indicate specific gaps that targeted training data or loss terms could address.
The same curation approach could be extended to create benchmarks for other forms of commonsense such as object permanence or causal chains.

Load-bearing premise

Human evaluators can reliably and consistently judge whether a generated video follows physical commonsense for the curated prompts.

What would settle it

A new model that produces videos judged by humans to follow both the prompt and physical laws on more than 70 percent of VideoPhy instances would weaken the claim that current generators lack physical commonsense.

Original abstract

Recent advances in internet-scale video data pretraining have led to the development of text-to-video generative models that can create high-quality videos across a broad range of visual concepts, synthesize realistic motions and render complex objects. Hence, these generative models have the potential to become general-purpose simulators of the physical world. However, it is unclear how far we are from this goal with the existing text-to-video generative models. To this end, we present VideoPhy, a benchmark designed to assess whether the generated videos follow physical commonsense for real-world activities (e.g. marbles will roll down when placed on a slanted surface). Specifically, we curate diverse prompts that involve interactions between various material types in the physical world (e.g., solid-solid, solid-fluid, fluid-fluid). We then generate videos conditioned on these captions from diverse state-of-the-art text-to-video generative models, including open models (e.g., CogVideoX) and closed models (e.g., Lumiere, Dream Machine). Our human evaluation reveals that the existing models severely lack the ability to generate videos adhering to the given text prompts, while also lack physical commonsense. Specifically, the best performing model, CogVideoX-5B, generates videos that adhere to the caption and physical laws for 39.6% of the instances. VideoPhy thus highlights that the video generative models are far from accurately simulating the physical world. Finally, we propose an auto-evaluator, VideoCon-Physics, to assess the performance reliably for the newly released models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

VideoPhy gives a useful new benchmark showing that the best video model satisfies both the caption and physical commonsense in only 39.6% of human-rated cases, but the lack of agreement metrics leaves the headline number somewhat soft.

The main point is that current text-to-video models still fall short on basic physical interactions. The authors built VideoPhy with prompts covering solid-solid, solid-fluid, and fluid-fluid cases, ran several models including CogVideoX-5B and closed systems like Lumiere, and found the best one succeeds on both caption and physics in only 39.6% of cases by human judgment. They also release an auto-evaluator called VideoCon-Physics to help with scaling checks later. That combination of focused prompts and a practical tool is the concrete addition here.

Referee Report

2 major / 3 minor

Summary. The paper introduces VideoPhy, a benchmark for assessing physical commonsense in text-to-video generative models. It curates prompts involving material interactions (solid-solid, solid-fluid, fluid-fluid), generates videos from open and closed SOTA models (e.g., CogVideoX-5B, Lumiere), and reports human evaluation results showing that even the best model adheres to both the caption and physical laws in only 39.6% of cases. The work also proposes an automatic evaluator, VideoCon-Physics, for scalable assessment of future models.

Significance. If the human evaluation results hold, this benchmark provides concrete evidence of a substantial gap in current video generation models' ability to simulate real-world physics, which is important for their potential use as general-purpose simulators. The direct use of human judgments on curated physical interactions supplies falsifiable, model-agnostic evidence rather than relying on self-referential metrics. The proposal of VideoCon-Physics is a constructive addition for reproducibility and future work. The evaluation across both open and closed models and the focus on diverse material-type interactions are particular strengths.

major comments (2)

[Human Evaluation] Human evaluation protocol: the central 39.6% figure for CogVideoX-5B (and all other reported percentages) is presented without inter-annotator agreement statistics (e.g., Fleiss' kappa or pairwise agreement) or error bars on the physical-laws label. Because the claim that models 'severely lack' physical commonsense rests directly on these human judgments, the absence of agreement data makes it difficult to separate model failure from annotator variance.
[Benchmark Construction] Prompt curation and validation: the description of how prompts were selected and verified to test genuine physical commonsense (rather than ambiguous or underspecified cases) remains high-level. More detail on the curation process, including any expert review or pilot testing for physical accuracy, would be needed to establish that the benchmark instances are load-bearing tests of the claimed capability gap.
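
For reference, the agreement statistic named in the first major comment can be computed as in the following sketch of Fleiss' kappa. The rating matrix is invented, and the choice of three raters per video is an assumption for the example, not a detail taken from the paper.

```python
from typing import Sequence


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """Fleiss' kappa for a fixed number of raters per item.

    counts[i][j] is the number of raters who assigned item i to category j.
    Every row must sum to the same number of raters n, with n >= 2.
    """
    num_items = len(counts)
    if num_items == 0:
        raise ValueError("counts must be non-empty")
    n = sum(counts[0])
    if n < 2 or any(sum(row) != n for row in counts):
        raise ValueError("each item needs the same number of raters (>= 2)")
    num_cats = len(counts[0])

    # Share of all ratings that fall into each category.
    p_cat = [sum(row[j] for row in counts) / (num_items * n)
             for j in range(num_cats)]
    # Observed agreement per item, then averaged over items.
    p_item = [(sum(c * c for c in row) - n) / (n * (n - 1)) for row in counts]
    p_observed = sum(p_item) / num_items
    # Agreement expected by chance.
    p_expected = sum(p * p for p in p_cat)
    if p_expected == 1.0:
        raise ValueError("kappa is undefined when only one category is used")
    return (p_observed - p_expected) / (1.0 - p_expected)


# Toy data: 4 videos, 3 hypothetical raters, columns = [violates, obeys].
toy_counts = [[3, 0], [0, 3], [2, 1], [1, 2]]
print(round(fleiss_kappa(toy_counts), 3))
```

Kappa corrects raw agreement for the agreement expected by chance, so it is more informative than pairwise agreement when one label dominates.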

minor comments (3)

[Abstract] The abstract and results sections use the phrase 'severely lack' for the 39.6% figure; a more precise statement of the quantitative gap would improve tone and clarity.
[Results] A table or figure presenting a per-category breakdown (solid-solid vs. fluid-fluid, etc.) would help readers assess whether failures are uniform or concentrated in particular interaction types.
[Auto-Evaluator] The auto-evaluator VideoCon-Physics is introduced but its correlation with human judgments and any ablation on its training data are not detailed enough for independent reproduction.
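
The per-category breakdown requested in the [Results] comment amounts to grouping the joint judgments by interaction type. A minimal sketch follows; the record layout and the toy rows are assumptions for illustration and do not reproduce any numbers from the paper.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# (interaction type, follows caption, follows physics): hypothetical layout
Record = Tuple[str, bool, bool]


def per_category_joint_rate(records: Iterable[Record]) -> Dict[str, float]:
    """Joint success rate (caption AND physics) for each interaction type."""
    totals: Dict[str, int] = defaultdict(int)
    hits: Dict[str, int] = defaultdict(int)
    for category, caption_ok, physics_ok in records:
        totals[category] += 1
        if caption_ok and physics_ok:
            hits[category] += 1
    return {category: hits[category] / totals[category] for category in totals}


# Toy rows, invented for this example only.
toy_records = [
    ("solid-solid", True, True),
    ("solid-solid", True, False),
    ("solid-fluid", True, True),
    ("solid-fluid", False, False),
    ("solid-fluid", True, True),
    ("fluid-fluid", False, True),
]
for category, rate in per_category_joint_rate(toy_records).items():
    print(f"{category}: {rate:.2f}")
```

Reporting the counts behind each rate alongside the percentages would also show readers how many prompts each category contains.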

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the constructive feedback on our paper. We address the major comments below and plan to incorporate revisions to improve the clarity and rigor of our human evaluation and benchmark construction sections.

Referee: [Human Evaluation] Human evaluation protocol: the central 39.6% figure for CogVideoX-5B (and all other reported percentages) is presented without inter-annotator agreement statistics (e.g., Fleiss' kappa or pairwise agreement) or error bars on the physical-laws label. Because the claim that models 'severely lack' physical commonsense rests directly on these human judgments, the absence of agreement data makes it difficult to separate model failure from annotator variance.

Authors: We agree that reporting inter-annotator agreement is important for validating the reliability of our human evaluation results. In the revised manuscript, we will include Fleiss' kappa scores for the annotations on physical adherence and caption adherence. Additionally, we will provide error bars or confidence intervals for the reported percentages to better quantify the variability in the human judgments. This will help demonstrate that the observed low performance is indeed due to model limitations rather than annotator disagreement. revision: yes
Referee: [Benchmark Construction] Prompt curation and validation: the description of how prompts were selected and verified to test genuine physical commonsense (rather than ambiguous or underspecified cases) remains high-level. More detail on the curation process, including any expert review or pilot testing for physical accuracy, would be needed to establish that the benchmark instances are load-bearing tests of the claimed capability gap.

Authors: We thank the referee for this suggestion. In the original manuscript, we provided a high-level overview of the prompt curation to maintain focus on the evaluation results. However, we acknowledge that additional details would enhance the reproducibility and credibility of the benchmark. In the revised version, we will expand the section on benchmark construction to include more specifics on the prompt selection criteria, the process of verifying physical accuracy through pilot studies, and any expert consultations or reviews conducted to ensure the prompts test genuine physical commonsense without ambiguity. revision: yes
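
The confidence intervals promised in the first response could be computed in several ways. The sketch below uses the Wilson score interval for a binomial proportion, which is a common choice for success rates. The sample size in the example is a placeholder, since this page does not state the number of rated videos, and treating each video as an independent trial is an assumption.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a proportion (z = 1.96 gives roughly 95%)."""
    if total <= 0 or not 0 <= successes <= total:
        raise ValueError("need total > 0 and 0 <= successes <= total")
    p_hat = successes / total
    z2 = z * z
    denom = 1.0 + z2 / total
    center = (p_hat + z2 / (2.0 * total)) / denom
    half_width = (z / denom) * math.sqrt(
        p_hat * (1.0 - p_hat) / total + z2 / (4.0 * total * total)
    )
    return center - half_width, center + half_width


# Placeholder counts: 396 joint successes out of 1000 rated videos.
low, high = wilson_interval(396, 1000)
print(f"{low:.3f} to {high:.3f}")
```

The interval narrows as the number of rated videos grows, so reporting the sample size next to each percentage lets readers judge how firm the estimate is.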

Circularity Check

0 steps flagged

No circularity in benchmark evaluation or auto-evaluator proposal

The paper curates a set of text prompts involving physical interactions across material types and evaluates outputs from existing text-to-video models via human judgment on adherence to both captions and physical laws. No equations, parameter fitting, or first-principles derivations are claimed; the 39.6% figure for CogVideoX-5B is a direct empirical count from external model generations and annotator labels. The proposed VideoCon-Physics auto-evaluator is introduced as a new tool without reducing to any self-citation chain or redefinition of inputs. All load-bearing steps rely on independent human evaluation protocols and publicly available generative models rather than internal consistency loops.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that the curated prompts adequately sample physical commonsense and that human raters apply consistent criteria; no free parameters are fitted, no new entities are postulated, and axioms are standard assumptions about physical laws and evaluation validity.

axioms (1)

domain assumption: Human raters can accurately detect violations of physical commonsense in short video clips.
Invoked in the human evaluation section to interpret the 39.6% adherence rate.

Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

PhyGround: Benchmarking Physical Reasoning in Generative World Models
cs.CV 2026-05 accept novelty 7.0

PhyGround is a new benchmark with curated prompts, a 13-law taxonomy, large-scale human annotations, and an open physics-specialized VLM judge for evaluating physical reasoning in generative video models.
Do Joint Audio-Video Generation Models Understand Physics?
cs.SD 2026-05 unverdicted novelty 7.0

Current joint audio-video generation models lack robust physical commonsense, especially during transitions and when prompted for impossible behaviors.
CMTA: Leveraging Cross-Modal Temporal Artifacts for Generalizable AI-Generated Video Detection
cs.CV 2026-05 unverdicted novelty 7.0

CMTA detects AI-generated videos by capturing unnatural temporal stability in visual-textual semantic alignment via joint embeddings and multi-grained temporal modeling, outperforming prior methods in cross-generator tests.
MoRight: Motion Control Done Right
cs.CV 2026-04 unverdicted novelty 7.0

MoRight disentangles object and camera motion via canonical-view specification and temporal cross-view attention, while decomposing motion into active user-driven and passive consequence components to learn and apply ...
PlayWorld: Learning Robot World Models from Autonomous Play
cs.RO 2026-03 unverdicted novelty 7.0

PlayWorld learns high-fidelity robot world models from unsupervised self-play, producing physically consistent video predictions that outperform models trained on human data and enabling 65% better real-world policy p...
VideoASMR-Bench: Can AI-Generated ASMR Videos Fool VLMs and Humans?
cs.CV 2025-12 unverdicted novelty 7.0

VideoASMR-Bench shows state-of-the-art VLMs fail to reliably detect AI-generated ASMR videos from real ones, though humans can still identify the fakes relatively easily.
DreamGen: Unlocking Generalization in Robot Learning through Video World Models
cs.RO 2025-05 unverdicted novelty 7.0

DreamGen trains robot policies on synthetic trajectories from adapted video world models, enabling a humanoid robot to perform 22 new behaviors in seen and unseen environments from a single pick-and-place teleoperatio...
NEWTON: Agentic Planning for Physically Grounded Video Generation
cs.CV 2026-05 unverdicted novelty 6.0

NEWTON improves physical accuracy in video generation by deploying a trainable planner that coordinates physics-aware tools and a verifier, raising joint accuracy on VideoPhy-2 without altering the base generators.
Enhancing Train-Free Infinite-Frame Generation for Consistent Long Videos
cs.CV 2026-05 unverdicted novelty 6.0

MIGA introduces two-stage alignment to close train-inference gaps and dual consistency enhancement via self-reflection and long-range guidance to achieve SOTA temporal consistency in infinite-frame video generation on...
PanoWorld: Geometry-Consistent Panoramic Video World Modeling
cs.CV 2026-05 unverdicted novelty 6.0

PanoWorld adds depth consistency and trajectory consistency losses plus spherical adaptations to a pre-trained video model, plus a new PanoGeo dataset, to produce geometry-consistent 360 video.
Quantitative Video World Model Evaluation for Geometric-Consistency
cs.CV 2026-05 unverdicted novelty 6.0

PDI-Bench computes 3D projective residuals from segmented and tracked points to quantify geometric inconsistency in AI-generated videos.
How Far Are Video Models from True Multimodal Reasoning?
cs.CV 2026-04 unverdicted novelty 6.0

Current video models succeed on basic understanding but achieve under 25% success on logically grounded generation and near 0% on interactive generation, exposing gaps in multimodal reasoning.
ATSS: Detecting AI-Generated Videos via Anomalous Temporal Self-Similarity
cs.CV 2026-04 unverdicted novelty 6.0

ATSS detects AI-generated videos by measuring unnatural repetitive temporal correlations in triple similarity matrices derived from frame visuals and semantic descriptions.
RAPO++: Cross-Stage Prompt Optimization for Text-to-Video Generation via Data Alignment and Test-Time Scaling
cs.CV 2025-10 unverdicted novelty 6.0

RAPO++ is a three-stage prompt optimization framework combining retrieval-augmented refinement, closed-loop test-time scaling, and LLM fine-tuning to enhance text-to-video generation quality.
Robotic Manipulation by Imitating Generated Videos Without Physical Demonstrations
cs.RO 2025-07 unverdicted novelty 6.0

RIGVid shows that filtered AI-generated videos can serve as effective supervision for complex robotic manipulation tasks without any real demonstrations.
MAGI-1: Autoregressive Video Generation at Scale
cs.CV 2025-05 unverdicted novelty 6.0

MAGI-1 is a 24B-parameter autoregressive video world model that predicts denoised frame chunks sequentially with increasing noise to enable causal, scalable, streaming generation up to 4M token contexts.
Towards World Simulator: Crafting Physical Commonsense-Based Benchmark for Video Generation
cs.CV 2024-10 unverdicted novelty 6.0

PhyGenBench supplies 160 prompts across 27 physical laws and an automated LLM/VLM evaluation pipeline to measure physical commonsense compliance in current text-to-video models.
Actionable World Representation
cs.AI 2026-05 unverdicted novelty 4.0

WorldString is a fully differentiable neural model for representing actionable object states learned from 3D sensor data.
World Action Models: The Next Frontier in Embodied AI
cs.RO 2026-05 unverdicted novelty 4.0

The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.
World Simulation with Video Foundation Models for Physical AI
cs.CV 2025-10 unverdicted novelty 4.0

Cosmos-Predict2.5 unifies text-to-world, image-to-world, and video-to-world generation in one model trained on 200M clips with RL post-training, delivering improved quality and control for physical AI.

Reference graph

Works this paper leans on

135 extracted references · 135 canonical work pages · cited by 20 Pith papers · 20 internal anchors

[1]

Luma Dream Machine | AI Video Generator — lumalabs.ai

Luma AI. Luma Dream Machine | AI Video Generator — lumalabs.ai. https://lumalabs. ai/dream-machine, 2024

work page 2024
[2]

Frozen in time: A joint video and image encoder for end-to-end retrieval

Max Bain, Arsha Nagrani, Gül Varol, and Andrew Zisserman. Frozen in time: A joint video and image encoder for end-to-end retrieval. In IEEE International Conference on Computer Vision, 2021

work page 2021
[3]

Videocon: Robust video-language alignment via contrast captions

Hritik Bansal, Yonatan Bitton, Idan Szpektor, Kai-Wei Chang, and Aditya Grover. Videocon: Robust video-language alignment via contrast captions. arXiv preprint arXiv:2311.10111, 2023

work page arXiv 2023
[4]

Talc: Time-aligned captions for multi-scene text-to-video generation

Hritik Bansal, Yonatan Bitton, Michal Yarom, Idan Szpektor, Aditya Grover, and Kai-Wei Chang. Talc: Time-aligned captions for multi-scene text-to-video generation. arXiv preprint arXiv:2405.04682, 2024

work page arXiv 2024
[5]

Comparing bad apples to good oranges: Aligning large language models via joint preference optimization

Hritik Bansal, Ashima Suvarna, Gantavya Bhatt, Nanyun Peng, Kai-Wei Chang, and Aditya Grover. Comparing bad apples to good oranges: Aligning large language models via joint preference optimization. arXiv preprint arXiv:2404.00530, 2024

work page arXiv 2024
[6]

How well can text-to- image generative models understand ethical natural language interventions? arXiv preprint arXiv:2210.15230, 2022

Hritik Bansal, Da Yin, Masoud Monajatipoor, and Kai-Wei Chang. How well can text-to- image generative models understand ethical natural language interventions? arXiv preprint arXiv:2210.15230, 2022

work page arXiv 2022
[7]

Lumiere: A space-time diffusion model for video generation

Omer Bar-Tal, Hila Chefer, Omer Tov, Charles Herrmann, Roni Paiss, Shiran Zada, Ariel Ephrat, Junhwa Hur, Yuanzhen Li, Tomer Michaeli, et al. Lumiere: A space-time diffusion model for video generation. arXiv preprint arXiv:2401.12945, 2024

work page arXiv 2024
[8]

An introduction to physically based modeling: rigid body simulation i—unconstrained rigid body dynamics

David Baraff. An introduction to physically based modeling: rigid body simulation i—unconstrained rigid body dynamics. SIGGRAPH course notes, 82, 1997

work page 1997
[9]

A fast variational framework for accurate solid-fluid coupling

Christopher Batty, Florence Bertails, and Robert Bridson. A fast variational framework for accurate solid-fluid coupling. ACM Transactions on Graphics (TOG), 26(3):100–es, 2007

work page 2007
[10]

Piqa: Reasoning about physical commonsense in natural language

Yonatan Bisk, Rowan Zellers, Jianfeng Gao, Yejin Choi, et al. Piqa: Reasoning about physical commonsense in natural language. In Proceedings of the AAAI conference on artificial intelligence, volume 34, pages 7432–7439, 2020

work page 2020
[11]

Visit-bench: A benchmark for vision-language instruction following inspired by real-world use

Yonatan Bitton, Hritik Bansal, Jack Hessel, Rulin Shao, Wanrong Zhu, Anas Awadalla, Josh Gardner, Rohan Taori, and Ludwig Schmidt. Visit-bench: A benchmark for vision-language instruction following inspired by real-world use. arXiv preprint arXiv:2308.06595, 2023

work page arXiv 2023
[13]

Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Do- minik Lorenz, Yam Levi, Zion English, Vikram V oleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[14]

Align your latents: High-resolution video synthesis with latent diffusion models

Andreas Blattmann, Robin Rombach, Huan Ling, Tim Dockhorn, Seung Wook Kim, Sanja Fidler, and Karsten Kreis. Align your latents: High-resolution video synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22563–22575, 2023

work page 2023
[15]

Fluid simulation for computer graphics

Robert Bridson. Fluid simulation for computer graphics. AK Peters/CRC Press, 2015

work page 2015
[16]

Generating long videos of dynamic scenes

Tim Brooks, Janne Hellsten, Miika Aittala, Ting-Chun Wang, Timo Aila, Jaakko Lehtinen, Ming-Yu Liu, Alexei Efros, and Tero Karras. Generating long videos of dynamic scenes. Advances in Neural Information Processing Systems, 35:31769–31781, 2022. 12

work page 2022
[17]

Video generation models as world simulators

Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, Clarence Ng, Ricky Wang, and Aditya Ramesh. Video generation models as world simulators. 2024

work page 2024
[18]

Genie: Generative interactive environments

Jake Bruce, Michael Dennis, Ashley Edwards, Jack Parker-Holder, Yuge Shi, Edward Hughes, Matthew Lai, Aditi Mavalankar, Richie Steigerwald, Chris Apps, et al. Genie: Generative interactive environments. arXiv preprint arXiv:2402.15391, 2024

work page arXiv 2024
[19]

Storybench: A multifaceted benchmark for continuous story visualization

Emanuele Bugliarello, H Hernan Moraldo, Ruben Villegas, Mohammad Babaeizadeh, Moham- mad Taghi Saffar, Han Zhang, Dumitru Erhan, Vittorio Ferrari, Pieter-Jan Kindermans, and Paul V oigtlaender. Storybench: A multifaceted benchmark for continuous story visualization. Advances in Neural Information Processing Systems, 36, 2024

work page 2024
[20]

cerspense/zeroscope_v2_576w · Hugging Face — huggingface.co

cerspense. cerspense/zeroscope_v2_576w · Hugging Face — huggingface.co. https:// huggingface.co/cerspense/zeroscope_v2_576w, 2023

work page 2023
[21]

Videocrafter2: Overcoming data limitations for high-quality video diffusion models

Haoxin Chen, Yong Zhang, Xiaodong Cun, Menghan Xia, Xintao Wang, Chao Weng, and Ying Shan. Videocrafter2: Overcoming data limitations for high-quality video diffusion models. arXiv preprint arXiv:2401.09047, 2024

work page arXiv 2024
[22]

Physical simulation of environmentally induced thin shell deformation

Hsiao-Yu Chen, Arnav Sastry, Wim M van Rees, and Etienne V ouga. Physical simulation of environmentally induced thin shell deformation. ACM Transactions on Graphics (TOG), 37(4):1–13, 2018

work page 2018
[23]

Panda-70m: Captioning 70m videos with multiple cross-modality teachers

Tsai-Shien Chen, Aliaksandr Siarohin, Willi Menapace, Ekaterina Deyneka, Hsiang-wei Chao, Byung Eun Jeon, Yuwei Fang, Hsin-Ying Lee, Jian Ren, Ming-Hsuan Yang, et al. Panda-70m: Captioning 70m videos with multiple cross-modality teachers. arXiv preprint arXiv:2402.19479, 2024

work page arXiv 2024
[24]

Multi-layer thick shells

Yunuo Chen, Tianyi Xie, Cem Yuksel, Danny Kaufman, Yin Yang, Chenfanfu Jiang, and Minchen Li. Multi-layer thick shells. In ACM SIGGRAPH 2023 Conference Proceedings, pages 1–9, 2023

work page 2023
[25]

Learning universal policies via text-guided video generation

Yilun Du, Sherry Yang, Bo Dai, Hanjun Dai, Ofir Nachum, Josh Tenenbaum, Dale Schuurmans, and Pieter Abbeel. Learning universal policies via text-guided video generation. Advances in Neural Information Processing Systems, 36, 2024

work page 2024
[26]

A survey on machine learning approaches for modelling intuitive physics

Jiafei Duan, Arijit Dasgupta, Jason Fischer, and Cheston Tan. A survey on machine learning approaches for modelling intuitive physics. arXiv preprint arXiv:2202.06481, 2022

work page arXiv 2022
[27]

Structure and content-guided video synthesis with diffusion models

Patrick Esser, Johnathan Chiu, Parmida Atighehchian, Jonathan Granskog, and Anastasis Ger- manidis. Structure and content-guided video synthesis with diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7346–7356, 2023

work page 2023
[28]

Scaling rectified flow transform- ers for high-resolution image synthesis

Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transform- ers for high-resolution image synthesis. In Forty-first International Conference on Machine Learning, 2024

work page 2024
[29]

Taming transformers for high-resolution image synthesis

Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers for high-resolution image synthesis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12873–12883, 2021

work page 2021
[30]

Iq-mpm: an interface quadrature material point method for non-sticky strongly two-way coupled nonlinear solids and fluids

Yu Fang, Ziyin Qu, Minchen Li, Xinxin Zhang, Yixin Zhu, Mridul Aanjaneya, and Chenfanfu Jiang. Iq-mpm: an interface quadrature material point method for non-sticky strongly two-way coupled nonlinear solids and fluids. ACM Transactions on Graphics (TOG), 39(4):51–1, 2020

work page 2020
[31]

Datacomp: In search of the next generation of multimodal datasets

Samir Yitzhak Gadre, Gabriel Ilharco, Alex Fang, Jonathan Hayase, Georgios Smyrnis, Thao Nguyen, Ryan Marten, Mitchell Wortsman, Dhruba Ghosh, Jieyu Zhang, et al. Datacomp: In search of the next generation of multimodal datasets. Advances in Neural Information Processing Systems, 36, 2024

work page 2024
[32]

genmo. Genmo. Create videos and images with AI. — genmo.ai. https://www.genmo.ai/. 13

work page
[33]

Maniskill2: A unified benchmark for generalizable manipulation skills

Jiayuan Gu, Fanbo Xiang, Xuanlin Li, Zhan Ling, Xiqiang Liu, Tongzhou Mu, Yihe Tang, Stone Tao, Xinyue Wei, Yunchao Yao, et al. Maniskill2: A unified benchmark for generalizable manipulation skills. arXiv preprint arXiv:2302.04659, 2023

work page arXiv 2023
[34]

A convex formulation of frictional contact between rigid and deformable bodies

Xuchen Han, Joseph Masterjohn, and Alejandro Castro. A convex formulation of frictional contact between rigid and deformable bodies. IEEE Robotics and Automation Letters, 2023

work page 2023
[35]

Denoising diffusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020

work page 2020
[36]

CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers

Wenyi Hong, Ming Ding, Wendi Zheng, Xinghan Liu, and Jie Tang. Cogvideo: Large-scale pretraining for text-to-video generation via transformers. arXiv preprint arXiv:2205.15868, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[37]

LoRA: Low-Rank Adaptation of Large Language Models

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[38]

Dreamphysics: Learning physical properties of dynamic 3d gaussians with video diffusion priors

Tianyu Huang, Yihan Zeng, Hui Li, Wangmeng Zuo, and Rynson WH Lau. Dreamphysics: Learning physical properties of dynamic 3d gaussians with video diffusion priors. arXiv preprint arXiv:2406.01476, 2024

work page arXiv 2024
[39]

Plasticinelab: A soft-body manipulation benchmark with differentiable physics

Zhiao Huang, Yuanming Hu, Tao Du, Siyuan Zhou, Hao Su, Joshua B Tenenbaum, and Chuang Gan. Plasticinelab: A soft-body manipulation benchmark with differentiable physics. arXiv preprint arXiv:2104.03311, 2021

work page arXiv 2021
[40]

Vbench: Comprehensive benchmark suite for video generative models

Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, et al. Vbench: Comprehensive benchmark suite for video generative models. arXiv preprint arXiv:2311.17982, 2023

work page arXiv 2023
[41]

EulerDiscreteScheduler — huggingface.co

huggingfaceEulerDiscreteScheduler. EulerDiscreteScheduler — huggingface.co. https: //huggingface.co/docs/diffusers/en/api/schedulers/euler

work page
[42]

Text2video-zero: Text-to-image diffusion mod- els are zero-shot video generators

Levon Khachatryan, Andranik Movsisyan, Vahram Tadevosyan, Roberto Henschel, Zhangyang Wang, Shant Navasardyan, and Humphrey Shi. Text2video-zero: Text-to-image diffusion mod- els are zero-shot video generators. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15954–15964, 2023

work page 2023
[43]

Adam: A Method for Stochastic Optimization

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014

work page internal anchor Pith review Pith/arXiv arXiv 2014
[44]

Drucker-prager elastoplasticity for sand animation

Gergely Klár, Theodore Gast, Andre Pradhana, Chuyuan Fu, Craig Schroeder, Chenfanfu Jiang, and Joseph Teran. Drucker-prager elastoplasticity for sand animation. ACM Transactions on Graphics (TOG), 35(4):1–12, 2016

[45]

KLING AI — klingai.com

KlingAI. KLING AI — klingai.com. https://www.klingai.com/, 2024

[46]

VideoPoet: A Large Language Model for Zero-Shot Video Generation

Dan Kondratyuk, Lijun Yu, Xiuye Gu, José Lezama, Jonathan Huang, Rachel Hornung, Hartwig Adam, Hassan Akbari, Yair Alon, Vighnesh Birodkar, et al. Videopoet: A large language model for zero-shot video generation. arXiv preprint arXiv:2312.14125, 2023

[47]

Smoothed particle hydrodynamics techniques for the physics based simulation of fluids and solids

Dan Koschier, Jan Bender, Barbara Solenthaler, and Matthias Teschner. Smoothed particle hydrodynamics techniques for the physics based simulation of fluids and solids. arXiv preprint arXiv:2009.06944, 2020

[48]

Subjective-aligned dateset and metric for text-to-video quality assessment

Tengchuan Kou, Xiaohong Liu, Zicheng Zhang, Chunyi Li, Haoning Wu, Xiongkuo Min, Guangtao Zhai, and Ning Liu. Subjective-aligned dateset and metric for text-to-video quality assessment. arXiv preprint arXiv:2403.11956, 2024

[49]

Viescore: Towards explainable metrics for conditional image synthesis evaluation

Max Ku, Dongfu Jiang, Cong Wei, Xiang Yue, and Wenhu Chen. Viescore: Towards explainable metrics for conditional image synthesis evaluation. arXiv preprint arXiv:2312.14867, 2023

[50]

GitHub - LAION-AI/aesthetic-predictor: A linear estimator on top of clip to predict the aesthetic quality of pictures — github.com

LaionAI. GitHub - LAION-AI/aesthetic-predictor: A linear estimator on top of clip to predict the aesthetic quality of pictures — github.com. https://github.com/LAION-AI/aesthetic-predictor, 2022

[51]

Variational stokes: a unified pressure-viscosity solver for accurate viscous liquids

Egor Larionov, Christopher Batty, and Robert Bridson. Variational stokes: a unified pressure-viscosity solver for accurate viscous liquids. ACM Transactions on Graphics (TOG), 36(4):1–11, 2017

[52]

Aligning Text-to-Image Models using Human Feedback

Kimin Lee, Hao Liu, Moonkyung Ryu, Olivia Watkins, Yuqing Du, Craig Boutilier, Pieter Abbeel, Mohammad Ghavamzadeh, and Shixiang Shane Gu. Aligning text-to-image models using human feedback. arXiv preprint arXiv:2302.12192, 2023

[53]

User experience rating scales with 7, 11, or 101 points: does it matter?

James R Lewis and Oğuzhan Erdinç. User experience rating scales with 7, 11, or 101 points: does it matter? Journal of Usability Studies, 12(2), 2017

[54]

Incremental potential contact: intersection- and inversion-free, large-deformation dynamics

Minchen Li, Zachary Ferguson, Teseo Schneider, Timothy R Langlois, Denis Zorin, Daniele Panozzo, Chenfanfu Jiang, and Danny M Kaufman. Incremental potential contact: intersection- and inversion-free, large-deformation dynamics. ACM Trans. Graph., 39(4):49, 2020

[55]

Codimensional incremental potential contact

Minchen Li, Danny M Kaufman, and Chenfanfu Jiang. Codimensional incremental potential contact. arXiv preprint arXiv:2012.04457, 2020

[56]

Aligning diffusion models by optimizing human utility

Shufan Li, Konstantinos Kallidromitis, Akash Gokul, Yusuke Kato, and Kazuki Kozuka. Aligning diffusion models by optimizing human utility. arXiv preprint arXiv:2404.04465, 2024

[57]

Energetically consistent inelasticity for optimization time integration

Xuan Li, Minchen Li, and Chenfanfu Jiang. Energetically consistent inelasticity for optimization time integration. ACM Transactions on Graphics (TOG), 41(4):1–16, 2022

[58]

Gpu-accelerated robotic simulation for distributed reinforcement learning

Jacky Liang, Viktor Makoviychuk, Ankur Handa, Nuttapong Chentanez, Miles Macklin, and Dieter Fox. Gpu-accelerated robotic simulation for distributed reinforcement learning. In Conference on Robot Learning, pages 270–282. PMLR, 2018

[59]

Evaluating text-to-visual generation with image-to-text generation

Zhiqiu Lin, Deepak Pathak, Baiqi Li, Jiayao Li, Xide Xia, Graham Neubig, Pengchuan Zhang, and Deva Ramanan. Evaluating text-to-visual generation with image-to-text generation. arXiv preprint arXiv:2404.01291, 2024

[60]

Physics3d: Learning physical properties of 3d gaussians via video diffusion

Fangfu Liu, Hanyang Wang, Shunyu Yao, Shengjun Zhang, Jie Zhou, and Yueqi Duan. Physics3d: Learning physical properties of 3d gaussians via video diffusion. arXiv preprint arXiv:2406.04338, 2024

[61]

Visual instruction tuning

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in neural information processing systems, 36, 2024

[62]

Physgen: Rigid-body physics-grounded image-to-video generation

Shaowei Liu, Zhongzheng Ren, Saurabh Gupta, and Shenlong Wang. Physgen: Rigid-body physics-grounded image-to-video generation

[63]

Evalcrafter: Benchmarking and evaluating large video generation models

Yaofang Liu, Xiaodong Cun, Xuebo Liu, Xintao Wang, Yong Zhang, Haoxin Chen, Yang Liu, Tieyong Zeng, Raymond Chan, and Ying Shan. Evalcrafter: Benchmarking and evaluating large video generation models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22139–22149, 2024

[64]

Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models

Yixin Liu, Kai Zhang, Yuan Li, Zhiling Yan, Chujie Gao, Ruoxi Chen, Zhengqing Yuan, Yue Huang, Hanchi Sun, Jianfeng Gao, et al. Sora: A review on background, technology, limitations, and opportunities of large vision models. arXiv preprint arXiv:2402.17177, 2024

[65]

Effect of the number of response categories on the reliability and validity of rating scales

Luis M Lozano, Eduardo García-Cueto, and José Muñiz. Effect of the number of response categories on the reliability and validity of rating scales. Methodology, 4(2):73–79, 2008

[66]

DPM-Solver++: Fast Solver for Guided Sampling of Diffusion Probabilistic Models

Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. Dpm-solver++: Fast solver for guided sampling of diffusion probabilistic models. arXiv preprint arXiv:2211.01095, 2022

[67]

Physically-aware generative network for 3d shape modeling

Mariem Mezghanni, Malika Boulkenafed, Andre Lieutier, and Maks Ovsjanikov. Physically-aware generative network for 3d shape modeling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9330–9341, 2021

[68]

mplug-owl-video

mplugowl. mplug-owl-video. https://github.com/X-PLUG/mPLUG-Owl/tree/main/mPLUG-Owl/mplug_owl_video

[69]

Particle-based fluid-fluid interaction

Matthias Müller, Barbara Solenthaler, Richard Keiser, and Markus Gross. Particle-based fluid-fluid interaction. In Proceedings of the 2005 ACM SIGGRAPH/Eurographics symposium on Computer animation, pages 237–244, 2005

[70]

Phyrecon: Physically plausible neural scene reconstruction

Junfeng Ni, Yixin Chen, Bohan Jing, Nan Jiang, Bin Wang, Bo Dai, Yixin Zhu, Song-Chun Zhu, and Siyuan Huang. Phyrecon: Physically plausible neural scene reconstruction. arXiv preprint arXiv:2404.16666, 2024

[71]

Improved denoising diffusion probabilistic models

Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. In International conference on machine learning, pages 8162–8171. PMLR, 2021

[72]

Graphical modeling and animation of ductile fracture

James F O’Brien, Adam W Bargteil, and Jessica K Hodgins. Graphical modeling and animation of ductile fracture. In Proceedings of the 29th annual conference on Computer graphics and interactive techniques, pages 291–294, 2002

[73]

GPT-4 Technical Report

OpenAI. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023

[74]

Gpt-4v(ision) system card

OpenAI. Gpt-4v(ision) system card. https://openai.com/research/gpt-4v-system-card, 2023

[75]

GitHub - hpcaitech/Open-Sora: Open-Sora: Democratizing Efficient Video Production for All — github.com

OpenSora. GitHub - hpcaitech/Open-Sora: Open-Sora: Democratizing Efficient Video Production for All — github.com. https://github.com/hpcaitech/Open-Sora, 2024

[76]

Vibe-eval: A hard evaluation suite for measuring progress of multimodal language models

Piotr Padlewski, Max Bain, Matthew Henderson, Zhongkai Zhu, Nishant Relan, Hai Pham, Donovan Ong, Kaloyan Aleksiev, Aitor Ormazabal, Samuel Phua, et al. Vibe-eval: A hard evaluation suite for measuring progress of multimodal language models. arXiv preprint arXiv:2405.02287, 2024

[77]

Scalable diffusion models with transformers

William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4195–4205, 2023

[78]

Pika — pika.art

pika. Pika — pika.art. https://pika.art/

[79]

Intuitive physics learning in a deep-learning model inspired by developmental psychology

Luis S Piloto, Ari Weinstein, Peter Battaglia, and Matthew Botvinick. Intuitive physics learning in a deep-learning model inspired by developmental psychology. Nature human behaviour, 6(9):1257–1267, 2022

[80]

SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023

[81]

Power plastics: A hybrid lagrangian/eulerian solver for mesoscale inelastic flows

Ziyin Qu, Minchen Li, Yin Yang, Chenfanfu Jiang, and Fernando De Goes. Power plastics: A hybrid lagrangian/eulerian solver for mesoscale inelastic flows. ACM Transactions on Graphics (TOG), 42(6):1–11, 2023


[1]

Luma Dream Machine | AI Video Generator — lumalabs.ai

Luma AI. Luma Dream Machine | AI Video Generator — lumalabs.ai. https://lumalabs.ai/dream-machine, 2024

[2]

Frozen in time: A joint video and image encoder for end-to-end retrieval

Max Bain, Arsha Nagrani, Gül Varol, and Andrew Zisserman. Frozen in time: A joint video and image encoder for end-to-end retrieval. In IEEE International Conference on Computer Vision, 2021

[3]

Videocon: Robust video-language alignment via contrast captions

Hritik Bansal, Yonatan Bitton, Idan Szpektor, Kai-Wei Chang, and Aditya Grover. Videocon: Robust video-language alignment via contrast captions. arXiv preprint arXiv:2311.10111, 2023

[4]

Talc: Time-aligned captions for multi-scene text-to-video generation

Hritik Bansal, Yonatan Bitton, Michal Yarom, Idan Szpektor, Aditya Grover, and Kai-Wei Chang. Talc: Time-aligned captions for multi-scene text-to-video generation. arXiv preprint arXiv:2405.04682, 2024

[5]

Comparing bad apples to good oranges: Aligning large language models via joint preference optimization

Hritik Bansal, Ashima Suvarna, Gantavya Bhatt, Nanyun Peng, Kai-Wei Chang, and Aditya Grover. Comparing bad apples to good oranges: Aligning large language models via joint preference optimization. arXiv preprint arXiv:2404.00530, 2024

[6]

How well can text-to-image generative models understand ethical natural language interventions?

Hritik Bansal, Da Yin, Masoud Monajatipoor, and Kai-Wei Chang. How well can text-to-image generative models understand ethical natural language interventions? arXiv preprint arXiv:2210.15230, 2022

[7]

Lumiere: A space-time diffusion model for video generation

Omer Bar-Tal, Hila Chefer, Omer Tov, Charles Herrmann, Roni Paiss, Shiran Zada, Ariel Ephrat, Junhwa Hur, Yuanzhen Li, Tomer Michaeli, et al. Lumiere: A space-time diffusion model for video generation. arXiv preprint arXiv:2401.12945, 2024

[8]

An introduction to physically based modeling: rigid body simulation i—unconstrained rigid body dynamics

David Baraff. An introduction to physically based modeling: rigid body simulation i—unconstrained rigid body dynamics. SIGGRAPH course notes, 82, 1997

[9]

A fast variational framework for accurate solid-fluid coupling

Christopher Batty, Florence Bertails, and Robert Bridson. A fast variational framework for accurate solid-fluid coupling. ACM Transactions on Graphics (TOG), 26(3):100–es, 2007

[10]

Piqa: Reasoning about physical commonsense in natural language

Yonatan Bisk, Rowan Zellers, Jianfeng Gao, Yejin Choi, et al. Piqa: Reasoning about physical commonsense in natural language. In Proceedings of the AAAI conference on artificial intelligence, volume 34, pages 7432–7439, 2020

[11]

Visit-bench: A benchmark for vision-language instruction following inspired by real-world use

Yonatan Bitton, Hritik Bansal, Jack Hessel, Rulin Shao, Wanrong Zhu, Anas Awadalla, Josh Gardner, Rohan Taori, and Ludwig Schmidt. Visit-bench: A benchmark for vision-language instruction following inspired by real-world use. arXiv preprint arXiv:2308.06595, 2023

[13]

Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023

[14]

Align your latents: High-resolution video synthesis with latent diffusion models

Andreas Blattmann, Robin Rombach, Huan Ling, Tim Dockhorn, Seung Wook Kim, Sanja Fidler, and Karsten Kreis. Align your latents: High-resolution video synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22563–22575, 2023

[15]

Fluid simulation for computer graphics

Robert Bridson. Fluid simulation for computer graphics. AK Peters/CRC Press, 2015

[16]

Generating long videos of dynamic scenes

Tim Brooks, Janne Hellsten, Miika Aittala, Ting-Chun Wang, Timo Aila, Jaakko Lehtinen, Ming-Yu Liu, Alexei Efros, and Tero Karras. Generating long videos of dynamic scenes. Advances in Neural Information Processing Systems, 35:31769–31781, 2022

[17]

Video generation models as world simulators

Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, Clarence Ng, Ricky Wang, and Aditya Ramesh. Video generation models as world simulators. 2024

[18]

Genie: Generative interactive environments

Jake Bruce, Michael Dennis, Ashley Edwards, Jack Parker-Holder, Yuge Shi, Edward Hughes, Matthew Lai, Aditi Mavalankar, Richie Steigerwald, Chris Apps, et al. Genie: Generative interactive environments. arXiv preprint arXiv:2402.15391, 2024

[19]

Storybench: A multifaceted benchmark for continuous story visualization

Emanuele Bugliarello, H Hernan Moraldo, Ruben Villegas, Mohammad Babaeizadeh, Mohammad Taghi Saffar, Han Zhang, Dumitru Erhan, Vittorio Ferrari, Pieter-Jan Kindermans, and Paul Voigtlaender. Storybench: A multifaceted benchmark for continuous story visualization. Advances in Neural Information Processing Systems, 36, 2024

[20]

cerspense/zeroscope_v2_576w · Hugging Face — huggingface.co

cerspense. cerspense/zeroscope_v2_576w · Hugging Face — huggingface.co. https://huggingface.co/cerspense/zeroscope_v2_576w, 2023

[21]

Videocrafter2: Overcoming data limitations for high-quality video diffusion models

Haoxin Chen, Yong Zhang, Xiaodong Cun, Menghan Xia, Xintao Wang, Chao Weng, and Ying Shan. Videocrafter2: Overcoming data limitations for high-quality video diffusion models. arXiv preprint arXiv:2401.09047, 2024

[22]

Physical simulation of environmentally induced thin shell deformation

Hsiao-Yu Chen, Arnav Sastry, Wim M van Rees, and Etienne Vouga. Physical simulation of environmentally induced thin shell deformation. ACM Transactions on Graphics (TOG), 37(4):1–13, 2018

[23]

Panda-70m: Captioning 70m videos with multiple cross-modality teachers

Tsai-Shien Chen, Aliaksandr Siarohin, Willi Menapace, Ekaterina Deyneka, Hsiang-wei Chao, Byung Eun Jeon, Yuwei Fang, Hsin-Ying Lee, Jian Ren, Ming-Hsuan Yang, et al. Panda-70m: Captioning 70m videos with multiple cross-modality teachers. arXiv preprint arXiv:2402.19479, 2024

[24]

Multi-layer thick shells

Yunuo Chen, Tianyi Xie, Cem Yuksel, Danny Kaufman, Yin Yang, Chenfanfu Jiang, and Minchen Li. Multi-layer thick shells. In ACM SIGGRAPH 2023 Conference Proceedings, pages 1–9, 2023

[25]

Learning universal policies via text-guided video generation

Yilun Du, Sherry Yang, Bo Dai, Hanjun Dai, Ofir Nachum, Josh Tenenbaum, Dale Schuurmans, and Pieter Abbeel. Learning universal policies via text-guided video generation. Advances in Neural Information Processing Systems, 36, 2024

[26]

A survey on machine learning approaches for modelling intuitive physics

Jiafei Duan, Arijit Dasgupta, Jason Fischer, and Cheston Tan. A survey on machine learning approaches for modelling intuitive physics. arXiv preprint arXiv:2202.06481, 2022

[27]

Structure and content-guided video synthesis with diffusion models

Patrick Esser, Johnathan Chiu, Parmida Atighehchian, Jonathan Granskog, and Anastasis Germanidis. Structure and content-guided video synthesis with diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7346–7356, 2023

[28]

Scaling rectified flow transformers for high-resolution image synthesis

Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In Forty-first International Conference on Machine Learning, 2024

[29]

Taming transformers for high-resolution image synthesis

Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers for high-resolution image synthesis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12873–12883, 2021

[30]

Iq-mpm: an interface quadrature material point method for non-sticky strongly two-way coupled nonlinear solids and fluids

Yu Fang, Ziyin Qu, Minchen Li, Xinxin Zhang, Yixin Zhu, Mridul Aanjaneya, and Chenfanfu Jiang. Iq-mpm: an interface quadrature material point method for non-sticky strongly two-way coupled nonlinear solids and fluids. ACM Transactions on Graphics (TOG), 39(4):51–1, 2020

[31]

Datacomp: In search of the next generation of multimodal datasets

Samir Yitzhak Gadre, Gabriel Ilharco, Alex Fang, Jonathan Hayase, Georgios Smyrnis, Thao Nguyen, Ryan Marten, Mitchell Wortsman, Dhruba Ghosh, Jieyu Zhang, et al. Datacomp: In search of the next generation of multimodal datasets. Advances in Neural Information Processing Systems, 36, 2024

[32]

genmo. Genmo. Create videos and images with AI. — genmo.ai. https://www.genmo.ai/

[33]

Maniskill2: A unified benchmark for generalizable manipulation skills

Jiayuan Gu, Fanbo Xiang, Xuanlin Li, Zhan Ling, Xiqiang Liu, Tongzhou Mu, Yihe Tang, Stone Tao, Xinyue Wei, Yunchao Yao, et al. Maniskill2: A unified benchmark for generalizable manipulation skills. arXiv preprint arXiv:2302.04659, 2023

[34]

A convex formulation of frictional contact between rigid and deformable bodies

Xuchen Han, Joseph Masterjohn, and Alejandro Castro. A convex formulation of frictional contact between rigid and deformable bodies. IEEE Robotics and Automation Letters, 2023

[35]

Denoising diffusion probabilistic models

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020

[36]

CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers

Wenyi Hong, Ming Ding, Wendi Zheng, Xinghan Liu, and Jie Tang. Cogvideo: Large-scale pretraining for text-to-video generation via transformers. arXiv preprint arXiv:2205.15868, 2022

[36] [37]

LoRA: Low-Rank Adaptation of Large Language Models

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021

[37] [38]

Dreamphysics: Learning physical properties of dynamic 3d gaussians with video diffusion priors

Tianyu Huang, Yihan Zeng, Hui Li, Wangmeng Zuo, and Rynson WH Lau. Dreamphysics: Learning physical properties of dynamic 3d gaussians with video diffusion priors. arXiv preprint arXiv:2406.01476, 2024

work page arXiv 2024

[38] [39]

Plasticinelab: A soft-body manipulation benchmark with differentiable physics

Zhiao Huang, Yuanming Hu, Tao Du, Siyuan Zhou, Hao Su, Joshua B Tenenbaum, and Chuang Gan. Plasticinelab: A soft-body manipulation benchmark with differentiable physics. arXiv preprint arXiv:2104.03311, 2021

work page arXiv 2021

[39] [40]

Vbench: Comprehensive benchmark suite for video generative models

Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, et al. Vbench: Comprehensive benchmark suite for video generative models. arXiv preprint arXiv:2311.17982, 2023

work page arXiv 2023

[40] [41]

EulerDiscreteScheduler — huggingface.co

huggingfaceEulerDiscreteScheduler. EulerDiscreteScheduler — huggingface.co. https: //huggingface.co/docs/diffusers/en/api/schedulers/euler

work page

[41] [42]

Text2video-zero: Text-to-image diffusion mod- els are zero-shot video generators

Levon Khachatryan, Andranik Movsisyan, Vahram Tadevosyan, Roberto Henschel, Zhangyang Wang, Shant Navasardyan, and Humphrey Shi. Text2video-zero: Text-to-image diffusion mod- els are zero-shot video generators. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15954–15964, 2023

work page 2023

[42] [43]

Adam: A Method for Stochastic Optimization

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014

work page internal anchor Pith review Pith/arXiv arXiv 2014

[43] [44]

Drucker-prager elastoplasticity for sand animation

Gergely Klár, Theodore Gast, Andre Pradhana, Chuyuan Fu, Craig Schroeder, Chenfanfu Jiang, and Joseph Teran. Drucker-prager elastoplasticity for sand animation. ACM Transactions on Graphics (TOG), 35(4):1–12, 2016

work page 2016

[44] [45]

KLING AI — klingai.com

KlingAI. KLING AI — klingai.com. https://www.klingai.com/, 2024

work page 2024

[45] [46]

VideoPoet: A Large Language Model for Zero-Shot Video Generation

Dan Kondratyuk, Lijun Yu, Xiuye Gu, José Lezama, Jonathan Huang, Rachel Hornung, Hartwig Adam, Hassan Akbari, Yair Alon, Vighnesh Birodkar, et al. Videopoet: A large language model for zero-shot video generation. arXiv preprint arXiv:2312.14125, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[46] [47]

Smoothed particle hydrodynamics techniques for the physics based simulation of fluids and solids

Dan Koschier, Jan Bender, Barbara Solenthaler, and Matthias Teschner. Smoothed particle hydrodynamics techniques for the physics based simulation of fluids and solids. arXiv preprint arXiv:2009.06944, 2020

work page arXiv 2009

[47] [48]

Subjective-aligned dateset and metric for text-to-video quality assessment

Tengchuan Kou, Xiaohong Liu, Zicheng Zhang, Chunyi Li, Haoning Wu, Xiongkuo Min, Guangtao Zhai, and Ning Liu. Subjective-aligned dateset and metric for text-to-video quality assessment. arXiv preprint arXiv:2403.11956, 2024

work page arXiv 2024

[48] [49]

Viescore: Towards explain- able metrics for conditional image synthesis evaluation

Max Ku, Dongfu Jiang, Cong Wei, Xiang Yue, and Wenhu Chen. Viescore: Towards explain- able metrics for conditional image synthesis evaluation. arXiv preprint arXiv:2312.14867, 2023. 14

work page arXiv 2023

[49] [50]

GitHub - LAION-AI/aesthetic-predictor: A linear estimator on top of clip to predict the aesthetic quality of pictures — github.com

LaionAI. GitHub - LAION-AI/aesthetic-predictor: A linear estimator on top of clip to predict the aesthetic quality of pictures — github.com. https://github.com/LAION-AI/ aesthetic-predictor, 2022

work page 2022

[50] [51]

Variational stokes: a unified pressure- viscosity solver for accurate viscous liquids

Egor Larionov, Christopher Batty, and Robert Bridson. Variational stokes: a unified pressure- viscosity solver for accurate viscous liquids. ACM Transactions on Graphics (TOG), 36(4):1– 11, 2017

work page 2017

[51] [52]

Aligning Text-to-Image Models using Human Feedback

Kimin Lee, Hao Liu, Moonkyung Ryu, Olivia Watkins, Yuqing Du, Craig Boutilier, Pieter Abbeel, Mohammad Ghavamzadeh, and Shixiang Shane Gu. Aligning text-to-image models using human feedback. arXiv preprint arXiv:2302.12192, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[52] [53]

User experience rating scales with 7, 11, or 101 points: does it matter? Journal of Usability Studies, 12(2), 2017

James R Lewis and O˘guzhan Erdinç. User experience rating scales with 7, 11, or 101 points: does it matter? Journal of Usability Studies, 12(2), 2017

work page 2017

[53] [54]

Incremental potential contact: intersection- and inversion-free, large-deformation dynamics

Minchen Li, Zachary Ferguson, Teseo Schneider, Timothy R Langlois, Denis Zorin, Daniele Panozzo, Chenfanfu Jiang, and Danny M Kaufman. Incremental potential contact: intersection- and inversion-free, large-deformation dynamics. ACM Trans. Graph., 39(4):49, 2020

work page 2020

[54] [55]

Codimensional incremental potential contact

Minchen Li, Danny M Kaufman, and Chenfanfu Jiang. Codimensional incremental potential contact. arXiv preprint arXiv:2012.04457, 2020

work page arXiv 2012

[55] [56]

Aligning diffusion models by optimizing human utility

Shufan Li, Konstantinos Kallidromitis, Akash Gokul, Yusuke Kato, and Kazuki Kozuka. Aligning diffusion models by optimizing human utility. arXiv preprint arXiv:2404.04465, 2024

work page arXiv 2024

[56] [57]

Energetically consistent inelasticity for optimiza- tion time integration

Xuan Li, Minchen Li, and Chenfanfu Jiang. Energetically consistent inelasticity for optimiza- tion time integration. ACM Transactions on Graphics (TOG), 41(4):1–16, 2022

work page 2022

[57] [58]

Gpu-accelerated robotic simulation for distributed reinforcement learning

Jacky Liang, Viktor Makoviychuk, Ankur Handa, Nuttapong Chentanez, Miles Macklin, and Dieter Fox. Gpu-accelerated robotic simulation for distributed reinforcement learning. In Conference on Robot Learning, pages 270–282. PMLR, 2018

work page 2018

[58] [59]

Evaluating text-to-visual generation with image-to-text generation

Zhiqiu Lin, Deepak Pathak, Baiqi Li, Jiayao Li, Xide Xia, Graham Neubig, Pengchuan Zhang, and Deva Ramanan. Evaluating text-to-visual generation with image-to-text generation. arXiv preprint arXiv:2404.01291, 2024

work page arXiv 2024

[59] [60]

Physics3d: Learning physical properties of 3d gaussians via video diffusion

Fangfu Liu, Hanyang Wang, Shunyu Yao, Shengjun Zhang, Jie Zhou, and Yueqi Duan. Physics3d: Learning physical properties of 3d gaussians via video diffusion. arXiv preprint arXiv:2406.04338, 2024

work page arXiv 2024

[60] [61]

Visual instruction tuning

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in neural information processing systems, 36, 2024

work page 2024

[61] [62]

Physgen: Rigid-body physics-grounded image-to-video generation

Shaowei Liu, Zhongzheng Ren, Saurabh Gupta, and Shenlong Wang. Physgen: Rigid-body physics-grounded image-to-video generation

work page

[62] [63]

Evalcrafter: Benchmarking and evaluating large video generation models

Yaofang Liu, Xiaodong Cun, Xuebo Liu, Xintao Wang, Yong Zhang, Haoxin Chen, Yang Liu, Tieyong Zeng, Raymond Chan, and Ying Shan. Evalcrafter: Benchmarking and evaluating large video generation models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22139–22149, 2024

work page 2024

[63] [64]

Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models

Yixin Liu, Kai Zhang, Yuan Li, Zhiling Yan, Chujie Gao, Ruoxi Chen, Zhengqing Yuan, Yue Huang, Hanchi Sun, Jianfeng Gao, et al. Sora: A review on background, technology, limitations, and opportunities of large vision models. arXiv preprint arXiv:2402.17177, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[64] [65]

Effect of the number of response categories on the reliability and validity of rating scales

Luis M Lozano, Eduardo García-Cueto, and José Muñiz. Effect of the number of response categories on the reliability and validity of rating scales. Methodology, 4(2):73–79, 2008

work page 2008

[65] [66]

DPM-Solver++: Fast Solver for Guided Sampling of Diffusion Probabilistic Models

Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. Dpm- solver++: Fast solver for guided sampling of diffusion probabilistic models. arXiv preprint arXiv:2211.01095, 2022. 15

work page internal anchor Pith review Pith/arXiv arXiv 2022

[66] [67]

Physically- aware generative network for 3d shape modeling

Mariem Mezghanni, Malika Boulkenafed, Andre Lieutier, and Maks Ovsjanikov. Physically- aware generative network for 3d shape modeling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9330–9341, 2021

work page 2021

[67] [68]

mplug-owl-video

mplugowl. mplug-owl-video. https://github.com/X-PLUG/mPLUG-Owl/tree/main/ mPLUG-Owl/mplug_owl_video

work page

[68] [69]

Particle-based fluid-fluid interaction

Matthias Müller, Barbara Solenthaler, Richard Keiser, and Markus Gross. Particle-based fluid-fluid interaction. In Proceedings of the 2005 ACM SIGGRAPH/Eurographics symposium on Computer animation, pages 237–244, 2005

work page 2005

[69] [70]

Phyrecon: Physically plausible neural scene reconstruction

Junfeng Ni, Yixin Chen, Bohan Jing, Nan Jiang, Bin Wang, Bo Dai, Yixin Zhu, Song-Chun Zhu, and Siyuan Huang. Phyrecon: Physically plausible neural scene reconstruction. arXiv preprint arXiv:2404.16666, 2024

[70] [71]

Improved denoising diffusion probabilistic models

Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. In International conference on machine learning, pages 8162–8171. PMLR, 2021

[71] [72]

Graphical modeling and animation of ductile fracture

James F O’Brien, Adam W Bargteil, and Jessica K Hodgins. Graphical modeling and animation of ductile fracture. In Proceedings of the 29th annual conference on Computer graphics and interactive techniques, pages 291–294, 2002

[72] [73]

GPT-4 Technical Report

OpenAI. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023

[73] [74]

GPT-4V(ision) system card

OpenAI. GPT-4V(ision) system card. https://openai.com/research/gpt-4v-system-card, 2023

[74] [75]

Open-Sora: Democratizing Efficient Video Production for All

OpenSora. Open-Sora: Democratizing Efficient Video Production for All. https://github.com/hpcaitech/Open-Sora, 2024

[75] [76]

Vibe-eval: A hard evaluation suite for measuring progress of multimodal language models

Piotr Padlewski, Max Bain, Matthew Henderson, Zhongkai Zhu, Nishant Relan, Hai Pham, Donovan Ong, Kaloyan Aleksiev, Aitor Ormazabal, Samuel Phua, et al. Vibe-eval: A hard evaluation suite for measuring progress of multimodal language models. arXiv preprint arXiv:2405.02287, 2024

[76] [77]

Scalable diffusion models with transformers

William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4195–4205, 2023

[77] [78]

Pika — pika.art

pika. Pika — pika.art. https://pika.art/

[78] [79]

Intuitive physics learning in a deep-learning model inspired by developmental psychology

Luis S Piloto, Ari Weinstein, Peter Battaglia, and Matthew Botvinick. Intuitive physics learning in a deep-learning model inspired by developmental psychology. Nature human behaviour, 6(9):1257–1267, 2022

[79] [80]

SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023

[80] [81]

Power plastics: A hybrid lagrangian/eulerian solver for mesoscale inelastic flows

Ziyin Qu, Minchen Li, Yin Yang, Chenfanfu Jiang, and Fernando De Goes. Power plastics: A hybrid lagrangian/eulerian solver for mesoscale inelastic flows. ACM Transactions on Graphics (TOG), 42(6):1–11, 2023
