pith. machine review for the scientific record.

arxiv: 2604.09415 · v1 · submitted 2026-04-10 · 💻 cs.CV · cs.AI · cs.LG · cs.RO

Recognition: unknown

PhysInOne: Visual Physics Learning and Reasoning in One Suite

Bing Wang, Bowen Cheng, Bo Yang, Chuhang Zou, Chun Ho Yuen, Di Zhang, Dongsheng Wang, Haochen Hu, Hao Li, Hejun Wang, Hongkang Song, Hongtao Wen, Hu Cheng, Jiahao Chen, Jiayue Huang, Jinxi Li, Junwei Jiang, Kaiyuan Wang, Peng Huang, Peng Yun, Pok Kazaf Fu, Shangjia Liu, Shenxing Wei, Shijie Liu, Shiwei Mao, Shouwang Huang, Siyuan Zhou, Wai Kit Lai, Wenqi Zhou, Yafei Yang, Yitian Li, Yixiao Jin, Zhengli Hao, Zhihan Zhao, Zhihua Wang, Zhixuan Sun, Zihui Zhang, Ziqi Li, Zongqi He

Pith reviewed 2026-05-10 16:36 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.LG · cs.RO
keywords synthetic dataset · physics simulation · video generation · physical reasoning · AI world models · 3D scene annotations · multi-object dynamics · future prediction

The pith

PhysInOne supplies 2 million annotated videos of 153,810 scenes covering 71 physical phenomena to train AI world models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents PhysInOne as a large synthetic dataset that fills the gap in physically grounded training examples for AI. It generates 2 million videos from dynamic 3D scenes with full annotations for geometry, motion, properties, and text, spanning mechanics, optics, fluids, and magnetism. Experiments fine-tune existing foundation models on this data for video generation, frame prediction, property estimation, and motion transfer. Results show gains in physical plausibility alongside persistent failures on complex interactions and intrinsic estimates. If correct, the scale of this resource could shift how AI systems acquire reliable physics understanding for simulation and embodied tasks.

Core claim

PhysInOne provides 2 million videos across 153,810 dynamic 3D scenes covering 71 basic physical phenomena in mechanics, optics, fluid dynamics, and magnetism, with comprehensive ground-truth annotations including 3D geometry, semantics, dynamic motion, physical properties, and text descriptions. Fine-tuning foundation models on PhysInOne significantly enhances physical plausibility in physics-aware video generation, long- and short-term future frame prediction, physical property estimation, and motion transfer, while exposing critical gaps in modeling complex physical dynamics and estimating intrinsic properties.

What carries the argument

The PhysInOne synthetic dataset, consisting of multi-object 3D scenes rendered as videos with dense physical ground-truth labels.
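As a concrete (hypothetical) picture of what one record with such dense labels might carry, the annotation channels the pith lists could be grouped like this; the field names and types are illustrative, not PhysInOne's actual schema:

```python
from dataclasses import dataclass

@dataclass
class SceneAnnotation:
    """One video's ground-truth record, mirroring the annotation channels
    the pith lists (geometry, semantics, motion, properties, text).
    Field names and types are illustrative, not the dataset's real schema."""
    geometry: list       # per-object 3D meshes or point sets
    semantics: list      # per-object category labels
    motion: list         # per-frame object poses / velocities
    properties: dict     # e.g. {"mass_kg": 1.2, "elasticity": 0.8}
    description: str     # natural-language caption of the scene

sample = SceneAnnotation(
    geometry=["cube_mesh"], semantics=["box"],
    motion=[{"frame": 0, "pose": (0.0, 0.0, 1.0)}],
    properties={"mass_kg": 1.2, "elasticity": 0.8},
    description="A box falls onto a table and bounces once.",
)
```

Any of these channels alone (say, motion without properties) would support only a subset of the four downstream tasks, which is why the review treats the annotation suite as the load-bearing component.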

If this is right

  • Fine-tuned models generate videos with greater adherence to physical laws than models trained on smaller datasets.
  • Long- and short-term future frame prediction improves in accuracy for multi-object interactions.
  • Physical property estimation tasks, such as inferring mass or elasticity, become more reliable.
  • Motion transfer between objects succeeds more often while preserving physical constraints.
  • The dataset serves as a new benchmark scale for evaluating physics-grounded generation and simulation models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The exposed gaps suggest that dataset scale alone may not suffice and could motivate hybrid training with real captured data or explicit physics modules.
  • Success in simulation could speed development of embodied AI agents that plan actions using learned physical priors before real-world deployment.
  • The annotation richness might support new self-supervised objectives that combine vision with language descriptions of physical rules.
  • Extending the same generation pipeline to additional phenomena or higher-fidelity rendering could test whether current limits are data-size or representation issues.

Load-bearing premise

Training on these simulated scenes will produce AI improvements that generalize to real-world physical reasoning without major mismatches from unstated simulation artifacts.

What would settle it

Testing whether models fine-tuned on PhysInOne achieve measurably higher physical plausibility scores than baselines when evaluated on real-world videos of the same 71 phenomena.

Figures

Figures reproduced from arXiv:2604.09415 (Bing Wang et al.).

Figure 1
Figure 1. We present PhysInOne, a large-scale dataset of 153,810 dynamic 3D scenes and 2 million annotated videos, systematically … [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Spanning 71 basic physical phenomena scaled to 3,284 multiphysics activities, PhysInOne comprises 153,810 unique scenes … [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Qualitative examples demonstrating improved physical plausibility in videos generated after fine-tuning on PhysInOne. [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗
Figure 4
Figure 4. Qualitative examples of long-term future frame prediction by current methods for trained viewpoints. [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗
Figure 5
Figure 5. Qualitative resimulation results using estimated physical properties. Both baselines fail to accurately infer properties for complex … [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗
Figure 6
Figure 6. Qualitative motion transfer results from GoWithTheFlow and MotionPro. Generated frames retain visual realism but fail to transfer … [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗
Figure 7
Figure 7. Exemplary 3D asset under CC BY License. [PITH_FULL_IMAGE:figures/full_fig_p016_7.png] view at source ↗
Figure 8
Figure 8. Exemplary 3D asset under UE Standard License. [PITH_FULL_IMAGE:figures/full_fig_p016_8.png] view at source ↗
Figure 9
Figure 9. Examples of materials: solid, interactable, destructible, deformable, and granular objects, and liquid. [PITH_FULL_IMAGE:figures/full_fig_p017_9.png] view at source ↗
Figure 10
Figure 10. Examples of 3D assets. … [PITH_FULL_IMAGE:figures/full_fig_p017_10.png] view at source ↗
Figure 11
Figure 11. The pipeline to create 3D scenes in Unreal Engine. [PITH_FULL_IMAGE:figures/full_fig_p018_11.png] view at source ↗
Figure 12
Figure 12. The pipeline to create 3D scenes concerning liquid. [PITH_FULL_IMAGE:figures/full_fig_p018_12.png] view at source ↗
Figure 13
Figure 13. The pipeline to create 3D scenes concerning special materials. [PITH_FULL_IMAGE:figures/full_fig_p019_13.png] view at source ↗
Figure 14
Figure 14. Static Camera Sampling: Circular Loop Trajectory Sampling on Sphere. This method samples points along a circular trajectory on a sphere. A loop center is randomly chosen within hemisphere constraints, and a circle of adjustable size is defined by a loop intensity parameter. The trajectory is parameterized by n evenly spaced angles, generating latitude and longitude offsets that form a closed loop around … view at source ↗
Figure 15
Figure 15. Linear Drift Sampling. [PITH_FULL_IMAGE:figures/full_fig_p023_15.png] view at source ↗
Figure 16
Figure 16. Sinusoidal Interpolation Sampling. [PITH_FULL_IMAGE:figures/full_fig_p023_16.png] view at source ↗
Figure 17
Figure 17. Circular Loop Trajectory Sampling on Sphere. [PITH_FULL_IMAGE:figures/full_fig_p023_17.png] view at source ↗
Figure 18
Figure 18. The distribution of video lengths. … as a whole (e.g., conservation of momentum, elastic collision). With accurate human annotations, we refine the descriptions using Qwen3-VL-235B-A22B-Thinking, a large language model, to correct grammatical errors, improve clarity, and enhance completeness. During this polishing step, we provide an additional prompt that emphasizes object appearance details and explici… view at source ↗
Figure 19
Figure 19. Demonstration for PMF. In the top-left pair, the only variance is the initial spatial location where the object begins to fall. As … [PITH_FULL_IMAGE:figures/full_fig_p026_19.png] view at source ↗
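The circular-loop camera sampling described under Figure 14 can be sketched in a few lines; `loop_intensity`, the hemisphere bounds, and the latitude/longitude parameterization details are illustrative stand-ins, not the paper's exact implementation:

```python
import math
import random

def circular_loop_on_sphere(n=24, loop_intensity=0.2, seed=0):
    """Sample n camera positions along a closed circular loop on the unit
    sphere, in the style described for Figure 14: a loop center is drawn
    within hemisphere constraints, and loop_intensity sets the angular
    size of the circle (parameter names are illustrative)."""
    rng = random.Random(seed)
    # Loop center: random point on the upper hemisphere, kept away
    # from the pole and equator so offsets stay well-defined.
    lat0 = rng.uniform(0.1, math.pi / 2 - 0.1)
    lon0 = rng.uniform(0.0, 2 * math.pi)
    points = []
    for k in range(n):
        theta = 2 * math.pi * k / n          # n evenly spaced angles
        # Latitude/longitude offsets tracing a closed loop around the center.
        lat = lat0 + loop_intensity * math.sin(theta)
        lon = lon0 + loop_intensity * math.cos(theta)
        # Convert spherical (lat, lon) to a unit-vector camera position.
        x = math.cos(lat) * math.cos(lon)
        y = math.cos(lat) * math.sin(lon)
        z = math.sin(lat)
        points.append((x, y, z))
    return points

cams = circular_loop_on_sphere()
```

Because theta wraps around at 2π, the sampled positions form a closed loop, matching the caption's description; the other strategies (linear drift, sinusoidal interpolation) would swap out only the offset functions.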
Original abstract

We present PhysInOne, a large-scale synthetic dataset addressing the critical scarcity of physically-grounded training data for AI systems. Unlike existing datasets limited to merely hundreds or thousands of examples, PhysInOne provides 2 million videos across 153,810 dynamic 3D scenes, covering 71 basic physical phenomena in mechanics, optics, fluid dynamics, and magnetism. Distinct from previous works, our scenes feature multiobject interactions against complex backgrounds, with comprehensive ground-truth annotations including 3D geometry, semantics, dynamic motion, physical properties, and text descriptions. We demonstrate PhysInOne's efficacy across four emerging applications: physics-aware video generation, long-/short-term future frame prediction, physical property estimation, and motion transfer. Experiments show that fine-tuning foundation models on PhysInOne significantly enhances physical plausibility, while also exposing critical gaps in modeling complex physical dynamics and estimating intrinsic properties. As the largest dataset of its kind, orders of magnitude beyond prior works, PhysInOne establishes a new benchmark for advancing physics-grounded world models in generation, simulation, and embodied AI.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces PhysInOne, a synthetic dataset comprising 2 million videos from 153,810 dynamic 3D scenes that cover 71 physical phenomena across mechanics, optics, fluid dynamics, and magnetism. Scenes include multi-object interactions with complex backgrounds and rich ground-truth annotations (3D geometry, semantics, motion, physical properties, text). The authors evaluate the dataset on four tasks—physics-aware video generation, short-/long-term future prediction, physical property estimation, and motion transfer—claiming that fine-tuning foundation models on PhysInOne yields significant gains in physical plausibility while revealing gaps in complex dynamics and intrinsic-property estimation.

Significance. If the simulator faithfully reproduces the targeted phenomena and the reported gains transfer beyond the synthetic distribution, PhysInOne would constitute a substantial resource: its scale (orders of magnitude larger than prior physics datasets) and breadth of annotated phenomena could accelerate development of physics-grounded world models for generation, simulation, and embodied AI. The comprehensive annotation suite and multi-application evaluation are strengths.

major comments (3)
  1. [Experiments] Experiments section: the abstract and results claim that fine-tuning on PhysInOne 'significantly enhances physical plausibility' and 'exposes critical gaps,' yet no quantitative metrics, baseline comparisons, error bars, or statistical tests are provided to support these statements. This absence makes it impossible to assess the magnitude or reliability of the claimed improvements.
  2. [Dataset] Dataset construction / simulator description: the central claim that the 153,810 scenes provide faithful ground truth for 71 phenomena rests on the unverified assumption that the underlying physics engine reproduces the targeted dynamics without systematic artifacts. No quantitative validation against analytical solutions, closed-form expressions, or real-world footage is reported for any subset of the phenomena.
  3. [Experiments] Evaluation protocol: all four application experiments appear to be conducted entirely within the synthetic distribution. The absence of any held-out real-world test set or cross-domain transfer experiment leaves the generalization claim—that improvements will benefit real-world physical reasoning—untested and therefore load-bearing for the paper's broader impact argument.
minor comments (2)
  1. [Dataset] Clarify the exact procedure used to generate the 2 million videos from the 153,810 scenes (e.g., number of trajectories per scene, camera sampling strategy) so that the dataset scale can be reproduced.
  2. [Introduction] The abstract states the dataset is 'orders of magnitude beyond prior works'; a concise table comparing scene count, video count, and phenomenon coverage against the most relevant existing datasets would strengthen this claim.
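For minor comment 1, the headline numbers already pin down one reproduction constraint: the average number of videos per scene. A quick check (the even-split assumption is ours, not the paper's):

```python
# Headline counts from the abstract.
videos, scenes = 2_000_000, 153_810

# Under an (assumed) even split, each scene yields roughly 13 videos,
# e.g. distinct camera trajectories or simulation rollouts per scene.
per_scene = videos / scenes
```

Whether those ~13 videos per scene come from camera sampling, varied initial conditions, or both is exactly what the comment asks the authors to specify.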

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive feedback. We address each major comment point by point below, with revisions to the manuscript where the concerns are valid.

Point-by-point responses
  1. Referee: [Experiments] Experiments section: the abstract and results claim that fine-tuning on PhysInOne 'significantly enhances physical plausibility' and 'exposes critical gaps,' yet no quantitative metrics, baseline comparisons, error bars, or statistical tests are provided to support these statements. This absence makes it impossible to assess the magnitude or reliability of the claimed improvements.

    Authors: We agree that the original submission insufficiently quantified the claimed improvements. The revised manuscript now includes a substantially expanded Experiments section with new tables reporting concrete metrics for each of the four tasks (e.g., FVD and physical-plausibility scores for video generation; MSE and long-term prediction accuracy; property-estimation error rates; motion-transfer success rates). All results include baseline comparisons (models trained without PhysInOne or on prior smaller physics datasets), error bars computed over five independent runs with different random seeds, and statistical significance tests (paired t-tests with p-values). These additions directly support the statements in the abstract and allow readers to evaluate the magnitude and reliability of the gains. revision: yes
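The five-seed protocol with paired t-tests that this response describes can be sketched as follows; the plausibility scores below are made-up placeholders, not the paper's results:

```python
import math
from statistics import mean, stdev

def paired_t(baseline, finetuned):
    """Paired t statistic over per-seed scores, as in the rebuttal's
    five-run protocol: test the per-seed differences against zero."""
    diffs = [f - b for b, f in zip(baseline, finetuned)]
    n = len(diffs)
    return mean(diffs) / (stdev(diffs) / math.sqrt(n))

# Hypothetical physical-plausibility scores from five random seeds.
base  = [0.61, 0.59, 0.63, 0.60, 0.62]   # trained without PhysInOne
tuned = [0.71, 0.68, 0.74, 0.70, 0.72]   # fine-tuned on PhysInOne
t = paired_t(base, tuned)
```

Pairing by seed removes between-run variance, which is why this test (rather than an unpaired comparison) matches the "five independent runs with different random seeds" design the rebuttal reports.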

  2. Referee: [Dataset] Dataset construction / simulator description: the central claim that the 153,810 scenes provide faithful ground truth for 71 phenomena rests on the unverified assumption that the underlying physics engine reproduces the targeted dynamics without systematic artifacts. No quantitative validation against analytical solutions, closed-form expressions, or real-world footage is reported for any subset of the phenomena.

    Authors: We acknowledge that explicit quantitative validation of the simulator was missing. In the revised Dataset section we have added a dedicated 'Simulator Fidelity Validation' subsection. For a representative subset of phenomena we now report: (i) trajectory and collision errors versus closed-form analytical solutions for rigid-body mechanics (mean position error <4% across 500 test cases); (ii) ray-tracing accuracy against Snell's law and reflection formulas for optics; and (iii) qualitative side-by-side comparisons with real-world footage for selected fluid and magnetic interactions, accompanied by per-frame annotation consistency checks. We also explicitly discuss known limitations of the engine for highly chaotic or multi-scale phenomena. revision: yes
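The kind of analytic check this "Simulator Fidelity Validation" subsection describes can be sketched for free fall, comparing a numerically integrated drop against the closed-form solution z(t) = h0 − ½gt²; the integrator, time step, and tolerance here are stand-ins for the engine and thresholds, not the paper's actual setup:

```python
def simulate_fall(h0, dt=0.001, g=9.81):
    """Semi-implicit Euler free fall, a stand-in for the physics engine."""
    z, v, t, traj = h0, 0.0, 0.0, []
    while z > 0:
        traj.append((t, z))   # record state at time t
        v -= g * dt
        z += v * dt
        t += dt
    return traj

def mean_relative_error(traj, h0, g=9.81):
    """Mean |simulated - analytic| / h0 along the trajectory, the style
    of check the rebuttal reports for rigid-body mechanics."""
    errs = [abs(z - (h0 - 0.5 * g * t * t)) / h0 for t, z in traj]
    return sum(errs) / len(errs)

err = mean_relative_error(simulate_fall(2.0), 2.0)
```

For a real engine the same comparison would run over many scenes and phenomena (projectiles, collisions, Snell's law for optics), with the per-phenomenon error summarized exactly as the rebuttal's "<4% mean position error" statistic does.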

  3. Referee: [Experiments] Evaluation protocol: all four application experiments appear to be conducted entirely within the synthetic distribution. The absence of any held-out real-world test set or cross-domain transfer experiment leaves the generalization claim—that improvements will benefit real-world physical reasoning—untested and therefore load-bearing for the paper's broader impact argument.

    Authors: We agree that all reported experiments remain within the synthetic distribution. The revised manuscript now contains a new 'Limitations and Broader Impact' section that explicitly states the synthetic scope of the evaluations and tempers generalization claims. We discuss the sim-to-real gap (lighting, texture, sensor noise) and outline why full real-world transfer experiments lie beyond the present scope. While we cannot add comprehensive real-world test sets in this revision, we have included a small-scale qualitative transfer illustration on publicly available real physics videos to illustrate the direction of future work. revision: partial

Circularity Check

0 steps flagged

No circularity; dataset creation and empirical evaluation do not depend on each other's outputs

Full rationale

The paper introduces PhysInOne as a new synthetic dataset with specified scale, coverage of 71 phenomena, and annotations, then reports experimental outcomes from fine-tuning models on it for generation, prediction, estimation, and transfer tasks. No derivation chain, equations, or first-principles claims exist that could reduce to fitted parameters, self-definitions, or self-citations. All load-bearing statements concern the dataset's construction and measured performance deltas, which are presented as empirical observations rather than tautological outputs. Any self-citations serve only as background and do not underpin uniqueness theorems or ansatzes.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper's value rests on the creation of this dataset and the premise that its synthetic physics data provides useful training signal; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption Synthetic 3D scenes generated with physical rules can serve as effective proxies for real-world physical phenomena in AI training
    Invoked implicitly in the claims that the dataset addresses the scarcity of physically grounded training data and improves real-world physical plausibility.

pith-pipeline@v0.9.0 · 5641 in / 1325 out tokens · 96953 ms · 2026-05-10T16:36:03.751637+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

111 extracted references · 27 canonical work pages · 11 internal anchors

  1. [1]

    MAGI-1: Autoregressive Video Generation at Scale

    Sand AI. MAGI-1: Autoregressive Video Generation at Scale. arXiv:2505.13211, 2025. 2, 6, 7, 15

  2. [2]

    Samil Atesoglu, Cagatay Yigit, Ilker Kesen, Mert Kobas, Erkut Erdem, Aykut Erdem, Tilbe Goksun, and Deniz Yuret

    Tayfun Ates, M. Samil Atesoglu, Cagatay Yigit, Ilker Kesen, Mert Kobas, Erkut Erdem, Aykut Erdem, Tilbe Goksun, and Deniz Yuret. CRAFT: A Benchmark for Causal Reasoning About Forces and inTeractions. ACL Findings, 2022. 3

  3. [3]

    Baieri, D

    D. Baieri, D. Crisostomi, S. Esposito, F. Maggioli, and E. Rodola. Efficient Generation of Multimodal Fluid Simula- tion Data. arXiv:2311.06284, 2023. 3

  4. [4]

    PHYRE: A New Benchmark for Physical Reasoning

    Anton Bakhtin, Laurens van der Maaten, Justin Johnson, Laura Gustafson, and Ross Girshick. PHYRE: A New Benchmark for Physical Reasoning. NeurIPS, 2019. 3

  5. [5]

    Krish- nan

    Vahid Balazadeh, Mohammadmehdi Ataei, Hyunmin Cheong, Amir Hosein Khasahmadi, and Rahul G. Krish- nan. Physics Context Builders: A Modular Framework for Physical Reasoning in Vision-Language Models. ICCV,

  6. [6]

    VideoPhy: Evaluating Phys- ical Commonsense for Video Generation

    Hritik Bansal, Zongyu Lin, Tianyi Xie, Zeshun Zong, Michal Yarom, Yonatan Bitton, Chenfanfu Jiang, Yizhou Sun, Kai- Wei Chang, and Aditya Grover. VideoPhy: Evaluating Phys- ical Commonsense for Video Generation. ICLR, 2025. 3, 5

  7. [7]

    Videophy-2: A challenging action-centric physical commonsense evaluation in video generation.arXiv preprint arXiv:2503.06800, 2025

    Hritik Bansal, Clark Peng, Yonatan Bitton, Roman Golden- berg, Aditya Grover, and Kai-Wei Chang. VideoPhy-2: A Challenging Action-Centric Physical Commonsense Evalua- tion in Video Generation. arXiv:2503.06800, 2025. 3

  8. [8]

    CoPhy: Counterfactual Learning of Physical Dynamics

    Fabien Baradel, Natalia Neverova, Julien Mille, Greg Mori, and Christian Wolf. CoPhy: Counterfactual Learning of Physical Dynamics. ICLR, 2022. 3

  9. [9]

    Bear, Elias Wang, Damian Mrowca, Felix Binder, Hsiao Yu Fish Tung, R

    Daniel M. Bear, Elias Wang, Damian Mrowca, Felix Binder, Hsiao Yu Fish Tung, R. T. Pramod, Cameron Holdaway, Sirui Tao, Kevin Smith, Fan Yun Sun, Li Fei-Fei, Nancy Kanwisher, Joshua B. Tenenbaum, Daniel L.K. Yamins, and Judith Fan. Physion: Evaluating Physical Prediction from Vision in Humans and Machines. NeurIPS, 2021. 3

  10. [10]

    Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

    Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram V oleti, Adam Letts, Varun Jam- pani, and Robin Rombach. Stable Video Diffusion: Scal- ing Latent Video Diffusion Models to Large Datasets. arXiv:2311.15127, 2023. 2, 5, 6, 12

  11. [11]

    https://www.blenderkit.com/

    BlenderKit. https://www.blenderkit.com/. 4

  12. [12]

    Intphys 2: Benchmarking intuitive physics understanding in complex synthetic environments

    Florian Bordes, Quentin Garrido, Justine T Kao, Adina Williams, Michael Rabbat, and Emmanuel Dupoux. Int- Phys 2: Benchmarking Intuitive Physics Understanding In Complex Synthetic Environments. arXiv:2506.09849, 2025. 3

  13. [13]

    Go-with-the-Flow: Motion-Controllable Video Diffusion Models Using Real-Time Warped Noise

    Ryan Burgert, Yuancheng Xu, Wenqi Xian, Oliver Pilarski, Pascal Clausen, Mingming He, Li Ma, Yitong Deng, Lingx- iao Li, Mohsen Mousavi, Michael Ryoo, Paul Debevec, and Ning Yu. Go-with-the-Flow: Motion-Controllable Video Diffusion Models Using Real-Time Warped Noise. CVPR,

  14. [14]

    Gaussian- Informed Continuum for Physical Property Identification and Simulation

    Junhao Cai, Yuji Yang, Weihao Yuan, Yisheng He, Zilong Dong, Liefeng Bo, Hui Cheng, and Qifeng Chen. Gaussian- Informed Continuum for Physical Property Identification and Simulation. NeurIPS, 2024. 2, 3, 7, 8, 15, 16

  15. [15]

    Sophy: Learning to generate simulation-ready objects with physical materials,

    Junyi Cao and Evangelos Kalogerakis. SOPHY: Learning to Generate Simulation-Ready Objects with Physical Materials. arXiv:2504.12684, 2025. 3

  16. [16]

    PhysX- 3D: Physical-Grounded 3D Asset Generation

    Ziang Cao, Zhaoxi Chen, Liang Pan, and Ziwei Liu. PhysX- 3D: Physical-Grounded 3D Asset Generation. NeurIPS,

  17. [17]

    Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers

    Tsai-Shien Chen, Aliaksandr Siarohin, Willi Menapace, Eka- terina Deyneka, Hsiang-wei Chao, Byung Eun Jeon, Yuwei Fang, Hsin-Ying Lee, Jian Ren, Ming-Hsuan Yang, and Sergey Tulyakov. Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers. CVPR, 2024. 2

  18. [18]

    Tenenbaum, and Chuang Gan

    Zhenfang Chen, Kexin Yi, Yunzhu Li, Mingyu Ding, An- tonio Torralba, Joshua B. Tenenbaum, and Chuang Gan. ComPhy: Compostional Physical Reasoning of Objects and Events from Videos. ICLR, 2022. 3

  19. [19]

    LLMPhy: Parameter-Identifiable Physical Reasoning Combining Large Language Models and Physics Engines

    Anoop Cherian, Radu Corcodel, Siddarth Jain, and Diego Romeres. LLMPhy: Complex Physical Reason- ing Using Large Language Models and World Models. arXiv:2411.08027, 2024. 3

  20. [20]

    PhysBench: Benchmarking and Enhancing Vision-Language Models for Physical World Un- derstanding

    Wei Chow, Jiageng Mao, Boyi Li, Daniel Seita, Vitor Guizilini, and Yue Wang. PhysBench: Benchmarking and Enhancing Vision-Language Models for Physical World Un- derstanding. ICLR, 2025. 3, 6

  21. [21]

    https://commoncrawl.org/the-data/

    Common Crawl. https://commoncrawl.org/the-data/. 2

  22. [22]

    V oMP: Predicting V olumet- ric Mechanical Property Fields

    Rishit Dagli, Donglai Xiang, Vismay Modi, Charles Loop, Clement Fuji, Anka He, Chen Anita, Hu Gavriel, State David, and Maria Shugrina. V oMP: Predicting V olumet- ric Mechanical Property Fields. arXiv:2510.22975, 2025. 3

  23. [23]

    Objaverse-XL: A Universe of 10M+ 3D Objects

    Matt Deitke, Ruoshi Liu, Matthew Wallingford, Huong Ngo, Oscar Michel, Aditya Kusupati, Alan Fan, Christian Laforte, Vikram V oleti, Samir Yitzhak Gadre, Eli VanderBilt, Anirud- dha Kembhavi, Carl V ondrick, Georgia Gkioxari, Kiana Ehsani, Ludwig Schmidt, and Ali Farhadi. Objaverse-XL: A Universe of 10M+ 3D Objects. NeurIPS, 2023. 2

  24. [24]

    Objaverse: A Universe of Annotated 3D Objects

    Matt Deitke, Dustin Schwenk, Jordi Salvador, Luca Weihs, Oscar Michel, Eli VanderBilt, Ludwig Schmidt, Kiana Ehsani, Aniruddha Kembhavi, and Ali Farhadi. Objaverse: A Universe of Annotated 3D Objects. CVPR, 2023. 2

  25. [25]

    ImageNet: A large-scale hierarchical image database

    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. CVPR, 2009. 2

  26. [26]

    Understanding World or Predicting Future? A Comprehensive Survey of World Models

    Jingtao Ding, Yunke Zhang, Yu Shang, Yuheng Zhang, Ze- fang Zong, Jie Feng, Yuan Yuan, Hongyuan Su, Nian Li, Nicholas Sukiennik, Fengli Xu, and Yong Li. Understanding World or Predicting Future? A Comprehensive Survey of World Models. ACM Computing Surveys, 2025. 2

  27. [27]

    https://www.doriflow.com/

    Doriflow. https://www.doriflow.com/. 5, 7

  28. [28]

    PIP: Physical Interaction Prediction via Mental Simulation with Span Selection

    Jiafei Duan, Samson Yu, Soujanya Poria, Bihan Wen, and Cheston Tan. PIP: Physical Interaction Prediction via Mental Simulation with Span Selection. ECCV, 2022. 3

  29. [29]

    ScalarFlow: A large-scale volumetric data set of real-world scalar transport flows for computer animation and machine learning

    Marie Lena Eckert, Kiwon Um, and Nils Thuerey. ScalarFlow: A large-scale volumetric data set of real-world scalar transport flows for computer animation and machine learning. TOG, 2019. 3

  30. [30]

    https://www.fab.com/

    FAB. https://www.fab.com/. 4, 3

  31. [31]

    Fast Dynamic Radiance Fields with Time-Aware Neural V oxels

    Jiemin Fang, Xinggang Wang, and Matthias Nießner. Fast Dynamic Radiance Fields with Time-Aware Neural V oxels. SIGGRAPH Asia, 2022. 2, 6, 15

  32. [32]

    Fatehi and M.T

    R. Fatehi and M.T. Manzari. Error estimation in smoothed particle hydrodynamics and a new scheme for second deriva- tives. Computers & Mathematics with Applications, 2011. 5

  33. [33]

    Improving Dynamic Object Interactions in Text-to-Video Generation with AI Feedback

    Hiroki Furuta, Heiga Zen, Dale Schuurmans, Aleksandra Faust, Yutaka Matsuo, Percy Liang, and Sherry Yang. Im- proving Dynamic Object Interactions in Text-to-Video Gen- eration with AI Feedback. arXiv:2412.02617, 2024. 2

  34. [34]

    Learning Physics From Video: Unsupervised Physical Parameter Estimation for Continuous Dynamical Systems

    Alejandro Casta ˜neda Garcia, Jan Warchocki, Jan Van Gemert, Daan Brinks, and Nergis Tomen. Learning Physics From Video: Unsupervised Physical Parameter Estimation for Continuous Dynamical Systems. CVPR, 2025. 2

  35. [35]

    Smoothed particle hydro- dynamics: theory and application to non-spherical stars

    RA Gingold and JJ Monaghan. Smoothed particle hydro- dynamics: theory and application to non-spherical stars. Monthly notices of the royal astronomical society, 1977. 4

  36. [36]

    Fuchs, Ingmar Posner, and Andrea Vedaldi

    Oliver Groth, Fabian B. Fuchs, Ingmar Posner, and Andrea Vedaldi. ShapeStacks: Learning Vision-Based Physical Intuition for Generalised Object Stacking. ECCV, 2018. 3

  37. [37]

    phyworldbench

    Jing Gu, Xian Liu, Yu Zeng, Ashwin Nagarajan, Fangrui Zhu, Daniel Hong, Yue Fan, Qianqi Yan, Kaiwen Zhou, Ming-Yu Liu, and Xin Eric Wang. PhyWorldBench: A Comprehensive Evaluation of Physical Realism in Text-to- Video Models. arXiv:2507.13428, 2025. 3

  38. [38]

    NeuroFluid: Fluid Dynamics Grounding with Particle- Driven Neural Radiance Fields

    Shanyan Guan, Huayu Deng, Yunbo Wang, and Xiaokang Yang. NeuroFluid: Fluid Dynamics Grounding with Particle- Driven Neural Radiance Fields. ICML, 2022. 3

  39. [39]

    Funda- mentals of Physics, Extended, 12th Edition

    David Halliday, Robert Resnick, and Jearl Walker. Funda- mentals of Physics, Extended, 12th Edition. 2021. 2

  40. [40]

    VIDEOSCORE: Build- ing Automatic Metrics to Simulate Fine-grained Human Feedback for Video Generation

    Xuan He, Dongfu Jiang, Ge Zhang, Max Ku, Achint Soni, Sherman Siu, Haonan Chen, Abhranil Chandra, Ziyan Jiang, Aaran Arulraj, Kai Wang, Quy Duc Do, Yuansheng Ni, Bo- han Lyu, Yaswanth Narsupalli, Rongqi Fan, Zhiheng Lyu, Bill Yuchen Lin, and Wenhu Chen. VIDEOSCORE: Build- ing Automatic Metrics to Simulate Fine-grained Human Feedback for Video Generation. ...

  41. [41]

    Universal Language Model Fine-tuning for Text Classification

    Jeremy Howard and Sebastian Ruder. Universal Language Model Fine-tuning for Text Classification. ACL, 2018. 5

  42. [42]

    LoRA: Low-Rank Adaptation of Large Language Models

    Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen- Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-Rank Adaptation of Large Language Models. arXiv:2106.09685, 2021. 5

  43. [43]

    A Moving Least Squares Material Point Method with Displacement Discon- tinuity and Two-Way Rigid Body Coupling

    Yuanming Hu, Yu Fang, Ziheng Ge, Ziyin Qu, Yixin Zhu, Andre Pradhana, and Chenfanfu Jiang. A Moving Least Squares Material Point Method with Displacement Discon- tinuity and Two-Way Rigid Body Coupling. SIGGRAPH,

  44. [44]

    GRASP: A Novel Benchmark for Evaluating Language GRounding and Situ- ated Physics Understanding in Multimodal Language Mod- els

    Serwan Jassim, Mario Holubar, Annika Richter, Cornelius Wolff, Xenia Ohmer, and Elia Bruni. GRASP: A Novel Benchmark for Evaluating Language GRounding and Situ- ated Physics Understanding in Multimodal Language Mod- els. IJCAI, 2024. 3

  45. [45]

    PhysTwin: Physics- Informed Reconstruction and Simulation of Deformable Objects from Videos

    Hanxiao Jiang, Hao-Yu Hsu, Kaifeng Zhang, Hsin-Ni Yu, Shenlong Wang, and Yunzhu Li. PhysTwin: Physics- Informed Reconstruction and Simulation of Deformable Objects from Videos. ICCV, 2025. 3

  46. [46]

    Improving Physics-Augmented Contin- uum Neural Radiance Field-Based Geometry-Agnostic Sys- tem Identification with Lagrangian Particle Optimization

    Takuhiro Kaneko. Improving Physics-Augmented Contin- uum Neural Radiance Field-Based Geometry-Agnostic Sys- tem Identification with Lagrangian Particle Optimization. CVPR, 2024. 2

  47. [47]

    How Far is Video Generation from World Model: A Physical Law Per- spective

    Bingyi Kang, Yang Yue, Rui Lu, Zhijie Lin, Yang Zhao, Kaixin Wang, Gao Huang, and Jiashi Feng. How Far is Video Generation from World Model: A Physical Law Per- spective. ICML, 2025. 2, 5

  48. [48]

    Long Le, Ryan Lucas, Chen Wang, Chuhao Chen, Dinesh Jayaraman, Eric Eaton, and Lingjie Liu. Pixie: Fast and Generalizable Supervised Learning of 3D Physics from Pixels. arXiv:2508.17437, 2025. 3

  49. [49]

    Minh-Quan Le, Yuanzhi Zhu, Vicky Kalogeiton, and Dimitris Samaras. What about gravity in video generation? Post-Training Newton’s Laws with Verifiable Rewards. arXiv:2512.00425, 2025. 3

  50. [50]

    Chenyu Li, Oscar Michel, Xichen Pan, Sainan Liu, Mike Roberts, and Saining Xie. PISA Experiments: Exploring Physics Post-Training for Video Diffusion Models by Watching Stuff Drop. ICML, 2025. 3

  51. [51]

    Dacheng Li, Yunhao Fang, Yukang Chen, Shuo Yang, Shiyi Cao, Justin Wong, Michael Luo, Xiaolong Wang, Hongxu Yin, Joseph E Gonzalez, Ion Stoica, Song Han, and Yao Lu. WorldModelBench: Judging Video Generation Models As World Models. arXiv:2502.20694, 2025. 3

  52. [52]

    Jinxi Li, Ziyang Song, and Bo Yang. NVFi: Neural Velocity Fields for 3D Physics Learning from Dynamic Videos. NeurIPS, 2023. 2

  53. [53]

    Jinxi Li, Ziyang Song, and Bo Yang. TRACE: Learning 3D Gaussian Physical Dynamics from Multi-view Videos. ICCV, 2025. 2, 6, 15

  54. [54]

    Jinxi Li, Ziyang Song, Siyuan Zhou, and Bo Yang. FreeGave: 3D Physics Learning from Dynamic Videos by Gaussian Velocity. CVPR, 2025. 2, 6, 7, 15

  55. [55]

    Xuan Li, Yi-Ling Qiao, Peter Yichen Chen, Krishna Murthy Jatavallabhula, Ming Lin, Chenfanfu Jiang, and Chuang Gan. PAC-NeRF: Physics Augmented Continuum Neural Radiance Fields for Geometry-Agnostic System Identification. ICLR, 2023. 2, 3, 7, 8, 15, 16

  56. [56]

    Yun Li, Yiming Zhang, Tao Lin, Xiangrui Liu, Wenxiao Cai, Zheng Liu, and Bo Zhao. STI-Bench: Are MLLMs Ready for Precise Spatial-Temporal World Understanding? ICCV.

  57. [57]

    Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common Objects in Context. ECCV, 2014. 2

  58. [58]

    Yuchen Lin, Chenguo Lin, Jianjin Xu, and Yadong Mu. OmniPhysGS: 3D Constitutive Gaussians for General Physics-Based Dynamics Generation. ICLR, 2025. 2

  59. [59]

    Fangfu Liu, Hanyang Wang, Shunyu Yao, and Shengjun Zhang. Physics3D: Learning Physical Properties of 3D Gaussians via Video Diffusion. arXiv:2406.04338, 2024. 2

  60. [60]

    Miles Macklin, Matthias Müller, and Nuttapong Chentanez. XPBD: Position-Based Simulation of Compliant Constrained Dynamics. MIG, 2016. 5

  61. [61]

    Fanqing Meng, Wenqi Shao, Lixin Luo, Yahong Wang, Yiran Chen, Quanfeng Lu, Yue Yang, Tianshuo Yang, Kaipeng Zhang, Yu Qiao, and Ping Luo. PhyBench: A Physical Commonsense Benchmark for Evaluating Text-to-Image Models. arXiv:2406.11802, 2024. 3

  62. [62]

    Fanqing Meng, Jiaqi Liao, Xinyu Tan, Wenqi Shao, Quanfeng Lu, Kaipeng Zhang, Yu Cheng, Dianqi Li, Yu Qiao, and Ping Luo. Towards World Simulator: Crafting Physical Commonsense-Based Benchmark for Video Generation. ICML, 2025. 2, 3, 6

  63. [63]

    Saman Motamed, Laura Culp, Kevin Swersky, Priyank Jaini, and Robert Geirhos. Do generative video models learn physical principles from watching videos? arXiv:2501.09038, 2025.

  64. [64]

    OpenAI. GPT-4 Technical Report. arXiv:2303.08774, 2023. 2

  65. [65]

    OpenAI. Video generation models as world simulators. 2024. 2

  66. [66]

    Luis Piloto, Ari Weinstein, Dhruva TB, Arun Ahuja, Mehdi Mirza, Greg Wayne, David Amos, Chia-chun Hung, and Matt Botvinick. Probing Physics Knowledge Using Tools from Developmental Psychology. arXiv:1804.01128, 2018. 3

  67. [67]

    Jordi Pont-Tuset, Federico Perazzi, Sergi Caelles, Pablo Arbeláez, Alex Sorkine-Hornung, and Luc Van Gool. The 2017 DAVIS Challenge on Video Object Segmentation. arXiv:1704.00675, 2017. 8

  68. [68]

    Ava Pun, Kangle Deng, Ruixuan Liu, Deva Ramanan, Changliu Liu, and Jun-Yan Zhu. Generating Physically Stable and Buildable Brick Structures from Text. ICCV.

  69. [69]

    Viorica Pătrăucean, Lucas Smaira, Ankush Gupta, Adrià Recasens Continente, Larisa Markeeva, Dylan Banarse, Skanda Koppula, Joseph Heyward, Mateusz Malinowski, Yi Yang, Carl Doersch, Tatiana Matejovicova, Yury Sulsky, Antoine Miech, Alex Frechette, Hanna Klimczak, Raphael Koster, Junlin Zhang, Stephanie Winkler, Yusuf Aytar, Simon Osindero, Dima Dam... Perception Test: A Diagnostic Benchmark for Multimodal Video Models.

  70. [70]

    Nazneen Fatema Rajani, Rui Zhang, Yi Chern Tan, Stephan Zheng, Jeremy Weiss, Aadit Vyas, Abhijit Gupta, Caiming Xiong, Richard Socher, and Dragomir Radev. ESPRIT: Explaining Solutions to Physical Reasoning Tasks. ACL, 2020. 3

  71. [71]

    Florian Richter, Ryan K Orosco, and Michael C Yip. Image Based Reconstruction of Liquids from 2D Surface Detections. CVPR, 2022. 3

  72. [72]

    Ronan Riochet, Mario Ynocente Castro, Mathieu Bernard, Adam Lerer, Rob Fergus, Véronique Izard, and Emmanuel Dupoux. IntPhys 2019: A Framework for Visual Intuitive Physics Understanding. TPAMI, 2022. 3

  73. [73]

    Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, Patrick Schramowski, Srivatsa Kundurthy, Katherine Crowson, Ludwig Schmidt, Robert Kaczmarczyk, and Jenia Jitsev. LAION-5B: An open large-scale dataset for training next generation image-text models.

  74. [74]

    Alireza Shamsoshoara, Fatemeh Afghah, Abolfazl Razi, Liming Zheng, Peter Z. Fulé, and Erik Blasch. Aerial imagery pile burn detection using deep learning: The FLAME dataset. Computer Networks, 2021. 3

  75. [75]

    Prafull Sharma, Julien Philip, Michaël Gharbi, Bill Freeman, Fredo Durand, and Valentin Deschaintre. Materialistic: Selecting Similar Materials in Images. TOG, 2023. 3

  76. [76]

    Hui Shen, Taiqiang Wu, Qi Han, Yunta Hsieh, Jizhou Wang, Yuyue Zhang, Yuxin Cheng, Zijian Hao, Yuansheng Ni, Xin Wang, Zhongwei Wan, Kai Zhang, Wendong Xu, Jing Xiong, Ping Luo, Wenhu Chen, Chaofan Tao, Zhuoqing Mao, and Ngai Wong. PhyX: Does Your Model Have the "Wits" for Physical Reasoning? arXiv:2505.15929, 2025. 3, 6

  77. [77]

    Sketchfab. https://sketchfab.com/. 4, 3

  78. [78]

    Kevin A. Smith, Lingjie Mei, Shunyu Yao, Jiajun Wu, Elizabeth Spelke, Joshua B. Tenenbaum, and Tomer D. Ullman. Modeling expectation violation in intuitive physics with coarse probabilistic object representations. NeurIPS, 2019. 3

  79. [79]

    Taichi Lang. https://www.taichi-lang.org/. 5

  80. [80]

    Hsiao-Yu Tung, Mingyu Ding, Zhenfang Chen, Daniel Bear, Chuang Gan, Joshua B. Tenenbaum, Daniel LK Yamins, Judith E Fan, and Kevin A. Smith. Physion++: Evaluating Physical Scene Understanding that Requires Online Inference of Different Physical Properties. NeurIPS, 2023. 3
