arxiv: 2606.26694 · v1 · pith:QMDB3662new · submitted 2026-06-25 · 💻 cs.CV

PhysEditWorld: A Large-Scale Dataset Toward Physics-Editable World Models

Pith reviewed 2026-06-26 05:44 UTC · model grok-4.3

classification 💻 cs.CV
keywords PhysEditWorld · physics-editable world models · gravity · UE5 dataset · multimodal data · dynamics modeling · replay paradigm · controllable simulation

The pith

PhysEditWorld supplies over 60 million frames of identical actions replayed under multiple gravity settings to train world models on explicit rather than implicit physics.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current game world models produce visually plausible rollouts but treat physical dynamics as hidden correlations learned from data, which prevents deliberate editing of rules in authored environments. The paper builds PhysEditWorld around a replay pipeline that records normalized action traces, camera policies, and initial states once, then re-renders the identical sequences under different gravity values in 12 UE5 scenes. Each of the more than 60 million frames carries synchronized RGB, depth, normals, audio, actions, engine states, semantic labels, and explicit gravity values. This controlled variation lets models learn gravity as a manipulable input instead of an emergent pattern. A reader would care because the same structure could support authored, rule-consistent simulations in games and virtual environments where physics must remain predictable after edits.
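To make the per-frame contents concrete, here is a minimal sketch of how one sample might be represented and how frames that share an action trace could be grouped into counterfactual sets. The class name `FrameSample`, the field names, the file-path layout, and the m/s² unit are all hypothetical; the source does not describe a file format or loader.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class FrameSample:
    """One rendered frame (hypothetical schema, not the dataset's real format)."""
    scene: str
    trace_id: str        # identifies the recorded action trace
    frame_index: int     # position of the frame within the replay
    gravity: float       # explicit gravity label (unit assumed: m/s^2)
    # Paths to the synchronized modalities: rgb, depth, normals, audio, ...
    modalities: Dict[str, str] = field(default_factory=dict)


GroupKey = Tuple[str, str, int]


def counterfactual_groups(samples: List[FrameSample]) -> Dict[GroupKey, List[FrameSample]]:
    """Group frames that share scene, trace, and frame index but differ in gravity."""
    groups: Dict[GroupKey, List[FrameSample]] = defaultdict(list)
    for sample in samples:
        groups[(sample.scene, sample.trace_id, sample.frame_index)].append(sample)
    # Keep only keys seen under more than one gravity value, sorted by gravity.
    return {
        key: sorted(members, key=lambda s: s.gravity)
        for key, members in groups.items()
        if len({m.gravity for m in members}) > 1
    }


samples = [
    FrameSample("scene_a", "trace_001", 0, 9.81, {"rgb": "a/001/g9.81/0.png"}),
    FrameSample("scene_a", "trace_001", 0, 1.62, {"rgb": "a/001/g1.62/0.png"}),
    FrameSample("scene_a", "trace_002", 0, 9.81, {"rgb": "a/002/g9.81/0.png"}),
]
for key, members in counterfactual_groups(samples).items():
    print(key, [m.gravity for m in members])
```

Grouping by trace and frame index is what would let a training loop compare the same action under different gravity values.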

Core claim

PhysEditWorld is built with a UE5 replay-and-rendering pipeline that fixes action traces, character controllers, and camera policies and varies only gravity across multiple configurations. The result is attributable physical variation across more than 100 hours of interaction and more than 60 million multimodal frames, each carrying an explicit gravity label alongside RGB, depth, normals, audio, and semantic annotations.

What carries the argument

The replay paradigm that records a single normalized action trace and camera policy then replays the identical initial state under multiple gravity configurations inside the UE5 engine.
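The replay idea can be illustrated with a toy simulator. This is a sketch under stated assumptions, not the paper's UE5 pipeline: the one-dimensional jump model, the `State` class, the `JUMP_SPEED` constant, the 30 Hz tick, and the three gravity values (approximate surface gravity of Earth, Mars, and the Moon) are all chosen for illustration only.

```python
from dataclasses import dataclass
from typing import List

JUMP_SPEED = 5.0  # m/s, hypothetical take-off speed of the toy character


@dataclass(frozen=True)
class State:
    height: float    # metres above the ground
    velocity: float  # vertical velocity in m/s, positive is up


def step(state: State, jump: bool, gravity: float, dt: float) -> State:
    """Advance one tick with semi-implicit Euler integration."""
    velocity = state.velocity
    # A jump command only takes effect while the character is on the ground.
    if jump and state.height == 0.0:
        velocity = JUMP_SPEED
    velocity -= gravity * dt
    height = state.height + velocity * dt
    if height <= 0.0:
        # Landed (or never left the ground): clamp to rest.
        height, velocity = 0.0, 0.0
    return State(height, velocity)


def replay(initial: State, actions: List[bool], gravity: float, dt: float = 1 / 30) -> List[State]:
    """Replay one recorded action trace from a fixed initial state."""
    states = [initial]
    for action in actions:
        states.append(step(states[-1], action, gravity, dt))
    return states


# Record once: a single jump followed by no input for the rest of two seconds.
trace = [True] + [False] * 59
start = State(0.0, 0.0)

# Replay the identical trace under several gravity settings.
for g in (9.81, 3.71, 1.62):
    rollout = replay(start, trace, g)
    print(f"gravity={g}: peak height {max(s.height for s in rollout):.2f} m")
```

Because the initial state and the action trace are identical across runs and the simulator is deterministic, any difference between the rollouts is caused by the gravity value. That is the attribution property the dataset depends on, and in a real engine it has to be checked rather than assumed.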

If this is right

  • Generative video models achieve improved gravity-faithful dynamics.
  • World understanding models exhibit greater consistency when physical parameters are edited.
  • The dataset supplies a scalable base for research on controllable rather than purely predictive world models.
  • Explicit gravity labels allow training regimes that treat physics parameters as inputs rather than latent correlations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The replay structure could be reused for other parameters such as friction or bounciness once the gravity version demonstrates clean attribution.
  • Mid-sequence gravity edits become testable because the dataset already isolates the effect of a single parameter change.
  • If UE5 physics matches real-world behavior closely enough, models trained here could transfer to robotic or simulation tasks that require gravity-aware planning.
  • The multimodal labels open the possibility of joint training on vision, audio, and state signals for more robust physics inference.

Load-bearing premise

Replaying fixed action traces and camera policies under changed gravity produces physical differences that can be attributed to gravity alone without engine-side side effects or rendering artifacts.

What would settle it

A model trained on PhysEditWorld predicts rollout frames under a held-out gravity value that deviate from the ground-truth UE5 simulation by the same margin as a model trained without gravity labels.
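The shape of this test can be sketched on a toy problem. The example below assumes ideal free fall (drop distance 0.5·g·t²) in place of UE5 rollouts and a one-parameter least-squares fit in place of a world model. The training gravities, the held-out value of 6.0, and all function names are hypothetical.

```python
from typing import Callable, List, Tuple


def drop_distance(gravity: float, t: float) -> float:
    """Ground truth for the toy problem: distance fallen from rest after t seconds."""
    return 0.5 * gravity * t * t


def fit_slope(xs: List[float], ys: List[float]) -> float:
    """Closed-form least squares for y = w * x with no intercept."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


TRAIN_GRAVITIES = (1.62, 3.71, 9.81)
HELD_OUT_GRAVITY = 6.0
TIMES = [i / 10 for i in range(1, 11)]  # 0.1 s ... 1.0 s

train: List[Tuple[float, float, float]] = [
    (g, t, drop_distance(g, t)) for g in TRAIN_GRAVITIES for t in TIMES
]
targets = [d for _, _, d in train]

# Gravity-conditioned model: the label enters the input feature g * t^2.
w_cond = fit_slope([g * t * t for g, t, _ in train], targets)
# Gravity-blind model: sees only t^2, so gravity is an unobserved nuisance.
w_blind = fit_slope([t * t for _, t, _ in train], targets)


def held_out_error(predict: Callable[[float, float], float]) -> float:
    """Mean absolute error against ground truth at the held-out gravity."""
    errors = [
        abs(predict(HELD_OUT_GRAVITY, t) - drop_distance(HELD_OUT_GRAVITY, t))
        for t in TIMES
    ]
    return sum(errors) / len(errors)


err_cond = held_out_error(lambda g, t: w_cond * g * t * t)
err_blind = held_out_error(lambda g, t: w_blind * t * t)
print(f"conditioned error: {err_cond:.4f}, blind error: {err_blind:.4f}")
```

In this toy setting the conditioned model generalizes to the held-out gravity while the blind model collapses toward an average of the training gravities. The falsifying outcome described above corresponds to the two errors coming out equal on real rollouts.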

Figures

Figures reproduced from arXiv: 2606.26694 by Bin Hu, Chun-Mei Tseng, Hanhui Li, Haoning Wu, Jiaya Jia, Jiehui Huang, Jun Zhou, Kaipeng Zhang, Ruicheng Zhang, Shengju Qian, Xiaodan Liang, Xiu Li, Yanwen Ma, Yaokun Li, Yuechen Zhang, Zijun Wang, ZiLiang Zhang.

Figure 1: Overview of PhysEditWorld. The dataset contains 12 cinematic UE5 scenes, 100+ hours of ...
Figure 2: Overview of the PhysEditWorld data pipeline. A) Input sources and scenario library provide ...
Figure 3: Qualitative results under a free-fall prompt.
Figure 4: First-person world-model case study. First frame is generated by GPT-Image2. Baseline ...
Figure 5: The 12 curated UE5 scene assets used in PhysEditWorld. The asset library covers indoor, ...
Figure 6: Character assets used in the current PhysEditWorld pipeline. Characters are configured ...
Figure 7: Character-centric multi-camera rig used for synchronized capture. (A) Cameras are ...
Figure 8: Socket and spring-arm based camera setup. (A) The character blueprint contains multiple ...
Figure 9: Movie Render Graph configuration for synchronized multimodal export. The graph ...
Figure 10: UE Editor Plugin and Data Generation Workflow
Figure 11: More PhysicsEdit-Data
Figure 12: More PhysicsEdit-Data with other physics config like friction, bounce coefficient
Figure 13: More PhysicsEdit-Data with other physics config like friction, bounce coefficient
Original abstract

Recent game world models can synthesize visually plausible, action-conditioned rollouts. However, their interaction behaviors often remain limited to exploratory or wandering trajectories, and physical dynamics are typically learned as implicit correlations from data rather than as controllable variables. This limitation hinders their applicability to authored game environments, where physical rules are deliberately designed and require explicit manipulation. We introduce PhysEditWorld, a multimodal dataset with physical parameters, with a primary focus on gravity in this initial version. At its core, PhysEditWorld is built upon a replay paradigm implemented with a UE5 replay-and-rendering pipeline. Each scenario records a normalized action trace and replays the same initial state, character controller, action sequence, and camera policy under multiple gravity configurations, enabling controlled and attributable physical variation. PhysEditWorld contains 12 cinematic UE5 scenes, over 100 hours of gameplay interactions, and more than 60 million rendered rollout frames. Each sample provides synchronized multimodal signals, including RGB, depth, normals, audio, action traces, camera trajectory, engine states, semantic annotations, and explicit gravity labels. We further conduct initial utility studies on both generative video models and world understanding models, demonstrating that PhysEditWorld enables improved gravity-faithful dynamics modeling, enhances consistency under physical edits, and provides a scalable foundation for controllable world modeling research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces PhysEditWorld, a multimodal dataset of over 100 hours of UE5 gameplay interactions (60M+ frames) across 12 cinematic scenes. It is constructed via a replay paradigm that records normalized action traces and camera policies from an initial state and replays them under multiple gravity configurations while providing synchronized RGB, depth, normals, audio, actions, camera trajectories, engine states, semantic annotations, and explicit gravity labels. The central claim is that this construction supplies clean, attributable physical variation, and that initial utility studies on generative video models and world-understanding models show the dataset enables improved gravity-faithful dynamics modeling and consistency under physical edits.

Significance. If the replay construction isolates gravity effects without confounding engine or rendering changes, the dataset would supply a scalable, labeled resource for training world models that treat physical parameters as explicit, editable variables rather than implicit correlations learned from exploratory trajectories.

major comments (2)
  1. [Abstract] Abstract: the claim that 'utility studies ... demonstrating that PhysEditWorld enables improved gravity-faithful dynamics modeling' is unsupported by any quantitative results, baselines, error metrics, or ablation tables; without these the central utility assertion cannot be evaluated.
  2. [Abstract] Abstract (replay paradigm paragraph): the assertion of 'controlled and attributable physical variation' rests on the assumption that only gravity changes across replays, yet the manuscript provides no verification that UE5 engine states, collision responses, animation state machines, or rendering pipelines remain invariant when gravity is altered; engine states are supplied but constancy is not demonstrated.
minor comments (1)
  1. The scale figures (12 scenes, >100 hours, >60M frames) are stated without breakdown by scene or gravity setting, making it difficult to assess coverage and balance.
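The missing breakdown matters because the headline numbers leave the counting convention open. The arithmetic below uses only the stated lower bounds (100 hours, 60 million frames). The 30 fps capture rate is an assumption for illustration; the frame rate is not given here.

```python
HOURS_OF_INTERACTION = 100      # stated lower bound
RENDERED_FRAMES = 60_000_000    # stated lower bound
ASSUMED_FPS = 30                # assumption: capture rate is not stated

interaction_seconds = HOURS_OF_INTERACTION * 3600
frames_per_interaction_second = RENDERED_FRAMES / interaction_seconds

# If the 100 hours count each recorded trace once, this ratio is the average
# number of rendered streams per trace (for example gravity settings times
# camera views) at the assumed frame rate.
streams_per_trace = frames_per_interaction_second / ASSUMED_FPS

print(f"{interaction_seconds} s of interaction")
print(f"{frames_per_interaction_second:.2f} rendered frames per interaction second")
print(f"{streams_per_trace:.2f} streams per trace at {ASSUMED_FPS} fps")
```

Under these assumptions each second of recorded interaction corresponds to roughly five or six rendered streams. A per-scene, per-gravity table would show whether that multiplicity comes from gravity settings, camera views, or both.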

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript accordingly to strengthen the presentation of our claims.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the claim that 'utility studies ... demonstrating that PhysEditWorld enables improved gravity-faithful dynamics modeling' is unsupported by any quantitative results, baselines, error metrics, or ablation tables; without these the central utility assertion cannot be evaluated.

    Authors: We agree that the abstract phrasing is too strong given the preliminary nature of the studies. Section 5 presents initial experiments on generative video models (with qualitative rollouts) and world-understanding models (with gravity classification accuracy), but these lack the full quantitative baselines, error metrics, and ablations the referee correctly notes are needed for a robust claim. We will revise the abstract to describe the studies as 'preliminary' and point explicitly to Section 5, and we will expand the experiments in revision with additional quantitative metrics and comparisons.

    Revision: yes

  2. Referee: [Abstract] Abstract (replay paradigm paragraph): the assertion of 'controlled and attributable physical variation' rests on the assumption that only gravity changes across replays, yet the manuscript provides no verification that UE5 engine states, collision responses, animation state machines, or rendering pipelines remain invariant when gravity is altered; engine states are supplied but constancy is not demonstrated.

    Authors: The referee is correct that explicit verification of invariance is missing. The replay paradigm relies on UE5's deterministic replay system with fixed action traces and initial states, and engine states are recorded to enable post-hoc checks, but we did not include any comparative analysis confirming that non-gravity components (e.g., collision responses outside gravity effects or animation states) remain unchanged. We will add this verification, using the supplied engine state logs, to Section 3 or a new appendix in the revised manuscript.

    Revision: yes
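A post-hoc check of the kind discussed in this exchange could look like the sketch below. It assumes each replay is logged as a list of per-tick dictionaries of numeric fields. The field names (`input_axis`, `camera_id`, `height`) and the allow-list of gravity-dependent fields are hypothetical and do not come from the dataset's engine-state format.

```python
from typing import Dict, Iterable, List, Set

Tick = Dict[str, float]


def differing_fields(run_a: List[Tick], run_b: List[Tick], tol: float = 1e-6) -> Set[str]:
    """Return the names of logged fields that differ between two replays."""
    if len(run_a) != len(run_b):
        raise ValueError("replays have different lengths")
    changed: Set[str] = set()
    for tick_a, tick_b in zip(run_a, run_b):
        if tick_a.keys() != tick_b.keys():
            raise ValueError("replays log different fields")
        for name in tick_a:
            if abs(tick_a[name] - tick_b[name]) > tol:
                changed.add(name)
    return changed


def unexpected_changes(run_a: List[Tick], run_b: List[Tick], allowed: Iterable[str]) -> Set[str]:
    """Fields that changed even though they are not expected to depend on gravity."""
    return differing_fields(run_a, run_b) - set(allowed)


# Toy logs for the same action trace under two gravity settings.
low_gravity = [
    {"input_axis": 1.0, "camera_id": 2.0, "height": 0.0},
    {"input_axis": 0.0, "camera_id": 2.0, "height": 0.8},
]
high_gravity = [
    {"input_axis": 1.0, "camera_id": 2.0, "height": 0.0},
    {"input_axis": 0.0, "camera_id": 2.0, "height": 0.3},
]

print(unexpected_changes(low_gravity, high_gravity, allowed={"height"}))
```

The hard part in practice is choosing the allow-list. Gravity legitimately changes positions, contact timing, and possibly animation states, so the check is only as strong as the argument for which fields must stay constant.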

Circularity Check

0 steps flagged

No derivation chain present; dataset contribution is self-contained

The paper introduces a multimodal dataset constructed via a UE5 replay paradigm that records and replays fixed action traces under varying gravity settings. No equations, fitted parameters, predictions, or first-principles derivations are described that could reduce to their own inputs by construction. Claims about enabling gravity-faithful modeling are presented as downstream utility studies rather than internal results forced by the dataset construction itself. The contribution is empirical data collection, not a closed-loop theoretical or predictive system, so no circularity patterns apply.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work rests on the domain assumption that the UE5 engine produces physically consistent rollouts when gravity is varied while holding other inputs fixed; no free parameters or invented entities are introduced in the abstract.

axioms (1)
  • Domain assumption: UE5 physics simulation produces attributable changes when only gravity is modified while action traces and initial states remain identical.
    Invoked by the replay paradigm description in the abstract.

Reference graph

Works this paper leans on

83 extracted references · 3 canonical work pages

  1. [1]

    Jake Bruce, Michael Dennis, Ashley Edwards, Jack Parker-Holder, Yuge Shi, Edward Hughes, Matthew Lai, Aditi Mavalankar, Richie Steigerwald, Chris Apps, Yusuf Aytar, Sarah Bechtle, Feryal Behbahani, Stephanie C. Y . Chan, Nicolas Heess, Lucy Gonzalez, Simon Osindero, Sherjil Ozair, Scott Reed, Jingwei Zhang, Konrad Zolna, Jeff Clune, Nando de Freitas, Sati...

  2. [2]

    Diffusion for world modeling: Visual details matter in Atari

    Eloi Alonso, Adam Jelley, Vincent Micheli, Anssi Kanervisto, Amos Storkey, Tim Pearce, and Francois Fleuret. Diffusion for world modeling: Visual details matter in Atari. InAdvances in Neural Information Processing Systems, 2024

  3. [3]

    Diffusion models are real- time game engines

    Dani Valevski, Yaniv Leviathan, Moab Arar, and Shlomi Fruchter. Diffusion models are real- time game engines. InInternational Conference on Learning Representations, 2025. URL https://arxiv.org/abs/2408.14837

  4. [4]

    Gamegen-x: Interactive open-world game video generation

    Haoxuan Che, Xuanhua He, Quande Liu, Cheng Jin, and Hao Chen. Gamegen-x: Interactive open-world game video generation. InInternational Conference on Learning Representations, 2025

  5. [5]

    Advancing open-source world models.arXiv preprint arXiv:2601.20540, 2026

    Robbyant Team, Zelin Gao, Qiuyu Wang, Yanhong Zeng, Jiapeng Zhu, Ka Leong Cheng, Yixuan Li, Hanlin Wang, Yinghao Xu, Shuailei Ma, Yihang Chen, Jie Liu, Yansong Cheng, Yao Yao, Jiayi Zhu, Yihao Meng, Kecheng Zheng, Qingyan Bai, Jingye Chen, Zehong Shen, Yue Yu, Xing Zhu, Yujun Shen, and Hao Ouyang. Advancing open-source world models.arXiv preprint arXiv:26...

  6. [6]

    Matrix-game 3.0: Real-time and streaming interactive world model with long-horizon memory.arXiv preprint arXiv:2604.08995, 2026

    Zile Wang, Zexiang Liu, Jiaxing Li, Kaichen Huang, Baixin Xu, Fei Kang, Mengyin An, Peiyu Wang, Biao Jiang, Yichen Wei, Yidan Xietian, Jiangbo Pei, Liang Hu, Boyi Jiang, Hua Xue, Zidong Wang, Haofeng Sun, Wei Li, Wanli Ouyang, Xianglong He, Yang Liu, Yangguang Li, and Yahui Zhou. Matrix-game 3.0: Real-time and streaming interactive world model with long-h...

  7. [7]

    Yume: An interactive world generation model.arXiv preprint arXiv:2507.17744, 2025

    Xiaofeng Mao, Shaoheng Lin, Zhen Li, Chuanhao Li, Wenshuo Peng, Tong He, Jiangmiao Pang, Mingmin Chi, Yu Qiao, and Kaipeng Zhang. Yume: An interactive world generation model.arXiv preprint arXiv:2507.17744, 2025

  8. [8]

    Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling

    Marc G. Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environment: An evaluation platform for general agents.Journal of Artificial Intelligence Research, 47:253–279, 2013

  9. [9]

    Leveraging procedural generation to benchmark reinforcement learning

    Karl Cobbe, Christopher Hesse, Jacob Hilton, and John Schulman. Leveraging procedural generation to benchmark reinforcement learning. InInternational Conference on Machine Learning, 2020

  10. [10]

    Guss, Brandon Houghton, Nicholay Topin, Phillip Wang, Cayden Codel, Manuela Veloso, and Ruslan Salakhutdinov

    William H. Guss, Brandon Houghton, Nicholay Topin, Phillip Wang, Cayden Codel, Manuela Veloso, and Ruslan Salakhutdinov. MineRL: A large-scale dataset of Minecraft demonstrations. InInternational Joint Conference on Artificial Intelligence, 2019. 10

  11. [11]

    CARLA: An open urban driving simulator

    Alexey Dosovitskiy, German Ros, Felipe Codevilla, Antonio Lopez, and Vladlen Koltun. CARLA: An open urban driving simulator. InConference on Robot Learning, 2017

  12. [12]

    Sekai: A video dataset towards world exploration

    Zhen Li, Chuanhao Li, Xiaofeng Mao, Shaoheng Lin, Ming Li, Shitian Zhao, Zhaopan Xu, Xinyue Li, Yukang Feng, Jianwen Sun, et al. Sekai: A video dataset towards world exploration. InAdvances in Neural Information Processing Systems Datasets and Benchmarks Track, 2025

  13. [13]

    PHYRE: A new benchmark for physical reasoning

    Anton Bakhtin, Laurens van der Maaten, Justin Johnson, Laura Gustafson, and Ross Gir- shick. PHYRE: A new benchmark for physical reasoning. InAdvances in Neural Information Processing Systems, 2019

  14. [14]

    Tenenbaum

    Kexin Yi, Chuang Gan, Yunzhu Li, Pushmeet Kohli, Jiajun Wu, Antonio Torralba, and Joshua B. Tenenbaum. CLEVRER: Collision events for video representation and reasoning. InInterna- tional Conference on Learning Representations, 2020

  15. [15]

    Bear, Elias Wang, Damian Mrowca, Felix J

    Daniel M. Bear, Elias Wang, Damian Mrowca, Felix J. Binder, et al. Physion: Evaluating physical prediction from vision in humans and machines. InAdvances in Neural Information Processing Systems Datasets and Benchmarks Track, 2021

  16. [16]

    VBench: Comprehensive benchmark suite for video generative models

    Ziqi Huang, Yinan He, Jiashuo Yu, et al. VBench: Comprehensive benchmark suite for video generative models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

  17. [17]

    VideoPhy: Evaluating physical commonsense for video generation

    Hritik Bansal, Zongyu Lin, Tianyi Xie, Zeshun Zong, Michal Yarom, Yonatan Bitton, Chenfanfu Jiang, Yizhou Sun, Kai-Wei Chang, and Aditya Grover. VideoPhy: Evaluating physical commonsense for video generation. InInternational Conference on Learning Representations, 2025

  18. [18]

    Towards world simulator: Crafting physical commonsense- based benchmark for video generation

    Fanqing Meng, Jiaqi Liao, Xinyu Tan, Wenqi Shao, Quanfeng Lu, Kaipeng Zhang, Yu Cheng, Dianqi Li, Yu Qiao, and Ping Luo. Towards world simulator: Crafting physical commonsense- based benchmark for video generation. InInternational Conference on Machine Learning, 2025

  19. [19]

    Do gen- erative video models understand physical principles? InProceedings of the IEEE/CVF Win- ter Conference on Applications of Computer Vision, pages 948–958, March 2026

    Saman Motamed, Laura Culp, Kevin Swersky, Priyank Jaini, and Robert Geirhos. Do gen- erative video models understand physical principles? InProceedings of the IEEE/CVF Win- ter Conference on Applications of Computer Vision, pages 948–958, March 2026. URL https://arxiv.org/abs/2501.09038

  20. [20]

    Physinone: Visual physics learning and reasoning in one suite.arXiv preprint arXiv:2604.09415, 2026

    Siyuan Zhou, Hejun Wang, Hu Cheng, Jinxi Li, Dongsheng Wang, Junwei Jiang, Yixiao Jin, Jiayue Huang, Shiwei Mao, Shangjia Liu, et al. Physinone: Visual physics learning and reasoning in one suite.arXiv preprint arXiv:2604.09415, 2026

  21. [21]

    Learning to simulate dynamic environments with GameGAN

    Seung Wook Kim, Yuhao Zhou, Jonah Philion, Antonio Torralba, and Sanja Fidler. Learning to simulate dynamic environments with GameGAN. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020

  22. [22]

    Playable video generation

    Willi Menapace, Stephane Lathuiliere, Sergey Tulyakov, Aliaksandr Siarohin, and Elisa Ricci. Playable video generation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021

  23. [23]

    Playable environments: Video manipula- tion in space and time

    Willi Menapace, Stéphane Lathuilière, Aliaksandr Siarohin, Christian Theobalt, Sergey Tulyakov, Vladislav Golyanik, and Elisa Ricci. Playable environments: Video manipula- tion in space and time. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3584–3593, 2022. URL https://arxiv.org/abs/2203.01914

  24. [24]

    Promptable game models: Text-guided game simulation via masked diffusion models.ACM Transactions on Graphics, 2024

    Willi Menapace, Aliaksandr Siarohin, Stephane Lathuiliere, Panos Achlioptas, Vladislav Golyanik, Sergey Tulyakov, and Elisa Ricci. Promptable game models: Text-guided game simulation via masked diffusion models.ACM Transactions on Graphics, 2024

  25. [25]

    Oasis: A universe in a transformer

    Decart, Julian Quevedo, Quinn McIntyre, Spruce Campbell, Xinlei Chen, and Robert Wachen. Oasis: A universe in a transformer. Project page, 2024. URLhttps://oasis-model.github. io/. 11

  26. [26]

    Gamefactory: Creating new games with generative interactive videos

    Jiwen Yu, Yiran Qin, Xintao Wang, Pengfei Wan, Di Zhang, and Xihui Liu. Gamefactory: Creating new games with generative interactive videos. InProceedings of the IEEE/CVF International Conference on Computer Vision, 2025

  27. [27]

    Mineworld: A real-time and open-source interactive world model on Minecraft.arXiv preprint arXiv:2504.08388, 2025

    Junliang Guo, Yang Ye, Tianyu He, Haoyu Wu, Yushu Jiang, Tim Pearce, and Jiang Bian. Mineworld: A real-time and open-source interactive world model on Minecraft.arXiv preprint arXiv:2504.08388, 2025

  28. [28]

    Learning world models for interactive video generation.arXiv preprint arXiv:2505.21996, 2025

    Taiye Chen, Xun Hu, Zihan Ding, and Chi Jin. Learning world models for interactive video generation.arXiv preprint arXiv:2505.21996, 2025

  29. [29]

    Identity-consistent video generation under large facial-angle variations,

    Bin Hu, Zipeng Qi, Guoxi Huang, Zunnan Xu, Ruicheng Zhang, Chongjie Ye, Jun Zhou, Xiu Li, and Jingdong Wang. Identity-consistent video generation under large facial-angle variations,

  30. [30]

    URLhttps://arxiv.org/abs/2603.21299

  31. [31]

    Video pretraining (VPT): Learning to act by watching unlabeled online videos

    Bowen Baker, Ilge Akkaya, Peter Zhokhov, Joost Huizinga, Jie Tang, Adrien Ecoffet, Brandon Houghton, Raul Sampedro, and Jeff Clune. Video pretraining (VPT): Learning to act by watching unlabeled online videos. InAdvances in Neural Information Processing Systems, 2022. URLhttps://arxiv.org/abs/2206.11795

  32. [32]

    MineDojo: Building open-ended embodied agents with internet-scale knowledge

    Linxi Fan, Guanzhi Wang, Yunfan Jiang, Ajay Mandlekar, Yuncong Yang, Haoyi Zhu, Andrew Tang, De-An Huang, Yuke Zhu, and Anima Anandkumar. MineDojo: Building open-ended embodied agents with internet-scale knowledge. InAdvances in Neural Information Processing Systems Datasets and Benchmarks Track, 2022. URL https://openreview.net/forum? id=rc8o_j8I8PX

  33. [33]

    Wildworld: A large-scale dataset for dynamic world modeling with actions and explicit state toward generative ARPG.arXiv preprint arXiv:2603.23497, 2026

    Zhen Li, Zian Meng, Shuwei Shi, Wenshuo Peng, Yuwei Wu, Bo Zheng, Chuanhao Li, and Kaipeng Zhang. Wildworld: A large-scale dataset for dynamic world modeling with actions and explicit state toward generative ARPG.arXiv preprint arXiv:2603.23497, 2026. URL https://arxiv.org/abs/2603.23497

  34. [34]

    Multiworld: Scalable multi-agent multi- view video world models.arXiv preprint arXiv:2604.18564, 2026

    Haoyu Wu, Jiwen Yu, Yingtian Zou, and Xihui Liu. Multiworld: Scalable multi-agent multi- view video world models.arXiv preprint arXiv:2604.18564, 2026

  35. [35]

    Precise action-to-video generation through visual action prompts

    Yuang Wang, Chao Wen, Haoyu Guo, Sida Peng, Minghan Qin, Hujun Bao, Xiaowei Zhou, and Ruizhen Hu. Precise action-to-video generation through visual action prompts. InProceedings of the IEEE/CVF International Conference on Computer Vision, 2025. URL https://arxiv. org/abs/2508.13104

  36. [36]

    CLIP meets GamePhysics: Towards bug identification in gameplay videos using zero-shot transfer learning

    Mohammad Reza Taesiri, Finlay Macklon, and Cor-Paul Bezemer. CLIP meets GamePhysics: Towards bug identification in gameplay videos using zero-shot transfer learning. In2022 IEEE/ACM 19th International Conference on Mining Software Repositories, pages 270–281,

  37. [37]

    URLhttps://arxiv.org/abs/2203.11096

    doi:10.1145/3524842.3528438. URLhttps://arxiv.org/abs/2203.11096

  38. [38]

    Fuchs, Ingmar Posner, and Andrea Vedaldi

    Oliver Groth, Fabian B. Fuchs, Ingmar Posner, and Andrea Vedaldi. ShapeStacks: Learning vision-based physical intuition for generalised object stacking. InEuropean Conference on Computer Vision, 2018. URLhttps://arxiv.org/abs/1804.08018

  39. [39]

    Smith, Lingjie Mei, Shunyu Yao, Jiajun Wu, Elizabeth S

    Kevin A. Smith, Lingjie Mei, Shunyu Yao, Jiajun Wu, Elizabeth S. Spelke, Joshua B. Tenenbaum, and Tomer D. Ullman. Modeling expectation violation in intuitive physics with coarse probabilistic object representations. InAdvances in Neural Information Pro- cessing Systems, 2019. URL https://proceedings.neurips.cc/paper/2019/hash/ e88f243bf341ded9b4ced444795...

  40. [40]

    I-PHYRE: Interactive physical reasoning

    Shiqian Li, Kewen Wu, Chi Zhang, and Yixin Zhu. I-PHYRE: Interactive physical reasoning. InInternational Conference on Learning Representations, 2024. URL https://openreview. net/forum?id=1bbPQShCT2

  41. [41]

    IntPhys 2019: A benchmark for visual intuitive physics understanding.IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(9):5016– 5025, 2022

    Ronan Riochet, Mario Ynocente Castro, Mathieu Bernard, Adam Lerer, Rob Fergus, Veronique Izard, and Emmanuel Dupoux. IntPhys 2019: A benchmark for visual intuitive physics understanding.IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(9):5016– 5025, 2022. 12

  42. [42]

    Kao, Adina Williams, Michael Rabbat, and Em- manuel Dupoux

    Florian Bordes, Quentin Garrido, Justine T. Kao, Adina Williams, Michael Rabbat, and Em- manuel Dupoux. IntPhys 2: Benchmarking intuitive physics understanding in complex synthetic environments.arXiv preprint arXiv:2506.09849, 2025. URL https://arxiv.org/abs/ 2506.09849

  43. [43]

    CATER: A diagnostic dataset for compositional actions and TEmporal reasoning

    Rohit Girdhar and Deva Ramanan. CATER: A diagnostic dataset for compositional actions and TEmporal reasoning. InInternational Conference on Learning Representations, 2020. URL https://arxiv.org/abs/1910.04744

  44. [44]

    CoPhy: Counter- factual learning of physical dynamics

    Fabien Baradel, Natalia Neverova, Julien Mille, Greg Mori, and Christian Wolf. CoPhy: Counter- factual learning of physical dynamics. InInternational Conference on Learning Representations,

  45. [45]

    URLhttps://openreview.net/forum?id=SkeyppEFvS

  46. [46]

    Tenenbaum, and Chuang Gan

    Zhenfang Chen, Kexin Yi, Yunzhu Li, Mingyu Ding, Antonio Torralba, Joshua B. Tenenbaum, and Chuang Gan. ComPhy: Compositional physical reasoning of objects and events from videos. InInternational Conference on Learning Representations, 2022. URL https://openreview. net/forum?id=PgNEYaIc81Q

  47. [47]

    CRIPP-VQA: Counterfac- tual reasoning about implicit physical properties via video question answering

    Maitreya Patel, Tejas Gokhale, Chitta Baral, and Yezhou Yang. CRIPP-VQA: Counterfac- tual reasoning about implicit physical properties via video question answering. InProceed- ings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 9856–9870, Abu Dhabi, United Arab Emirates, 2022. Association for Computational Lin- guistics....

  48. [48]

    Tenenbaum, and Chuang Gan

    Zhicheng Zheng, Xin Yan, Zhenfang Chen, Jingzhou Wang, Qin Zhi Eddie Lim, Joshua B. Tenenbaum, and Chuang Gan. ContPhy: Continuum physical concept learning and reasoning from videos. InProceedings of the 41st International Conference on Machine Learning, volume 235 ofProceedings of Machine Learning Research, pages 61526–61558. PMLR, 2024. URL https://proc...

  49. [49]

    Aaron Foss, Chloe Evans, Sasha Mitts, Koustuv Sinha, Ammar Rizvi, and Justine T. Kao. CausalVQA: A physically grounded causal reasoning benchmark for video models.arXiv preprint arXiv:2506.09943, 2025

  50. [50]

    QuantiPhy: A quantitative benchmark evaluating physical reasoning abilities of vision-language models.arXiv preprint arXiv:2512.19526, 2025

    Puyin Li, Tiange Xiang, Ella Mao, Shirley Wei, Xinye Chen, Adnan Masood, Li Fei-Fei, and Ehsan Adeli. QuantiPhy: A quantitative benchmark evaluating physical reasoning abilities of vision-language models.arXiv preprint arXiv:2512.19526, 2025

  51. [51]

    VBench-2.0: Advancing video generation benchmark suite for intrinsic faithfulness.arXiv preprint arXiv:2503.21755, 2025

    Dian Zheng, Ziqi Huang, Hongbo Liu, Kai Zou, Yinan He, Fan Zhang, Lulu Gu, Yuanhan Zhang, Jingwen He, Wei-Shi Zheng, Yu Qiao, and Ziwei Liu. VBench-2.0: Advancing video generation benchmark suite for intrinsic faithfulness.arXiv preprint arXiv:2503.21755, 2025. URLhttps://arxiv.org/abs/2503.21755

  52. [52]

    Worldscore: A unified evaluation benchmark for world generation

    Haoyi Duan, Hong-Xing Yu, Sirui Chen, Li Fei-Fei, and Jiajun Wu. Worldscore: A unified evaluation benchmark for world generation. InProceedings of the IEEE/CVF International Conference on Computer Vision, 2025

  53. [53]

    What about gravity in video generation? post-training newton’s laws with verifiable rewards.arXiv preprint arXiv:2512.00425, 2025

    Minh-Quan Le, Yuanzhi Zhu, Vicky Kalogeiton, and Dimitris Samaras. What about gravity in video generation? post-training newton’s laws with verifiable rewards.arXiv preprint arXiv:2512.00425, 2025

  54. [54]

    Mind-v: Hierarchical world model for long-horizon robotic manipulation with rl-based physical alignment, 2026

    Ruicheng Zhang, Mingyang Zhang, Jun Zhou, Xiaofan Liu, Zunnan Xu, Zhizhou Zhong, Puxin Yan, Haocheng Luo, and Xiu Li. Mind-v: Hierarchical world model for long-horizon robotic manipulation with rl-based physical alignment, 2026. URL https://arxiv.org/abs/2512. 06628

  55. [55]

    Robostereo: Dual-tower 4d embodied world models for unified policy optimization, 2026

    Ruicheng Zhang, Guangyu Chen, Zunnan Xu, Zihao Liu, Zhizhou Zhong, Mingyang Zhang, Jun Zhou, and Xiu Li. Robostereo: Dual-tower 4d embodied world models for unified policy optimization, 2026. URLhttps://arxiv.org/abs/2603.12639

  56. [56]

    PhysGen: Rigid-body physics-grounded image-to-video generation

    Shaowei Liu, Zhongzheng Ren, Saurabh Gupta, and Shenlong Wang. PhysGen: Rigid-body physics-grounded image-to-video generation. In European Conference on Computer Vision,

  57. [57]

    doi:10.1007/978-3-031-73007-8_21. URL https://arxiv.org/abs/2409.18964

  58. [58]

    Force prompting: Video generation models can learn and generalize physics-based control signals

    Nate Gillman, Charles Herrmann, Michael Freeman, Daksh Aggarwal, Evan Luo, Deqing Sun, and Chen Sun. Force prompting: Video generation models can learn and generalize physics-based control signals. In Advances in Neural Information Processing Systems, 2025. URL https://arxiv.org/abs/2505.19386

  59. [59]

    PhysCtrl: Generative physics for controllable and physics-grounded video generation. arXiv preprint arXiv:2509.20358, 2025

    Chen Wang, Chuhao Chen, Yiming Huang, Zhiyang Dou, Yuan Liu, Jiatao Gu, and Lingjie Liu. PhysCtrl: Generative physics for controllable and physics-grounded video generation. arXiv preprint arXiv:2509.20358, 2025

  60. [60]

    PhysChoreo: Physics-controllable video generation with part-aware semantic grounding

    Haoze Zhang, Tianyu Huang, Zichen Wan, Xiaowei Jin, Hongzhi Zhang, Hui Li, and Wangmeng Zuo. PhysChoreo: Physics-controllable video generation with part-aware semantic grounding. arXiv preprint arXiv:2511.20562, 2025

  61. [61]

    Yu Yuan, Xijun Wang, Tharindu Wickremasinghe, Zeeshan Nadir, Bole Ma, and Stanley H. Chan. NewtonGen: Physics-consistent and controllable text-to-video generation via neural newtonian dynamics. In International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=rJ6N6sunaU

  62. [62]

    Phyco: Learning controllable physical priors for generative motion, 2026

    Sriram Narayanan, Ziyu Jiang, Srinivasa Narasimhan, and Manmohan Chandraker. Phyco: Learning controllable physical priors for generative motion, 2026. URL https://arxiv.org/abs/2604.28169

  63. [63]

    PISA experiments: Exploring physics post-training for video diffusion models by watching stuff drop

    Chenyu Li, Oscar Michel, Xichen Pan, Sainan Liu, Mike Roberts, and Saining Xie. PISA experiments: Exploring physics post-training for video diffusion models by watching stuff drop. In International Conference on Machine Learning, 2025. URL https://arxiv.org/abs/2503.09595

  64. [64]

    AI2-THOR: An interactive 3d environment for visual AI. arXiv preprint arXiv:1712.05474, 2017

    Eric Kolve, Roozbeh Mottaghi, Winson Han, Eli VanderBilt, Luca Weihs, Alvaro Herrasti, Matt Deitke, Kiana Ehsani, Daniel Gordon, Yuke Zhu, et al. AI2-THOR: An interactive 3d environment for visual AI. arXiv preprint arXiv:1712.05474, 2017

  65. [65]

    RoboTHOR: An open simulation-to-real embodied AI platform

    Matt Deitke, Winson Han, Alvaro Herrasti, Aniruddha Kembhavi, Eric Kolve, Roozbeh Mottaghi, Jordi Salvador, Dustin Schwenk, Eli VanderBilt, Matthew Wallingford, Luca Weihs, Mark Yatskar, and Ali Farhadi. RoboTHOR: An open simulation-to-real embodied AI platform. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020. ...

  66. [66]

    ProcTHOR: Large-scale embodied AI using procedural generation

    Matt Deitke, Eli VanderBilt, Alvaro Herrasti, Luca Weihs, Jordi Salvador, Kiana Ehsani, Winson Han, Eric Kolve, Ali Farhadi, Aniruddha Kembhavi, and Roozbeh Mottaghi. ProcTHOR: Large-scale embodied AI using procedural generation. In Advances in Neural Information Processing Systems, 2022. URL https://proceedings.neurips.cc/paper_files/paper/2022/hash/...

  67. [67]

    Habitat: A platform for embodied AI research

    Manolis Savva, Abhishek Kadian, Oleksandr Maksymets, Yili Zhao, Erik Wijmans, Bhavana Jain, Julian Straub, Jia Liu, Vladlen Koltun, Jitendra Malik, et al. Habitat: A platform for embodied AI research. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019

  68. [68]

    Gibson env: Real-world perception for embodied agents

    Fei Xia, Amir R. Zamir, Zhiyang He, Alexander Sax, Jitendra Malik, and Silvio Savarese. Gibson env: Real-world perception for embodied agents. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018. URL https://openaccess.thecvf.com/content_cvpr_2018/html/Xia_Gibson_Env_Real-World_CVPR_2018_paper.html

  69. [69]

    iGibson 2.0: Object-centric simulation for robot learning of everyday household tasks

    Chengshu Li, Fei Xia, Roberto Martín-Martín, Michael Lingelbach, Sanjana Srivastava, Bokui Shen, Kent Elliott Vainio, Cem Gokmen, Gokul Dharan, Tanish Jain, Andrey Kurenkov, Karen Liu, Hyowon Gweon, Jiajun Wu, Li Fei-Fei, and Silvio Savarese. iGibson 2.0: Object-centric simulation for robot learning of everyday household tasks. In Proceedings of the 5th Co...

  70. [70]

    ThreeDWorld: A platform for interactive multi-modal physical simulation

    Chuang Gan, Jeremy Schwartz, Seth Alter, Damian Mrowca, Martin Schrimpf, James Traer, Julian De Freitas, Jonas Kubilius, Abhishek Bhandwaldar, Nick Haber, et al. ThreeDWorld: A platform for interactive multi-modal physical simulation. In Advances in Neural Information Processing Systems Datasets and Benchmarks Track, 2021

  71. [71]

    Kubric: A scalable dataset generator

    Klaus Greff, Francois Belletti, Lucas Beyer, Carl Doersch, Yilun Du, Daniel Duckworth, David J. Fleet, Daniel Gnanapragasam, Florian Golemo, Charles Herrmann, et al. Kubric: A scalable dataset generator. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022

  72. [72]

    VirtualHome: Simulating household activities via programs

    Xavier Puig, Kevin Ra, Marko Boben, Jiaman Li, Tingwu Wang, Sanja Fidler, and Antonio Torralba. VirtualHome: Simulating household activities via programs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8494–8502, 2018. URL https://arxiv.org/abs/1806.07011

  73. [73]

    BEHAVIOR-1k: A human-centered, embodied AI benchmark with 1,000 everyday activities and realistic simulation

    Chengshu Li, Ruohan Zhang, Josiah Wong, Cem Gokmen, Sanjana Srivastava, Roberto Martín-Martín, Chen Wang, Gabrael Levine, Wensi Ai, Benjamin Martinez, et al. BEHAVIOR-1k: A human-centered, embodied AI benchmark with 1,000 everyday activities and realistic simulation. arXiv preprint arXiv:2403.09227, 2024

  74. [74]

    SAPIEN: A simulated part-based interactive environment

    Fanbo Xiang, Yuzhe Qin, Kaichun Mo, Yikuan Xia, Hao Zhu, Fangchen Liu, Minghua Liu, Hanxiao Jiang, Yifu Yuan, He Wang, Li Yi, Angel X. Chang, Leonidas J. Guibas, and Hao Su. SAPIEN: A simulated part-based interactive environment. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020. URL https://arxiv.org/abs/2003.08515

  75. [75]

    Stephen James, Zicong Ma, David Rovick Arrojo, and Andrew J. Davison. RLBench: The robot learning benchmark and learning environment. IEEE Robotics and Automation Letters, 5(2):3019–3026, 2020. URL https://arxiv.org/abs/1909.12271

  76. [76]

    ManiSkill2: A unified benchmark for generalizable manipulation skills

    Jiayuan Gu, Fanbo Xiang, Xuanlin Li, Zhan Ling, Xiqiang Liu, Tongzhou Mu, Yihe Tang, Stone Tao, Xinyue Wei, Yunchao Yao, Xiaodi Yuan, Pengwei Xie, Zhiao Huang, Rui Chen, and Hao Su. ManiSkill2: A unified benchmark for generalizable manipulation skills. In International Conference on Learning Representations, 2023. URL https://arxiv.org/abs/2302.04659

  77. [77]

    Isaac Gym: High performance GPU-based physics simulation for robot learning

    Viktor Makoviychuk, Lukasz Wawrzyniak, Yunrong Guo, Michelle Lu, Kier Storey, Miles Macklin, David Hoeller, Nikita Rudin, Arthur Allshire, Ankur Handa, and Gavriel State. Isaac Gym: High performance GPU-based physics simulation for robot learning. In Advances in Neural Information Processing Systems Datasets and Benchmarks Track, 2021. URL https://arxiv.or...

  78. [78]

    UnrealCV: Connecting computer vision to unreal engine. arXiv preprint arXiv:1609.01326, 2016

    Weichao Qiu and Alan Yuille. UnrealCV: Connecting computer vision to unreal engine. arXiv preprint arXiv:1609.01326, 2016. URL https://arxiv.org/abs/1609.01326

  79. [79]

    UnrealZoo: Enriching photo-realistic virtual worlds for embodied AI

    Fangwei Zhong, Kui Wu, Churan Wang, Hao Chen, Hai Ci, Zhoujun Li, and Yizhou Wang. UnrealZoo: Enriching photo-realistic virtual worlds for embodied AI. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025. URL https://arxiv.org/abs/2412.20977. Highlight

  80. [80]

    Qwen3-VL technical report, 2025

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Junyang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shixuan ...

Showing first 80 references.