OptiWorld: Optimal Control for Video World Generation under Physical Constraints

Daiqing Li; Jianhao Yuan; Liu He; Lu Ling; Stanley H. Chan; Xijun Wang; Yu Yuan

arxiv: 2606.00499 · v1 · pith:XGDN2BVSnew · submitted 2026-05-30 · 💻 cs.CV

OptiWorld: Optimal Control for Video World Generation under Physical Constraints

Yu Yuan , Jianhao Yuan , Xijun Wang , Daiqing Li , Liu He , Lu Ling , Stanley H. Chan This is my paper

Pith reviewed 2026-06-28 19:00 UTC · model grok-4.3

classification 💻 cs.CV

keywords video generationoptimal controlphysical constraintsworld modelstrajectory planningmanifold optimizationimage-to-videodynamics editing

0 comments

The pith

OptiWorld adds an optimal-control layer at inference time that extracts a world state, plans a physically constrained trajectory on a continuous manifold, and conditions video rendering on the plan.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Video generation models create plausible motion but often produce trajectories that are unsafe, inefficient, or physically inconsistent with task goals. OptiWorld inserts a classical optimal-control step that first extracts a compact task-relevant world state from the model, then solves for an optimal trajectory under physical constraints, and finally renders the output video conditioned on that trajectory. The planning step is cast as a geometric optimization problem on a continuous manifold that unifies 3D geometry with task-dependent physical limits. A sympathetic reader would care because the same base video model can then support goal-conditioned image-to-video synthesis, dynamics editing, and counterfactual generation without retraining.

Core claim

OptiWorld brings classical optimal control into video generation at inference time: it extracts a compact task-relevant world state, formulates planning as a geometric problem on a continuous manifold that encodes both 3D structure and physical constraints, and renders the video conditioned on the resulting optimal trajectory, thereby producing outputs with preferable dynamics across goal-conditioned generation, dynamics editing, and counterfactual tasks.

What carries the argument

The optimal-control layer that converts 3D geometry and task-dependent physical constraints into a unified planning geometry on a continuous manifold.

If this is right

Goal-conditioned image-to-video generation can produce trajectories that reach specified end states while obeying smoothness and safety constraints.
Video dynamics editing can adjust an existing clip to follow a new, physically valid path without regenerating the entire sequence from scratch.
Counterfactual generation can explore alternative physically consistent outcomes from the same initial frames by varying the planned trajectory.
The same base video model can be reused across multiple control objectives without retraining or fine-tuning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The manifold formulation might allow the same control layer to be ported to other generative domains such as 3D scene synthesis or audio generation.
If the world-state extraction step proves robust, the method could reduce reliance on physics simulation during training of the underlying video model.
Real-time interactive applications could become feasible if the geometric planning step can be made fast enough for online trajectory updates.

Load-bearing premise

A compact task-relevant world state can be reliably extracted from the video model and that formulating planning as a geometric problem on a continuous manifold is sufficient to enforce all relevant physical constraints without introducing new inconsistencies.

What would settle it

Generate videos with the control layer active and observe whether the rendered motion still exhibits clear physical violations (such as interpenetration, abrupt velocity changes, or failure to reach stated goals) at a rate comparable to the uncontrolled base model.

Figures

Figures reproduced from arXiv: 2606.00499 by Daiqing Li, Jianhao Yuan, Liu He, Lu Ling, Stanley H. Chan, Xijun Wang, Yu Yuan.

**Figure 1.** Figure 1: Left: without optimal control, a video generator may produce motions that look plausible but are unsafe, not smooth, or inefficient. Right: OptiWorld addresses these failures by introducing optimal control at video inference time: it plans a physically preferable trajectory before rendering. Zoom in for details of the planned/optimized trajectory. Abstract Video generation models are becoming a scalable fo… view at source ↗

**Figure 2.** Figure 2: OptiWorld pipeline. Given an initial frame and goal, the understanding stage builds a compact 3D world state from geometry, segmentation, VLM reasoning, and 3D tracking (for video dynamics editing). The planning stage converts this world state into a continuous constraint-induced manifold, where hazards become high-cost regions, the goal becomes a basin, and the optimized path follows a smooth low-cost val… view at source ↗

**Figure 3.** Figure 3: Visual comparisons on goal-conditioned I2V. OptiWorld generates videos with better goal reaching, safety, smoothness, and efficiency. OptiWorld Wan2.1 FLF2V Source Video first frame with 3D tracks last frame first frame with 3D tracks last frame (a) Video Dynamics Editing Comparisons (b) Trajectory Comparison Between Before (dashed yellow ) and After (solid green ) Optimization. Please Zoom In [PITH_FULL_… view at source ↗

**Figure 4.** Figure 4: Visual comparisons on video dynamics editing. OptiWorld improves the source trajectory, producing smoother and more efficient motion. 5.3 Video Dynamics Editing In video dynamics editing, we compare OptiWorld with the source video and Wan2.1 First-LastFrame-to-Video (FLF2V) [7], which is a natural baseline for preserving the start and end states. In [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗

**Figure 5.** Figure 5: Data annotation interface. A Benchmark Details [PITH_FULL_IMAGE:figures/full_fig_p016_5.png] view at source ↗

**Figure 6.** Figure 6: Additional goal-conditioned image-to-video results. E More Results for Video Dynamics Editing [PITH_FULL_IMAGE:figures/full_fig_p019_6.png] view at source ↗

**Figure 7.** Figure 7: Additional video dynamics editing results. OptiWorld refines the source 3D motion before rendering, producing smoother and shorter object trajectories while keeping the first-frame scene content stable. Note that OptiWorld only relies on the first frame, prompt, and the optimized tracks for video dynamics editing. F More Results for Ablation Study [PITH_FULL_IMAGE:figures/full_fig_p020_7.png] view at source ↗

**Figure 8.** Figure 8: Additional ablation study results. H Limitations and Broader Impacts OptiWorld improves motion planning before video generation, but the final video quality is still limited by the renderer. DaS can introduce artifacts, object deformation, or imperfect texture preservation, especially when the requested motion is large or the selected object is thin, reflective, or heavily occluded. Another limitation is r… view at source ↗

**Figure 9.** Figure 9: Additional counterfactual results of goal control. Different goal sets produce different optimized paths [PITH_FULL_IMAGE:figures/full_fig_p022_9.png] view at source ↗

**Figure 10.** Figure 10: Additional counterfactual results of safety control. Lowering the safety constraint weight allows hazardous shortcuts, while the full planner avoids risky regions. 22 [PITH_FULL_IMAGE:figures/full_fig_p022_10.png] view at source ↗

read the original abstract

Video generation models are becoming a scalable form of world models, but they mainly generate plausible motion rather than proactively control or optimize the underlying dynamics. As a result, an object in the generated video may follow trajectories that are unsafe, not smooth, inefficient, or physically inconsistent. In this work, we propose \textbf{OptiWorld}, a framework that brings classical optimal control into video generation at inference time. OptiWorld first extracts a compact, task-relevant world state, then plans an optimal trajectory under physical constraints, and finally renders the video conditioned on this trajectory. We formulate planning as a geometric problem on a continuous manifold, which converts 3D geometry and task-dependent physical constraints into a unified planning geometry. By adding this optimal-control layer, OptiWorld generates videos with preferable dynamics, demonstrating strong potential in multiple tasks including goal-conditioned image-to-video generation, video dynamics editing, and counterfactual generation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

OptiWorld adds an optimal-control planning step at inference time to steer video generation toward physically consistent trajectories, but the abstract supplies no equations, algorithms, or results to show whether the manifold formulation actually works.

read the letter

The core idea is straightforward: current video world models produce motion that looks plausible but can violate physical constraints, and OptiWorld tries to fix that by extracting a compact state, solving an optimal-control problem on a geometric manifold that folds in 3D geometry and task constraints, then conditioning the generator on the resulting trajectory. The approach is applied at inference rather than during training, and the authors list three downstream uses—goal-conditioned image-to-video, dynamics editing, and counterfactual generation.

The paper correctly flags a practical limitation in existing generative world models and sketches a pipeline that brings classical control ideas into the loop. Framing planning as geometry on a continuous manifold is a clean way to unify constraints, and doing the optimization at inference keeps the base video model unchanged.

The main weakness is that nothing is shown. No equations for the manifold or the cost function appear, no algorithm for state extraction is given, and there are no experiments, baselines, or failure cases. Without those, it is impossible to judge whether the extracted state is reliable, whether the manifold actually enforces the intended constraints, or whether the conditioned rendering preserves the gains from planning. The claim of “preferable dynamics” therefore rests on the framework description alone.

This is aimed at researchers working on controllable video generation and world models for planning or simulation. A reader already thinking about inference-time control or geometric planning might pick up useful pointers, but anyone needing concrete evidence or reproducible details will find little to work with.

If the full manuscript contains derivations, implementation details, and quantitative comparisons that hold up, it would be worth sending to referees. Based on what is here, the contribution is still at the level of an untested proposal.

Referee Report

1 major / 0 minor

Summary. The paper presents OptiWorld, a framework that integrates classical optimal control into video generation at inference time. It extracts a compact, task-relevant world state from the video model, plans an optimal trajectory under physical constraints by formulating the problem as a geometric optimization on a continuous manifold, and then renders the video conditioned on this trajectory. The approach is claimed to produce videos with preferable dynamics and shows potential for goal-conditioned image-to-video generation, video dynamics editing, and counterfactual generation.

Significance. If the proposed method can be shown to reliably extract states and enforce constraints without inconsistencies, it would represent a significant advancement in video world models by enabling optimization of underlying dynamics rather than just plausible motion. This could have broad implications for applications requiring physical consistency in generated videos.

major comments (1)

[Abstract] Abstract: The central claim that OptiWorld generates videos with preferable dynamics lacks any supporting equations, algorithms, implementation details, or experimental results. Without these, it is impossible to evaluate whether the state extraction is reliable or if the geometric planning on the manifold sufficiently enforces physical constraints, which is load-bearing for the contribution.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their review and comments on our manuscript. Below we provide a point-by-point response to the major comment.

read point-by-point responses

Referee: [Abstract] Abstract: The central claim that OptiWorld generates videos with preferable dynamics lacks any supporting equations, algorithms, implementation details, or experimental results. Without these, it is impossible to evaluate whether the state extraction is reliable or if the geometric planning on the manifold sufficiently enforces physical constraints, which is load-bearing for the contribution.

Authors: The abstract is written as a concise summary of the contribution. The full manuscript details the state extraction procedure, formulates the planning problem as geometric optimization on a continuous manifold that unifies 3D geometry with task-dependent physical constraints, describes the inference-time optimal-control layer, and reports experimental results across goal-conditioned image-to-video generation, dynamics editing, and counterfactual tasks that demonstrate videos with preferable dynamics. These elements are presented in the methods and experiments sections and directly support evaluation of state-extraction reliability and constraint enforcement. revision: no

Circularity Check

0 steps flagged

No significant circularity

full rationale

The provided abstract and description outline a high-level framework that extracts a world state, applies classical optimal control for trajectory planning on a manifold, and conditions video rendering on the result. No equations, algorithms, fitted parameters, or self-citations are present in the visible text that would allow any derivation step to reduce to its own inputs by construction. The central claims invoke external concepts from optimal control and geometry without redefining them in terms of the paper's outputs or renaming known results. This is the expected self-contained case when no load-bearing internal chain is exhibited.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; the planning step implicitly assumes the manifold formulation captures constraints without additional fitting details.

pith-pipeline@v0.9.1-grok · 5697 in / 1076 out tokens · 22690 ms · 2026-06-28T19:00:54.920545+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

91 extracted references · 45 canonical work pages · 18 internal anchors

[1]

World Models

David Ha and Jürgen Schmidhuber, “World models,”arXiv preprint arXiv:1803.10122, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[2]

Towards video world models,

Xun Huang, “Towards video world models,” 2025

2025
[3]

Video generation models as world simulators,

OpanAI, “Video generation models as world simulators,” https://openai.com/index/ video-generation-models-as-world-simulators/
[4]

Veo 3.1: Our state-of-the-art video generation model,

Google, “Veo 3.1: Our state-of-the-art video generation model,” https://aistudio.google.com/ models/veo-3/, 2025

2025
[5]

Seedance 2.0: Advancing Video Generation for World Complexity

Team Seedance, “Seedance 2.0: Advancing video generation for world complexity,”arXiv preprint arXiv:2604.14148, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[6]

Hunyuanworld 1.0: Generating immersive, explorable, and interactive 3d worlds from words or pixels.arXiv preprint arXiv:2507.21809, 2025

Team HunyuanWorld, “Hunyuanworld 1.0: Generating immersive, explorable, and interactive 3d worlds from words or pixels,”arXiv preprint arXiv:2507.21809, 2025

work page arXiv 2025
[7]

Wan: Open and Advanced Large-Scale Video Generative Models

Team Wan, “Wan: Open and advanced large-scale video generative models,”arXiv preprint arXiv:2503.20314, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[8]

Genie 3: A new frontier for world models,

Philip J. Ball, Jakob Bauer, Frank Belletti, Bethanie Brownfield, Ariel Ephrat, Shlomi Fruchter, Agrim Gupta, Kristian Holsheimer, Aleksander Holynski, Jiri Hron, Christos Kaplanis, Marjorie Limont, Matt McGill, Yanko Oliveira, Jack Parker-Holder, Frank Perbet, Guy Scully, Jeremy Shar, Stephen Spencer, Omer Tov, Ruben Villegas, Emma Wang, Jessica Yung, Ci...

2025
[9]

Matrix-Game 3.0: Real-Time and Streaming Interactive World Model with Long-Horizon Memory

Zile Wang, Zexiang Liu, Jiaxing Li, Kaichen Huang, Baixin Xu, Fei Kang, Mengyin An, Peiyu Wang, Biao Jiang, Yichen Wei, et al., “Matrix-game 3.0: Real-time and streaming interactive world model with long-horizon memory,”arXiv preprint arXiv:2604.08995, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[10]

Lyra 2.0: Explorable Generative 3D Worlds

Tianchang Shen, Sherwin Bahmani, Kai He, Sangeetha Grama Srinivasan, Tianshi Cao, Jiawei Ren, Ruilong Li, Zian Wang, Nicholas Sharp, Zan Gojcic, Sanja Fidler, Jiahui Huang, Huan Ling, Jun Gao, and Xuanchi Ren, “Lyra 2.0: Explorable generative 3d worlds,”arXiv preprint arXiv:2604.13036, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[11]

World Action Models are Zero-shot Policies

Seonghyeon Ye, Yunhao Ge, Kaiyuan Zheng, Shenyuan Gao, Sihyun Yu, George Kurian, Suneel Indupuru, You Liang Tan, Chuning Zhu, Jiannan Xiang, Ayaan Malik, Kyungmin Lee, William Liang, Nadun Ranawaka, Jiasheng Gu, Yinzhen Xu, Guanzhi Wang, Fengyuan Hu, Avnish Narayan, Johan Bjorck, Jing Wang, Gwanghyun Kim, Dantong Niu, Ruijie Zheng, Yuqi Xie, Jimmy Wu, Qi ...

work page internal anchor Pith review Pith/arXiv arXiv 2026
[12]

Cosmos World Foundation Model Platform for Physical AI

NVIDIA, “World simulation with video foundation models for physical ai,”arXiv preprint arXiv:2501.03575, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[13]

Cosmos policy: Fine-tuning video models for visuomotor control and planning,

Moo Jin Kim, Yihuai Gao, Tsung-Yi Lin, Yen-Chen Lin, Yunhao Ge, Grace Lam, Percy Liang, Shuran Song, Ming-Yu Liu, Chelsea Finn, and Jinwei Gu, “Cosmos policy: Fine-tuning video models for visuomotor control and planning,” inInternational Conference on Learning Representations (ICLR), 2026

2026
[14]

Evaluating gemini robotics policies in a veo world simulator,

Gemini Robotics Team, “Evaluating gemini robotics policies in a veo world simulator,”arXiv preprint arXiv:2512.10675, 2025. 10

work page arXiv 2025
[15]

Diffusion models are real-time game engines,

Dani Valevski, Yaniv Leviathan, Moab Arar, and Shlomi Fruchter, “Diffusion models are real-time game engines,” inInternational Conference on Learning Representations, 2025

2025
[16]

Hunyuan-gamecraft-2: Instruction-following interactive game world model

Junshu Tang, Jiacheng Liu, Jiaqi Li, Longhuang Wu, Haoyu Yang, Penghao Zhao, Siruis Gong, Xiang Yuan, Shuai Shao, and Qinglin Lu, “Hunyuan-gamecraft-2: Instruction-following interactive game world model,”arXiv preprint arXiv:2511.23429, 2025

work page arXiv 2025
[17]

Solaris: Building a multiplayer video world model in minecraft,

Georgy Savva, Oscar Michel, Daohan Lu, Suppakit Waiwitlikhit, Timothy Meehan, Dhairya Mishra, Srivats Poddar, Jack Lu, and Saining Xie, “Solaris: Building a multiplayer video world model in minecraft,” arXiv preprint arXiv:2602.22208, 2026

work page arXiv 2026
[18]

Diffusion for world modeling: Visual details matter in atari,

Eloi Alonso, Adam Jelley, Vincent Micheli, Anssi Kanervisto, Amos Storkey, Tim Pearce, and François Fleuret, “Diffusion for world modeling: Visual details matter in atari,” inThirty-eighth Conference on Neural Information Processing Systems, 2024

2024
[19]

Video models are zero-shot learners and reasoners

Thaddäus Wiedemer, Yuxuan Li, Paul Vicol, Shixiang Shane Gu, Nick Matarese, Kevin Swersky, Been Kim, Priyank Jaini, and Robert Geirhos, “Video models are zero-shot learners and reasoners,”arXiv preprint arXiv:2509.20328, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[20]

How far is video generation from world model: A physical law perspective,

Bingyi Kang, Yang Yue, Rui Lu, Zhijie Lin, Yang Zhao, Kaixin Wang, Gao Huang, and Jiashi Feng, “How far is video generation from world model: A physical law perspective,” inInternational Conference on Machine Learning, 2025

2025
[21]

Generative physical ai in vision: A survey,

Daochang Liu, Junyu Zhang, Anh-Dung Dinh, Eunbyung Park, Shichao Zhang, and Chang Xu, “Generative physical ai in vision: A survey,” 2025

2025
[22]

Do generative video models understand physical principles?,

Saman Motamed, Laura Culp, Kevin Swersky, Priyank Jaini Jaini, and Robert Geirhos, “Do generative video models understand physical principles?,” 2025

2025
[23]

Bansal, C

Hritik Bansal, Clark Peng, Yonatan Bitton, Roman Goldenberg, Aditya Grover, and Kai-Wei Chang, “VideoPhy-2: A challenging action-centric physical commonsense evaluation in video generation,”arXiv preprint arXiv:2503.06800, 2025

work page arXiv 2025
[24]

Towards world simulator: Crafting physical commonsense-based benchmark for video generation,

Fanqing Meng, Jiaqi Liao, Xinyu Tan, Wenqi Shao, Quanfeng Lu, Kaipeng Zhang, Cheng Yu, Dianqi Li, Yu Qiao, and Ping Luo, “Towards world simulator: Crafting physical commonsense-based benchmark for video generation,” inInternational Conference on Machine Learning, 2025

2025
[25]

Morpheus: Benchmarking physical reasoning of video generative models with real physical experiments.arXiv preprint arXiv:2504.02918, 2025

Chenyu Zhang, Daniil Cherniavskii, Andrii Zadaianchuk, Antonios Tragoudaras, Antonios V ozikis, Thij- men Nijdam, Derck W. E. Prinzhorn, Mark Bodracska, Nicu Sebe, and Efstratios Gavves, “Morpheus: Benchmarking physical reasoning of video generative models with real physical experiments,”arXiv preprint arXiv:2504.02918, 2025

work page arXiv 2025
[26]

Pisa experiments: Exploring physics post-training for video diffusion models by watching stuff drop,

Chenyu Li, Oscar Michel, Xichen Pan, Sainan Liu, Mike Roberts, and Saining Xie, “Pisa experiments: Exploring physics post-training for video diffusion models by watching stuff drop,” inInternational Conference on Machine Learning, 2025

2025
[27]

World-in-world: World models in a closed-loop world,

Jiahan Zhang, Muqing Jiang, Nanru Dai, Taiming Lu, Arda Uzunoglu, Shunchi Zhang, Yana Wei, Jiahao Wang, Vishal M. Patel, Paul Pu Liang, Daniel Khashabi, Cheng Peng, Rama Chellappa, Tianmin Shu, Alan Yuille, Yilun Du, and Jieneng Chen, “World-in-world: World models in a closed-loop world,”arXiv preprint arXiv:2510.18135, 2025

work page arXiv 2025
[28]

Gpt4motion: Scripting physical motions in text-to-video generation via blender-oriented gpt planning,

Jiaxi Lv, Yi Huang Huang, Mingfu Yan, Jiancheng Huang, Jianzhuang Liu, Yifan Liu Liu, Yafei Wen, Xiaoxin Chen, and Shifeng Chen, “Gpt4motion: Scripting physical motions in text-to-video generation via blender-oriented gpt planning,” inIEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

2024
[29]

Vlipp: Towards physically plausible video generation with vision and language informed physical prior,

Xindi Yang, Baolu Li, Yiming Zhang, Zhenfei Yin, Lei Bai, Liqian Ma, Zhiyong Wang, Jianfei Cai, Tien-Tsin Wong, Huchuan Lu, and Xu Jia, “Vlipp: Towards physically plausible video generation with vision and language informed physical prior,”arXiv preprint arXiv:2503.23368, 2025

work page arXiv 2025
[30]

Motion modes: What could happen next?,

Karran Pandey, Matheus Gadelha, Yannick Hold-Geoffroy, Karan Singh, Niloy J. Mitra, and Paul Guerrero, “Motion modes: What could happen next?,” inIEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

2025
[31]

PhyT2V: Llm-guided iterative self-refinement for physics-grounded text-to-video generation,

Qiyao Xue, Xiangyu Yin, Boyuan Yang, and Wei Gao, “PhyT2V: Llm-guided iterative self-refinement for physics-grounded text-to-video generation,” inIEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025. 11

2025
[32]

WISA: World simulator assistant for physics- aware text-to-video generation,

Jing Wang, Ao Ma, Ke Cao, Jun Zheng, Zhanjie Zhang, Jiasong Feng, Shanyuan Liu, Yuhang Ma, Bo Cheng, Dawei Leng, Yuhui Yin, and Xiaodan Liang, “WISA: World simulator assistant for physics- aware text-to-video generation,”arXiv preprint arXiv:2502.08153, 2025

work page arXiv 2025
[33]

Videojam: Joint appearance-motion representations for enhanced motion generation in video models,

Hila Chefer, Uriel Singer, Amit Zohar, Yuval Kirstain, Adam Polyak, Yaniv Taigman, Lior Wolf, and Shelly Sheynin, “Videojam: Joint appearance-motion representations for enhanced motion generation in video models,”arXiv preprint arXiv:2502.02492, 2025

work page arXiv 2025
[34]

NewtonGen: Physics-consistent and controllable text-to-video generation via neural newtonian dynamics,

Yu Yuan, Xijun Wang, Tharindu Wickremasinghe, Zeeshan Nadir, Bole Ma, and Stanley H. Chan, “NewtonGen: Physics-consistent and controllable text-to-video generation via neural newtonian dynamics,” arXiv preprint arXiv: 2509.21309, 2025

work page arXiv 2025
[35]

SeeU: Seeing the unseen world via 4d dynamics-aware generation,

Yu Yuan, Tharindu Wickremasinghe, Zeeshan Nadir, Xijun Wang, Yiheng Chi, and Stanley H. Chan, “SeeU: Seeing the unseen world via 4d dynamics-aware generation,”arXiv preprint arXiv: 2512.03350, 2025

work page arXiv 2025
[36]

Stengel,Optimal Control and Estimation, Dover Publications, New York, USA, 1994

Robert F. Stengel,Optimal Control and Estimation, Dover Publications, New York, USA, 1994

1994
[37]

Levitor: 3d trajectory oriented image-to-video synthesis,

Hanlin Wang, Hao Ouyang, Qiuyu Wang, Wen Wang, Ka Leong Cheng, Qifeng Chen, Yujun Shen, and Limin Wang, “Levitor: 3d trajectory oriented image-to-video synthesis,” inIEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

2025
[38]

Go-with-the-flow: Motion-controllable video diffusion models using real-time warped noise,

Ryan Burgert, Yuancheng Xu, Wenqi Xian, Oliver Pilarski, Pascal Clausen, Mingming He, Li Ma, Yitong Deng, Lingxiao Li, Mohsen Mousavi, Michael Ryoo, Paul Debevec, and Ning Yu, “Go-with-the-flow: Motion-controllable video diffusion models using real-time warped noise,” inIEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

2025
[39]

Diffusion as shader: 3d-aware video diffusion for versatile video generation control,

Zekai Gu, Rui Yan, Jiahao Lu, Peng Li, Zhiyang Dou, Chenyang Si, Zhen Dong, Qifeng Liu, Cheng Lin, Ziwei Liu, Wenping Wang, and Yuan Liu, “Diffusion as shader: 3d-aware video diffusion for versatile video generation control,”arXiv preprint arXiv:2501.03847, 2025

work page arXiv 2025
[40]

arXiv preprint arXiv:2505.22944 (2025) 12

Angtian Wang, Haibin Huang, Zhiyuan Fang, Yiding Yang, and Chongyang Ma, “ATI: Any trajectory instruction for controllable video generation,”arXiv preprint arXiv:2505.22944, 2025

work page arXiv 2025
[41]

Perception-as-control: Fine-grained controllable image animation with 3d-aware motion representation,

Yingjie Chen, Yifang Men, Yuan Yao, Miaomiao Cui, and Liefeng Bo, “Perception-as-control: Fine-grained controllable image animation with 3d-aware motion representation,”arXiv preprint arXiv:2501.05020, 2025

work page arXiv 2025
[42]

Motioncanvas: Cinematic shot design with controllable image-to-video generation,

Jinbo Xing, Long Mai, Cusuh Ham, Jiahui Huang, Aniruddha Mahapatra, Chi-Wing Fu, Tien-Tsin Wong, and Feng Liu, “Motioncanvas: Cinematic shot design with controllable image-to-video generation,”arXiv preprint arXiv:2502.04299, 2025

work page arXiv 2025
[43]

Generative video motion editing with 3d point tracks,

Yao-Chih Lee, Zhoutong Zhang, Jiahui Huang, Jui-Hsien Wang, Joon-Young Lee, Jia-Bin Huang, Eli Shechtman, and Zhengqi Li, “Generative video motion editing with 3d point tracks,”arXiv preprint arXiv:2512.02015, 2025

work page arXiv 2025
[44]

Motionv2v: Editing motion in a video,

Ryan Burgert, Charles Herrmann, Forrester Cole, Michael S Ryoo, Neal Wadhwa, Andrey V oynov, and Nataniel Ruiz, “Motionv2v: Editing motion in a video,”arXiv preprint arXiv:2511.20640, 2025

work page arXiv 2025
[45]

MoGe-2: Accurate Monocular Geometry with Metric Scale and Sharp Details

Ruicheng Wang, Sicheng Xu, Yue Dong, Yu Deng, Jianfeng Xiang, Zelong Lv, Guangzhong Sun, Xin Tong, and Jiaolong Yang, “Moge-2: Accurate monocular geometry with metric scale and sharp details,” arXiv preprint arXiv:2507.02546, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[46]

SAM 3: Segment Anything with Concepts

Nicolas Carion, Laura Gustafson, Yuan-Ting Hu, Shubham Debnath, Ronghang Hu, Dídac Suris, Chaitanya Ryali, Kalyan Vasudev Alwala, Haitham Khedr, Anqi Huang, Jiawen Lei, Tengyu Ma, Bowen Guo, Aditya Kalla, Matthew Marks, John Greer, Muxin Wang, Peize Sun, Roman Rädle, Triantafyllos Afouras, Eleni Mavroudi, Kuan Xu, Tsung-Han Wu, Yitian Zhou, Lily Momeni, R...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[47]

Qwen2.5-vl,

Qwen Team, “Qwen2.5-vl,” 2025

2025
[48]

VLA-AN: An efficient and onboard vision-language-action framework for aerial navigation in complex environments,

Yuze Wu, Mo Zhu, Xingxing Li, Yuheng Du, Yuxin Fan, Wenjun Li, Zhichao Han, Xin Zhou, and Fei Gao, “VLA-AN: An efficient and onboard vision-language-action framework for aerial navigation in complex environments,”arXiv preprint arXiv:2512.15258, 2025

work page arXiv 2025
[49]

Diffog: Differentiable policy trajectory optimization with generalizability,

Zhengtong Xu, Zichen Miao, Qiang Qiu, Zhe Zhang, and Yu She, “Diffog: Differentiable policy trajectory optimization with generalizability,”IEEE Transactions on Robotics, 2026. 12

2026
[50]

Sampling-based algorithms for optimal motion planning,

Sertac Karaman and Emilio Frazzoli, “Sampling-based algorithms for optimal motion planning,”The International Journal of Robotics Research (IJRR), 2011

2011
[51]

A formal basis for the heuristic determination of minimum cost paths,

Peter E. Hart, Nils J. Nilsson, and Bertram Raphael, “A formal basis for the heuristic determination of minimum cost paths,”IEEE Transactions on Systems Science and Cybernetics, 1968

1968
[52]

Rawlings, D.Q

J. Rawlings, D.Q. Mayne, and Moritz Diehl,Model Predictive Control: Theory, Computation, and Design, 2017

2017
[53]

Francesco Bullo and Andrew D Lewis,Geometric control of mechanical systems: modeling, analysis, and design for simple mechanical control systems, 2005

2005
[54]

Denoising diffusion probabilistic models,

Jonathan Ho, Ajay Jain, and Pieter Abbeel, “Denoising diffusion probabilistic models,” inAdvances in Neural Information Processing Systems, 2020, p. 6840–6851

2020
[55]

Score-based generative modeling through stochastic differential equations,

Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole, “Score-based generative modeling through stochastic differential equations,” inInternational Conference on Learning Representations, 2021

2021
[56]

Scaling Laws for Neural Language Models

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei, “Scaling laws for neural language models,”arXiv preprint arXiv: 2001.08361, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2001
[57]

Scalable diffusion models with transformers,

William Peebles and Saining Xie, “Scalable diffusion models with transformers,” inInternational Conference on Computer Vision, 2023

2023
[58]

Cogvideox: Text-to-video diffusion models with an expert transformer,

Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al., “Cogvideox: Text-to-video diffusion models with an expert transformer,” inInternational Conference on Learning Representations, 2025

2025
[59]

HunyuanVideo: A Systematic Framework For Large Video Generative Models

Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al., “Hunyuanvideo: A systematic framework for large video generative models,”arXiv preprint arXiv:2412.03603, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[60]

Worldscore: A unified evaluation benchmark for world generation.arXiv preprint arXiv:2504.00983, 2025

Haoyi Duan, Hong-Xing Yu, Sirui Chen, Li Fei-Fei, and Jiajun Wu, “WorldScore: A unified evaluation benchmark for world generation,”arXiv preprint arXiv:2504.00983, 2025

work page arXiv 2025
[61]

Likephys: Evaluating intuitive physics understanding in video diffusion models via likelihood preference,

Jianhao Yuan, Fabio Pizzati, Francesco Pinto, Lars Kunze, Ivan Laptev, Paul Newman, Philip Torr, and Daniele De Martini, “Likephys: Evaluating intuitive physics understanding in video diffusion models via likelihood preference,” inInternational Conference on Learning Representations, 2026

2026
[62]

CameraCtrl: Enabling Camera Control for Text-to-Video Generation

Hao He, Yinghao Xu, Yuwei Guo, Gordon Wetzstein, Bo Dai, Hongsheng Li, and Ceyuan Yang, “Camer- aCtrl: Enabling camera control for text-to-video generation,”arXiv preprint arXiv:2404.02101, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[63]

CamCo: Camera-Controllable 3D-Consistent Image-to-Video Generation

Dejia Xu, Weili Nie, Chao Liu, Sifei Liu, Jan Kautz, Zhangyang Wang, and Arash Vahdat, “CamCo: Camera-controllable 3d-consistent image-to-video generation,”arXiv preprint arXiv:2406.02509, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[64]

Training-free camera control for video generation,

Chen Hou, Guoqiang Wei, Yan Zeng, and Zhibo Chen, “Training-free camera control for video generation,” arXiv preprint arXiv:2406.10126, 2024

work page arXiv 2024
[65]

Cavia: Camera-controllable multi-view video diffusion with view-integrated attention,

Dejia Xu, Yifan Jiang, Chen Huang, Liangchen Song, Thorsten Gernoth, Liangliang Cao, Zhangyang Wang, and Hao Tang, “Cavia: Camera-controllable multi-view video diffusion with view-integrated attention,”arXiv preprint arXiv:2410.10774, 2024

work page arXiv 2024
[66]

Motionctrl: A unified and flexible motion controller for video generation,

Zhouxia Wang, Ziyang Yuan, Xintao Wang, Yaowei Li, Tianshui Chen, Menghan Xia, Ping Luo, and Ying Shan, “Motionctrl: A unified and flexible motion controller for video generation,” inACM SIGGRAPH 2024 Conference Papers, 2024, pp. 1–11

2024
[67]

E.T. the exceptional trajec- tories: Text-to-camera-trajectory generation with character awareness,

Robin Courant, Nicolas Dufour, Xi Wang, Marc Christie, and Vicky Kalogeiton, “E.T. the exceptional trajec- tories: Text-to-camera-trajectory generation with character awareness,”arXiv preprint arXiv:2407.01516, 2024

work page arXiv 2024
[68]

Tora: Trajectory-oriented diffusion transformer for video generation,

Zhenghao Zhang, Junchao Liao, Menghao Li, Zuozhuo Dai, Bingxue Qiu, Siyu Zhu, Long Qin, and Weizhi Wang, “Tora: Trajectory-oriented diffusion transformer for video generation,” inIEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

2025
[69]

Tora2: Motion and appearance customized diffusion transformer for multi-entity video generation,

Zhenghao Zhang, Junchao Liao, Xiangyu Meng, Long Qin, and Weizhi Wang, “Tora2: Motion and appearance customized diffusion transformer for multi-entity video generation,” inACM International Conference on Multimedia, 2025. 13

2025
[70]

Motion prompting: Controlling video generation with motion trajectories,

Daniel Geng, Charles Herrmann, Junhwa Hur, Forrester Cole, Serena Zhang, Tobias Pfaff, Tatiana Lopez- Guevara, Carl Doersch, Yusuf Aytar, Michael Rubinstein, Chen Sun, Oliver Wang, Andrew Owens, and Deqing Sun, “Motion prompting: Controlling video generation with motion trajectories,”arXiv preprint arXiv:2412.02700, 2024

work page arXiv 2024
[71]

Phys4dgen: Physics-compliant 4d generation with multi-material composition perception

Jiajing Lin, Zhenzhong Wang, Shu Jiang, Yongjie Hou, and Min Jiang, “Phys4dgen: A physics-driven framework for controllable and efficient 4d content generation from a single image,”arXiv preprint arXiv:2411.16800, 2024

work page arXiv 2024
[72]

Physgaus- sian: Physics-integrated 3d gaussians for generative dynamics,

Tianyi Xie, Zeshun Zong, Yuxing Qiu, Xuan Li, Yutao Feng, Yin Yang, and Chenfanfu Jiang, “Physgaus- sian: Physics-integrated 3d gaussians for generative dynamics,” inIEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

2024
[73]

Physmotion: Physics-grounded dynamics from a single image,

Xiyang Tan, Ying Jiang, Xuan Li, Zeshun Zong, Tianyi Xie, Yin Yang, and Chenfanfu Jiang, “Physmotion: Physics-grounded dynamics from a single image,”arXiv preprint arXiv:2411.17189, 2024

work page arXiv 2024
[74]

PhysDreamer: Physics-based interaction with 3d objects via video generation,

Tianyuan Zhang, Hong-Xing Yu, Rundi Wu, Brandon Y . Feng, Changxi Zheng, Noah Snavely, Jiajun Wu, and William T. Freeman, “PhysDreamer: Physics-based interaction with 3d objects via video generation,” inEuropean Conference on Computer Vision, 2024

2024
[75]

Autovfx: Physically realistic video editing from natural language instructions,

Hao-Yu Hsu, Zhi-Hao Lin, Albert Zhai, Hongchi Xia, and Shenlong Wang, “Autovfx: Physically realistic video editing from natural language instructions,”arXiv preprint arXiv:2411.02394, 2024

work page arXiv 2024
[76]

Physdiff: Physics-guided human motion diffusion model,

Ye Yuan, Jiaming Song, Umar Iqbal, Arash Vahdat, and Jan Kautz, “Physdiff: Physics-guided human motion diffusion model,” inInternational Conference on Computer Vision, 2023

2023
[77]

Physgen: Rigid-body physics- grounded image-to-video generation,

Shaowei Liu, Zhongzheng Ren, Saurabh Gupta, and Shenlong Wang, “Physgen: Rigid-body physics- grounded image-to-video generation,” inEuropean Conference on Computer Vision, 2024

2024
[78]

Motioncraft: Physics-based zero-shot video generation,

Luca Savant Aira, Antonio Montanaro, Emanuele Aiello, Diego Valsesia, and Enrico Magli, “Motioncraft: Physics-based zero-shot video generation,” inAdvances in Neural Information Processing Systems, 2024

2024
[79]

Physgen3d: Crafting a miniature interactive world from a single image,

Boyuan Chen, Hanxiao Jiang, Shaowei Liu, Saurabh Gupta, Yunzhu Li, Hao Zhao, and Shenlong Wang, “Physgen3d: Crafting a miniature interactive world from a single image,” inIEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

2025
[80]

Physanimator: Physics-guided generative cartoon animation,

Tianyi Xie, Yiwei Zhao, Ying Jiang, and Chenfanfu Jiang, “Physanimator: Physics-guided generative cartoon animation,” inIEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

2025

Showing first 80 references.

[1] [1]

World Models

David Ha and Jürgen Schmidhuber, “World models,”arXiv preprint arXiv:1803.10122, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[2] [2]

Towards video world models,

Xun Huang, “Towards video world models,” 2025

2025

[3] [3]

Video generation models as world simulators,

OpanAI, “Video generation models as world simulators,” https://openai.com/index/ video-generation-models-as-world-simulators/

[4] [4]

Veo 3.1: Our state-of-the-art video generation model,

Google, “Veo 3.1: Our state-of-the-art video generation model,” https://aistudio.google.com/ models/veo-3/, 2025

2025

[5] [5]

Seedance 2.0: Advancing Video Generation for World Complexity

Team Seedance, “Seedance 2.0: Advancing video generation for world complexity,”arXiv preprint arXiv:2604.14148, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[6] [6]

Hunyuanworld 1.0: Generating immersive, explorable, and interactive 3d worlds from words or pixels.arXiv preprint arXiv:2507.21809, 2025

Team HunyuanWorld, “Hunyuanworld 1.0: Generating immersive, explorable, and interactive 3d worlds from words or pixels,”arXiv preprint arXiv:2507.21809, 2025

work page arXiv 2025

[7] [7]

Wan: Open and Advanced Large-Scale Video Generative Models

Team Wan, “Wan: Open and advanced large-scale video generative models,”arXiv preprint arXiv:2503.20314, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[8] [8]

Genie 3: A new frontier for world models,

Philip J. Ball, Jakob Bauer, Frank Belletti, Bethanie Brownfield, Ariel Ephrat, Shlomi Fruchter, Agrim Gupta, Kristian Holsheimer, Aleksander Holynski, Jiri Hron, Christos Kaplanis, Marjorie Limont, Matt McGill, Yanko Oliveira, Jack Parker-Holder, Frank Perbet, Guy Scully, Jeremy Shar, Stephen Spencer, Omer Tov, Ruben Villegas, Emma Wang, Jessica Yung, Ci...

2025

[9] [9]

Matrix-Game 3.0: Real-Time and Streaming Interactive World Model with Long-Horizon Memory

Zile Wang, Zexiang Liu, Jiaxing Li, Kaichen Huang, Baixin Xu, Fei Kang, Mengyin An, Peiyu Wang, Biao Jiang, Yichen Wei, et al., “Matrix-game 3.0: Real-time and streaming interactive world model with long-horizon memory,”arXiv preprint arXiv:2604.08995, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[10] [10]

Lyra 2.0: Explorable Generative 3D Worlds

Tianchang Shen, Sherwin Bahmani, Kai He, Sangeetha Grama Srinivasan, Tianshi Cao, Jiawei Ren, Ruilong Li, Zian Wang, Nicholas Sharp, Zan Gojcic, Sanja Fidler, Jiahui Huang, Huan Ling, Jun Gao, and Xuanchi Ren, “Lyra 2.0: Explorable generative 3d worlds,”arXiv preprint arXiv:2604.13036, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[11] [11]

World Action Models are Zero-shot Policies

Seonghyeon Ye, Yunhao Ge, Kaiyuan Zheng, Shenyuan Gao, Sihyun Yu, George Kurian, Suneel Indupuru, You Liang Tan, Chuning Zhu, Jiannan Xiang, Ayaan Malik, Kyungmin Lee, William Liang, Nadun Ranawaka, Jiasheng Gu, Yinzhen Xu, Guanzhi Wang, Fengyuan Hu, Avnish Narayan, Johan Bjorck, Jing Wang, Gwanghyun Kim, Dantong Niu, Ruijie Zheng, Yuqi Xie, Jimmy Wu, Qi ...

work page internal anchor Pith review Pith/arXiv arXiv 2026

[12] [12]

Cosmos World Foundation Model Platform for Physical AI

NVIDIA, “World simulation with video foundation models for physical ai,”arXiv preprint arXiv:2501.03575, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[13] [13]

Cosmos policy: Fine-tuning video models for visuomotor control and planning,

Moo Jin Kim, Yihuai Gao, Tsung-Yi Lin, Yen-Chen Lin, Yunhao Ge, Grace Lam, Percy Liang, Shuran Song, Ming-Yu Liu, Chelsea Finn, and Jinwei Gu, “Cosmos policy: Fine-tuning video models for visuomotor control and planning,” inInternational Conference on Learning Representations (ICLR), 2026

2026

[14] [14]

Evaluating gemini robotics policies in a veo world simulator,

Gemini Robotics Team, “Evaluating gemini robotics policies in a veo world simulator,”arXiv preprint arXiv:2512.10675, 2025. 10

work page arXiv 2025

[15] [15]

Diffusion models are real-time game engines,

Dani Valevski, Yaniv Leviathan, Moab Arar, and Shlomi Fruchter, “Diffusion models are real-time game engines,” inInternational Conference on Learning Representations, 2025

2025

[16] [16]

Hunyuan-gamecraft-2: Instruction-following interactive game world model

Junshu Tang, Jiacheng Liu, Jiaqi Li, Longhuang Wu, Haoyu Yang, Penghao Zhao, Siruis Gong, Xiang Yuan, Shuai Shao, and Qinglin Lu, “Hunyuan-gamecraft-2: Instruction-following interactive game world model,”arXiv preprint arXiv:2511.23429, 2025

work page arXiv 2025

[17] [17]

Solaris: Building a multiplayer video world model in minecraft,

Georgy Savva, Oscar Michel, Daohan Lu, Suppakit Waiwitlikhit, Timothy Meehan, Dhairya Mishra, Srivats Poddar, Jack Lu, and Saining Xie, “Solaris: Building a multiplayer video world model in minecraft,” arXiv preprint arXiv:2602.22208, 2026

work page arXiv 2026

[18] [18]

Diffusion for world modeling: Visual details matter in atari,

Eloi Alonso, Adam Jelley, Vincent Micheli, Anssi Kanervisto, Amos Storkey, Tim Pearce, and François Fleuret, “Diffusion for world modeling: Visual details matter in atari,” inThirty-eighth Conference on Neural Information Processing Systems, 2024

2024

[19] [19]

Video models are zero-shot learners and reasoners

Thaddäus Wiedemer, Yuxuan Li, Paul Vicol, Shixiang Shane Gu, Nick Matarese, Kevin Swersky, Been Kim, Priyank Jaini, and Robert Geirhos, “Video models are zero-shot learners and reasoners,”arXiv preprint arXiv:2509.20328, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[20] [20]

How far is video generation from world model: A physical law perspective,

Bingyi Kang, Yang Yue, Rui Lu, Zhijie Lin, Yang Zhao, Kaixin Wang, Gao Huang, and Jiashi Feng, “How far is video generation from world model: A physical law perspective,” inInternational Conference on Machine Learning, 2025

2025

[21] [21]

Generative physical ai in vision: A survey,

Daochang Liu, Junyu Zhang, Anh-Dung Dinh, Eunbyung Park, Shichao Zhang, and Chang Xu, “Generative physical ai in vision: A survey,” 2025

2025

[22] [22]

Do generative video models understand physical principles?,

Saman Motamed, Laura Culp, Kevin Swersky, Priyank Jaini Jaini, and Robert Geirhos, “Do generative video models understand physical principles?,” 2025

2025

[23] [23]

Bansal, C

Hritik Bansal, Clark Peng, Yonatan Bitton, Roman Goldenberg, Aditya Grover, and Kai-Wei Chang, “VideoPhy-2: A challenging action-centric physical commonsense evaluation in video generation,”arXiv preprint arXiv:2503.06800, 2025

work page arXiv 2025

[24] [24]

Towards world simulator: Crafting physical commonsense-based benchmark for video generation,

Fanqing Meng, Jiaqi Liao, Xinyu Tan, Wenqi Shao, Quanfeng Lu, Kaipeng Zhang, Cheng Yu, Dianqi Li, Yu Qiao, and Ping Luo, “Towards world simulator: Crafting physical commonsense-based benchmark for video generation,” inInternational Conference on Machine Learning, 2025

2025

[25] [25]

Morpheus: Benchmarking physical reasoning of video generative models with real physical experiments.arXiv preprint arXiv:2504.02918, 2025

Chenyu Zhang, Daniil Cherniavskii, Andrii Zadaianchuk, Antonios Tragoudaras, Antonios V ozikis, Thij- men Nijdam, Derck W. E. Prinzhorn, Mark Bodracska, Nicu Sebe, and Efstratios Gavves, “Morpheus: Benchmarking physical reasoning of video generative models with real physical experiments,”arXiv preprint arXiv:2504.02918, 2025

work page arXiv 2025

[26] [26]

Pisa experiments: Exploring physics post-training for video diffusion models by watching stuff drop,

Chenyu Li, Oscar Michel, Xichen Pan, Sainan Liu, Mike Roberts, and Saining Xie, “Pisa experiments: Exploring physics post-training for video diffusion models by watching stuff drop,” inInternational Conference on Machine Learning, 2025

2025

[27] [27]

World-in-world: World models in a closed-loop world,

Jiahan Zhang, Muqing Jiang, Nanru Dai, Taiming Lu, Arda Uzunoglu, Shunchi Zhang, Yana Wei, Jiahao Wang, Vishal M. Patel, Paul Pu Liang, Daniel Khashabi, Cheng Peng, Rama Chellappa, Tianmin Shu, Alan Yuille, Yilun Du, and Jieneng Chen, “World-in-world: World models in a closed-loop world,”arXiv preprint arXiv:2510.18135, 2025

work page arXiv 2025

[28] [28]

Gpt4motion: Scripting physical motions in text-to-video generation via blender-oriented gpt planning,

Jiaxi Lv, Yi Huang Huang, Mingfu Yan, Jiancheng Huang, Jianzhuang Liu, Yifan Liu Liu, Yafei Wen, Xiaoxin Chen, and Shifeng Chen, “Gpt4motion: Scripting physical motions in text-to-video generation via blender-oriented gpt planning,” inIEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

2024

[29] [29]

Vlipp: Towards physically plausible video generation with vision and language informed physical prior,

Xindi Yang, Baolu Li, Yiming Zhang, Zhenfei Yin, Lei Bai, Liqian Ma, Zhiyong Wang, Jianfei Cai, Tien-Tsin Wong, Huchuan Lu, and Xu Jia, “Vlipp: Towards physically plausible video generation with vision and language informed physical prior,”arXiv preprint arXiv:2503.23368, 2025

work page arXiv 2025

[30] [30]

Motion modes: What could happen next?,

Karran Pandey, Matheus Gadelha, Yannick Hold-Geoffroy, Karan Singh, Niloy J. Mitra, and Paul Guerrero, “Motion modes: What could happen next?,” inIEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

2025

[31] [31]

PhyT2V: Llm-guided iterative self-refinement for physics-grounded text-to-video generation,

Qiyao Xue, Xiangyu Yin, Boyuan Yang, and Wei Gao, “PhyT2V: Llm-guided iterative self-refinement for physics-grounded text-to-video generation,” inIEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025. 11

2025

[32] [32]

WISA: World simulator assistant for physics- aware text-to-video generation,

Jing Wang, Ao Ma, Ke Cao, Jun Zheng, Zhanjie Zhang, Jiasong Feng, Shanyuan Liu, Yuhang Ma, Bo Cheng, Dawei Leng, Yuhui Yin, and Xiaodan Liang, “WISA: World simulator assistant for physics- aware text-to-video generation,”arXiv preprint arXiv:2502.08153, 2025

work page arXiv 2025

[33] [33]

Videojam: Joint appearance-motion representations for enhanced motion generation in video models,

Hila Chefer, Uriel Singer, Amit Zohar, Yuval Kirstain, Adam Polyak, Yaniv Taigman, Lior Wolf, and Shelly Sheynin, “Videojam: Joint appearance-motion representations for enhanced motion generation in video models,”arXiv preprint arXiv:2502.02492, 2025

work page arXiv 2025

[34] [34]

NewtonGen: Physics-consistent and controllable text-to-video generation via neural newtonian dynamics,

Yu Yuan, Xijun Wang, Tharindu Wickremasinghe, Zeeshan Nadir, Bole Ma, and Stanley H. Chan, “NewtonGen: Physics-consistent and controllable text-to-video generation via neural newtonian dynamics,” arXiv preprint arXiv: 2509.21309, 2025

work page arXiv 2025

[35] [35]

SeeU: Seeing the unseen world via 4d dynamics-aware generation,

Yu Yuan, Tharindu Wickremasinghe, Zeeshan Nadir, Xijun Wang, Yiheng Chi, and Stanley H. Chan, “SeeU: Seeing the unseen world via 4d dynamics-aware generation,”arXiv preprint arXiv: 2512.03350, 2025

work page arXiv 2025

[36] [36]

Stengel,Optimal Control and Estimation, Dover Publications, New York, USA, 1994

Robert F. Stengel,Optimal Control and Estimation, Dover Publications, New York, USA, 1994

1994

[37] [37]

Levitor: 3d trajectory oriented image-to-video synthesis,

Hanlin Wang, Hao Ouyang, Qiuyu Wang, Wen Wang, Ka Leong Cheng, Qifeng Chen, Yujun Shen, and Limin Wang, “Levitor: 3d trajectory oriented image-to-video synthesis,” inIEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

2025

[38] [38]

Go-with-the-flow: Motion-controllable video diffusion models using real-time warped noise,

Ryan Burgert, Yuancheng Xu, Wenqi Xian, Oliver Pilarski, Pascal Clausen, Mingming He, Li Ma, Yitong Deng, Lingxiao Li, Mohsen Mousavi, Michael Ryoo, Paul Debevec, and Ning Yu, “Go-with-the-flow: Motion-controllable video diffusion models using real-time warped noise,” inIEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

2025

[39] [39]

Diffusion as shader: 3d-aware video diffusion for versatile video generation control,

Zekai Gu, Rui Yan, Jiahao Lu, Peng Li, Zhiyang Dou, Chenyang Si, Zhen Dong, Qifeng Liu, Cheng Lin, Ziwei Liu, Wenping Wang, and Yuan Liu, “Diffusion as shader: 3d-aware video diffusion for versatile video generation control,”arXiv preprint arXiv:2501.03847, 2025

work page arXiv 2025

[40] [40]

arXiv preprint arXiv:2505.22944 (2025) 12

Angtian Wang, Haibin Huang, Zhiyuan Fang, Yiding Yang, and Chongyang Ma, “ATI: Any trajectory instruction for controllable video generation,”arXiv preprint arXiv:2505.22944, 2025

work page arXiv 2025

[41] [41]

Perception-as-control: Fine-grained controllable image animation with 3d-aware motion representation,

Yingjie Chen, Yifang Men, Yuan Yao, Miaomiao Cui, and Liefeng Bo, “Perception-as-control: Fine-grained controllable image animation with 3d-aware motion representation,”arXiv preprint arXiv:2501.05020, 2025

work page arXiv 2025

[42] [42]

Motioncanvas: Cinematic shot design with controllable image-to-video generation,

Jinbo Xing, Long Mai, Cusuh Ham, Jiahui Huang, Aniruddha Mahapatra, Chi-Wing Fu, Tien-Tsin Wong, and Feng Liu, “Motioncanvas: Cinematic shot design with controllable image-to-video generation,”arXiv preprint arXiv:2502.04299, 2025

work page arXiv 2025

[43] [43]

Generative video motion editing with 3d point tracks,

Yao-Chih Lee, Zhoutong Zhang, Jiahui Huang, Jui-Hsien Wang, Joon-Young Lee, Jia-Bin Huang, Eli Shechtman, and Zhengqi Li, “Generative video motion editing with 3d point tracks,”arXiv preprint arXiv:2512.02015, 2025

work page arXiv 2025

[44] [44]

Motionv2v: Editing motion in a video,

Ryan Burgert, Charles Herrmann, Forrester Cole, Michael S Ryoo, Neal Wadhwa, Andrey V oynov, and Nataniel Ruiz, “Motionv2v: Editing motion in a video,”arXiv preprint arXiv:2511.20640, 2025

work page arXiv 2025

[45] [45]

MoGe-2: Accurate Monocular Geometry with Metric Scale and Sharp Details

Ruicheng Wang, Sicheng Xu, Yue Dong, Yu Deng, Jianfeng Xiang, Zelong Lv, Guangzhong Sun, Xin Tong, and Jiaolong Yang, “Moge-2: Accurate monocular geometry with metric scale and sharp details,” arXiv preprint arXiv:2507.02546, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[46] [46]

SAM 3: Segment Anything with Concepts

Nicolas Carion, Laura Gustafson, Yuan-Ting Hu, Shubham Debnath, Ronghang Hu, Dídac Suris, Chaitanya Ryali, Kalyan Vasudev Alwala, Haitham Khedr, Anqi Huang, Jiawen Lei, Tengyu Ma, Bowen Guo, Aditya Kalla, Matthew Marks, John Greer, Muxin Wang, Peize Sun, Roman Rädle, Triantafyllos Afouras, Eleni Mavroudi, Kuan Xu, Tsung-Han Wu, Yitian Zhou, Lily Momeni, R...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[47] [47]

Qwen2.5-vl,

Qwen Team, “Qwen2.5-vl,” 2025

2025

[48] [48]

VLA-AN: An efficient and onboard vision-language-action framework for aerial navigation in complex environments,

Yuze Wu, Mo Zhu, Xingxing Li, Yuheng Du, Yuxin Fan, Wenjun Li, Zhichao Han, Xin Zhou, and Fei Gao, “VLA-AN: An efficient and onboard vision-language-action framework for aerial navigation in complex environments,”arXiv preprint arXiv:2512.15258, 2025

work page arXiv 2025

[49] [49]

Diffog: Differentiable policy trajectory optimization with generalizability,

Zhengtong Xu, Zichen Miao, Qiang Qiu, Zhe Zhang, and Yu She, “Diffog: Differentiable policy trajectory optimization with generalizability,”IEEE Transactions on Robotics, 2026. 12

2026

[50] [50]

Sampling-based algorithms for optimal motion planning,

Sertac Karaman and Emilio Frazzoli, “Sampling-based algorithms for optimal motion planning,”The International Journal of Robotics Research (IJRR), 2011

2011

[51] [51]

A formal basis for the heuristic determination of minimum cost paths,

Peter E. Hart, Nils J. Nilsson, and Bertram Raphael, “A formal basis for the heuristic determination of minimum cost paths,”IEEE Transactions on Systems Science and Cybernetics, 1968

1968

[52] [52]

Rawlings, D.Q

J. Rawlings, D.Q. Mayne, and Moritz Diehl,Model Predictive Control: Theory, Computation, and Design, 2017

2017

[53] [53]

Francesco Bullo and Andrew D Lewis,Geometric control of mechanical systems: modeling, analysis, and design for simple mechanical control systems, 2005

2005

[54] [54]

Denoising diffusion probabilistic models,

Jonathan Ho, Ajay Jain, and Pieter Abbeel, “Denoising diffusion probabilistic models,” inAdvances in Neural Information Processing Systems, 2020, p. 6840–6851

2020

[55] [55]

Score-based generative modeling through stochastic differential equations,

Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole, “Score-based generative modeling through stochastic differential equations,” inInternational Conference on Learning Representations, 2021

2021

[56] [56]

Scaling Laws for Neural Language Models

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei, “Scaling laws for neural language models,”arXiv preprint arXiv: 2001.08361, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2001

[57] [57]

Scalable diffusion models with transformers,

William Peebles and Saining Xie, “Scalable diffusion models with transformers,” inInternational Conference on Computer Vision, 2023

2023

[58] [58]

Cogvideox: Text-to-video diffusion models with an expert transformer,

Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al., “Cogvideox: Text-to-video diffusion models with an expert transformer,” inInternational Conference on Learning Representations, 2025

2025

[59] [59]

HunyuanVideo: A Systematic Framework For Large Video Generative Models

Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al., “Hunyuanvideo: A systematic framework for large video generative models,”arXiv preprint arXiv:2412.03603, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[60] [60]

Worldscore: A unified evaluation benchmark for world generation.arXiv preprint arXiv:2504.00983, 2025

Haoyi Duan, Hong-Xing Yu, Sirui Chen, Li Fei-Fei, and Jiajun Wu, “WorldScore: A unified evaluation benchmark for world generation,”arXiv preprint arXiv:2504.00983, 2025

work page arXiv 2025

[61] [61]

Likephys: Evaluating intuitive physics understanding in video diffusion models via likelihood preference,

Jianhao Yuan, Fabio Pizzati, Francesco Pinto, Lars Kunze, Ivan Laptev, Paul Newman, Philip Torr, and Daniele De Martini, “Likephys: Evaluating intuitive physics understanding in video diffusion models via likelihood preference,” inInternational Conference on Learning Representations, 2026

2026

[62] [62]

CameraCtrl: Enabling Camera Control for Text-to-Video Generation

Hao He, Yinghao Xu, Yuwei Guo, Gordon Wetzstein, Bo Dai, Hongsheng Li, and Ceyuan Yang, “Camer- aCtrl: Enabling camera control for text-to-video generation,”arXiv preprint arXiv:2404.02101, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[63] [63]

CamCo: Camera-Controllable 3D-Consistent Image-to-Video Generation

Dejia Xu, Weili Nie, Chao Liu, Sifei Liu, Jan Kautz, Zhangyang Wang, and Arash Vahdat, “CamCo: Camera-controllable 3d-consistent image-to-video generation,”arXiv preprint arXiv:2406.02509, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[64] [64]

Training-free camera control for video generation,

Chen Hou, Guoqiang Wei, Yan Zeng, and Zhibo Chen, “Training-free camera control for video generation,” arXiv preprint arXiv:2406.10126, 2024

work page arXiv 2024

[65] [65]

Cavia: Camera-controllable multi-view video diffusion with view-integrated attention,

Dejia Xu, Yifan Jiang, Chen Huang, Liangchen Song, Thorsten Gernoth, Liangliang Cao, Zhangyang Wang, and Hao Tang, “Cavia: Camera-controllable multi-view video diffusion with view-integrated attention,”arXiv preprint arXiv:2410.10774, 2024

work page arXiv 2024

[66] [66]

Motionctrl: A unified and flexible motion controller for video generation,

Zhouxia Wang, Ziyang Yuan, Xintao Wang, Yaowei Li, Tianshui Chen, Menghan Xia, Ping Luo, and Ying Shan, “Motionctrl: A unified and flexible motion controller for video generation,” inACM SIGGRAPH 2024 Conference Papers, 2024, pp. 1–11

2024

[67] [67]

E.T. the exceptional trajec- tories: Text-to-camera-trajectory generation with character awareness,

Robin Courant, Nicolas Dufour, Xi Wang, Marc Christie, and Vicky Kalogeiton, “E.T. the exceptional trajec- tories: Text-to-camera-trajectory generation with character awareness,”arXiv preprint arXiv:2407.01516, 2024

work page arXiv 2024

[68] [68]

Tora: Trajectory-oriented diffusion transformer for video generation,

Zhenghao Zhang, Junchao Liao, Menghao Li, Zuozhuo Dai, Bingxue Qiu, Siyu Zhu, Long Qin, and Weizhi Wang, “Tora: Trajectory-oriented diffusion transformer for video generation,” inIEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

2025

[69] [69]

Tora2: Motion and appearance customized diffusion transformer for multi-entity video generation,

Zhenghao Zhang, Junchao Liao, Xiangyu Meng, Long Qin, and Weizhi Wang, “Tora2: Motion and appearance customized diffusion transformer for multi-entity video generation,” inACM International Conference on Multimedia, 2025. 13

2025

[70] [70]

Motion prompting: Controlling video generation with motion trajectories,

Daniel Geng, Charles Herrmann, Junhwa Hur, Forrester Cole, Serena Zhang, Tobias Pfaff, Tatiana Lopez- Guevara, Carl Doersch, Yusuf Aytar, Michael Rubinstein, Chen Sun, Oliver Wang, Andrew Owens, and Deqing Sun, “Motion prompting: Controlling video generation with motion trajectories,”arXiv preprint arXiv:2412.02700, 2024

work page arXiv 2024

[71] [71]

Phys4dgen: Physics-compliant 4d generation with multi-material composition perception

Jiajing Lin, Zhenzhong Wang, Shu Jiang, Yongjie Hou, and Min Jiang, “Phys4dgen: A physics-driven framework for controllable and efficient 4d content generation from a single image,”arXiv preprint arXiv:2411.16800, 2024

work page arXiv 2024

[72] [72]

Physgaus- sian: Physics-integrated 3d gaussians for generative dynamics,

Tianyi Xie, Zeshun Zong, Yuxing Qiu, Xuan Li, Yutao Feng, Yin Yang, and Chenfanfu Jiang, “Physgaus- sian: Physics-integrated 3d gaussians for generative dynamics,” inIEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

2024

[73] [73]

Physmotion: Physics-grounded dynamics from a single image,

Xiyang Tan, Ying Jiang, Xuan Li, Zeshun Zong, Tianyi Xie, Yin Yang, and Chenfanfu Jiang, “Physmotion: Physics-grounded dynamics from a single image,”arXiv preprint arXiv:2411.17189, 2024

work page arXiv 2024

[74] [74]

PhysDreamer: Physics-based interaction with 3d objects via video generation,

Tianyuan Zhang, Hong-Xing Yu, Rundi Wu, Brandon Y . Feng, Changxi Zheng, Noah Snavely, Jiajun Wu, and William T. Freeman, “PhysDreamer: Physics-based interaction with 3d objects via video generation,” inEuropean Conference on Computer Vision, 2024

2024

[75] [75]

Autovfx: Physically realistic video editing from natural language instructions,

Hao-Yu Hsu, Zhi-Hao Lin, Albert Zhai, Hongchi Xia, and Shenlong Wang, “Autovfx: Physically realistic video editing from natural language instructions,”arXiv preprint arXiv:2411.02394, 2024

work page arXiv 2024

[76] [76]

Physdiff: Physics-guided human motion diffusion model,

Ye Yuan, Jiaming Song, Umar Iqbal, Arash Vahdat, and Jan Kautz, “Physdiff: Physics-guided human motion diffusion model,” inInternational Conference on Computer Vision, 2023

2023

[77] [77]

Physgen: Rigid-body physics- grounded image-to-video generation,

Shaowei Liu, Zhongzheng Ren, Saurabh Gupta, and Shenlong Wang, “Physgen: Rigid-body physics- grounded image-to-video generation,” inEuropean Conference on Computer Vision, 2024

2024

[78] [78]

Motioncraft: Physics-based zero-shot video generation,

Luca Savant Aira, Antonio Montanaro, Emanuele Aiello, Diego Valsesia, and Enrico Magli, “Motioncraft: Physics-based zero-shot video generation,” inAdvances in Neural Information Processing Systems, 2024

2024

[79] [79]

Physgen3d: Crafting a miniature interactive world from a single image,

Boyuan Chen, Hanxiao Jiang, Shaowei Liu, Saurabh Gupta, Yunzhu Li, Hao Zhao, and Shenlong Wang, “Physgen3d: Crafting a miniature interactive world from a single image,” inIEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

2025

[80] [80]

Physanimator: Physics-guided generative cartoon animation,

Tianyi Xie, Yiwei Zhao, Ying Jiang, and Chenfanfu Jiang, “Physanimator: Physics-guided generative cartoon animation,” inIEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

2025