pith. sign in

arxiv: 2606.28128 · v1 · pith:4UDYIL6Mnew · submitted 2026-06-26 · 💻 cs.CV · cs.AI· cs.RO

PhysisForcing: Physics Reinforced World Simulator for Robotic Manipulation

Pith reviewed 2026-06-29 04:31 UTC · model grok-4.3

classification 💻 cs.CV cs.AIcs.RO
keywords video generationphysical consistencyrobotic manipulationworld modelsdiffusion transformersembodied simulationtrajectory alignment
0
0 comments X

The pith

PhysisForcing improves physical consistency in video-based robot world simulators by aligning DiT features at pixel and semantic levels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to show that video generation models used for embodied world simulation can be strengthened against physical errors by directing supervision toward regions that carry physics information. It traces the main sources of instability to object deformations and implausible contact correlations, then introduces two alignment losses that operate on the same diffusion transformer layers. One loss follows reference point trajectories at the pixel level; the other matches inter-region relations extracted from a separate frozen encoder at the semantic level. If the joint optimization works, the generated videos become more reliable stand-ins for real physics during robotic planning and control loops. Readers would care because better physical fidelity in these models directly raises the chance that simulated trajectories transfer to actual robot hardware.

Core claim

PhysisForcing strengthens physical consistency in embodied video generation by joint optimization of pixel-level trajectory alignment using reference point trajectories and semantic-level relational alignment using inter-region relations from a frozen video understanding encoder, both applied to DiT features. This targets the deformation of moving objects and implausible spatio-temporal correlations among interacting entities during contact.

What carries the argument

PhysisForcing framework that applies a pixel-level trajectory alignment loss and a semantic-level relational alignment loss to DiT features.

If this is right

  • Higher scores on R-Bench, PAI-Bench and EZS-Bench than the base models and than vanilla fine-tuning.
  • Closed-loop success rate under the WorldArena action-planner protocol rises from 16 percent to 24 percent.
  • Downstream policy success improves when the aligned videos serve as the world model.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same alignment approach could be tested on non-robotic video generation tasks to check whether the gains are specific to manipulation scenes.
  • Longer-horizon rollouts might reveal whether the enforced relations remain stable beyond the training clip length.
  • The method might allow smaller robot datasets to reach similar consistency levels by replacing some data volume with targeted feature supervision.

Load-bearing premise

Physical instability in the generated videos stems primarily from object deformation and implausible contact correlations that joint pixel and semantic alignment on DiT layers can correct without creating new artifacts.

What would settle it

Videos produced by the method on unseen manipulation sequences that still display discontinuous trajectories or inconsistent robot-object contacts would show the alignment losses do not suffice.

read the original abstract

Video generation models have emerged as a promising paradigm for embodied world simulation. However, both general-domain video generators and robot-specific data fine-tuned models can still produce physically implausible manipulations, including discontinuous motion trajectories and inconsistent robot-object interactions, which limits their reliability as world simulators. Through extensive experiments, we find that such physical instability mainly arises from two factors: deformation of moving objects and implausible spatio-temporal correlations among interacting entities, particularly during contact. Building on this observation, we propose PhysisForcing, a scalable training framework that strengthens physical consistency by focusing supervision on physics-informative regions through joint optimization of pixel-level and semantic-level features. The framework consists of a pixel-level trajectory alignment loss, which supervises DiT features using reference point trajectories, and a semantic-level relational alignment loss, which aligns DiT features with inter-region relations extracted from a frozen video understanding encoder. Extensive experiments on R-Bench, PAI-Bench, and EZS-Bench show that PhysisForcing consistently improves embodied video generation over strong baselines, improving the Wan2.2-I2V-A14B and Cosmos3-Nano base models on R-Bench by 22.3\% and 9.2\% (7.1\% and 3.7\% over vanilla finetuning), with the Cosmos3-Nano variant attaining the best overall score. Beyond generation, as a world model under the WorldArena action-planner protocol it raises the closed-loop success rate from 16.0\% to 24.0\% and further improves downstream policy success, indicating that physically aligned video models yield stronger representations for robotic manipulation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The paper claims that physical instability in video generation models for robotic manipulation stems primarily from object deformation and implausible spatio-temporal correlations during contact. It proposes PhysisForcing, which applies a pixel-level trajectory alignment loss (supervising DiT features with reference point trajectories) and a semantic-level relational alignment loss (aligning DiT features with inter-region relations from a frozen video encoder) to enforce physical consistency. Experiments on R-Bench, PAI-Bench, and EZS-Bench report gains of 22.3% and 9.2% (7.1% and 3.7% over vanilla finetuning) for Wan2.2-I2V-A14B and Cosmos3-Nano, with the latter achieving the best score; as a world model it also raises closed-loop success from 16.0% to 24.0% under WorldArena.

Significance. If the empirical results hold under full scrutiny, the targeted feature-alignment approach offers a scalable route to more reliable video-based world simulators, directly addressing a bottleneck in embodied planning. The downstream improvement in closed-loop success and policy performance indicates that the learned representations transfer beyond generation quality, which is a concrete strength for robotics applications.

minor comments (3)
  1. Abstract and §4: the reported percentage gains on R-Bench are given without the underlying absolute scores, standard deviations, or number of runs; adding these would allow readers to judge whether the 3.7–7.1% margins over vanilla finetuning are statistically reliable.
  2. §3.2: the precise weighting between the pixel-level and semantic-level losses is not stated; a short ablation or sensitivity table would clarify whether the joint optimization is robust or requires careful tuning.
  3. Figure 3 and Table 2: axis labels and legend entries use abbreviations (e.g., “PFA”, “TRA”) that are defined only in the caption; spelling them out or adding a small glossary would improve readability.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary, significance assessment, and recommendation of minor revision. The report accurately reflects the core claims and results of the manuscript. As no major comments were provided, we have no specific points requiring rebuttal or clarification at this stage.

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper presents an empirical method: experiments diagnose two sources of physical failure in video generation, two new alignment losses on DiT features are defined and applied during fine-tuning, and benchmark gains (R-Bench, PAI-Bench, EZS-Bench, closed-loop success) are reported. No equations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. The derivation chain consists of observation → proposed losses → measured improvement; each step is externally falsifiable on held-out benchmarks and does not reduce to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Review based solely on abstract; no free parameters, invented entities, or explicit axioms are stated beyond the domain assumption that feature alignment enforces physical consistency.

axioms (1)
  • domain assumption Physical instability in video generation mainly arises from object deformation and implausible spatio-temporal correlations during contact, and can be mitigated by supervising DiT features with trajectory and relational signals.
    This premise is stated directly in the abstract as the basis for the proposed framework.

pith-pipeline@v0.9.1-grok · 5877 in / 1348 out tokens · 40886 ms · 2026-06-29T04:31:34.299461+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

58 extracted references · 44 canonical work pages · 30 internal anchors

  1. [1]

    Cosmos 3: Omnimodal World Models for Physical AI

    Niket Agarwal, Arslan Ali, Jon Allen, Martin Antolini, Adeline Aubame, Alisson Azzolini, Junjie Bai, Maciej Bala, Yogesh Balaji, Josh Bapst, et al. Cosmos 3: Omnimodal world models for physical ai.arXiv preprint arXiv:2606.02800, 2026

  2. [2]

    World Simulation with Video Foundation Models for Physical AI

    Arslan Ali, Junjie Bai, Maciej Bala, Yogesh Balaji, Aaron Blakeman, Tiffany Cai, Jiaxin Cao, Tianshi Cao, Elizabeth Cha, Yu-Wei Chao, et al. World simulation with video foundation models for physical ai.arXiv preprint arXiv:2511.00062, 2025

  3. [3]

    Revisiting Feature Prediction for Learning Visual Representations from Video

    Adrien Bardes, Quentin Garrido, Jean Ponce, Xinlei Chen, Michael Rabbat, Yann LeCun, Mahmoud Assran, and Nicolas Ballas. Revisiting feature prediction for learning visual representations from video, 2024. URL https://arxiv.org/abs/2404.08471

  4. [4]

    Gen2Act: Human Video Generation in Novel Scenarios enables Generalizable Robot Manipulation

    Homanga Bharadhwaj, Debidatta Dwibedi, Abhinav Gupta, et al. Gen2act: Human video generation in novel scenarios enables generalizable robot manipulation.arXiv preprint arXiv:2409.16283, 2024

  5. [5]

    Genie: Generative interactive environments

    Jake Bruce, Michael D Dennis, Ashley Edwards, et al. Genie: Generative interactive environments. 2024

  6. [6]

    GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation

    Chi-Lam Cheang, Guangzeng Chen, Ya Jing, et al. Gr-2: A generative video-language-action model with web-scale knowledge for robot manipulation.arXiv preprint arXiv:2410.06158, 2024

  7. [7]

    Large Video Planner Enables Generalizable Robot Control

    Boyuan Chen, Tianyuan Zhang, Haoran Geng, Kiwhan Song, Caiyi Zhang, Peihao Li, William T Freeman, Jitendra Malik, Pieter Abbeel, Russ Tedrake, et al. Large video planner enables generalizable robot control.arXiv preprint arXiv:2512.15840, 2025

  8. [8]

    SkyReels-V2: Infinite-length Film Generative Model

    Guibin Chen, Dixuan Lin, Jiangping Yang, Chunze Lin, Junchen Zhu, Mingyuan Fan, Hao Zhang, Sheng Chen, Zheng Chen, Chengcheng Ma, Weiming Xiong, Wei Wang, Nuo Pang, Kang Kang, Zhiheng Xu, Yuzhe Jin, Yupeng Liang, Yubing Song, Peng Zhao, Boyuan Xu, Di Qiu, Debang Li, Zhengcong Fei, Yang Li, and Yahui Zhou. Skyreels-v2: Infinite-length film generative model...

  9. [9]

    Video depth anything: Consistent depth estimation for super-long videos.arXiv preprint arXiv:2501.12375, 2025

    Sili Chen, Hengkai Guo, Shengnan Zhu, et al. Video depth anything: Consistent depth estimation for super-long videos.arXiv preprint arXiv:2501.12375, 2025

  10. [10]

    RoboTwin 2.0: A Scalable Data Generator and Benchmark with Strong Domain Randomization for Robust Bimanual Robotic Manipulation

    Tianxing Chen, Zanxin Chen, Baijun Chen, Zijian Cai, Yibin Liu, Zixuan Li, Qiwei Liang, Xianliang Lin, Yiheng Ge, Zhenyu Gu, et al. Robotwin 2.0: A scalable data generator and benchmark with strong domain randomization for robust bimanual robotic manipulation.arXiv preprint arXiv:2506.18088, 2025

  11. [11]

    Abot-physworld: Interactive world foundation model for robotic manipulation with physics alignment, 2026

    Yuzhi Chen, Ronghan Chen, Dongjie Huo, Yandan Yang, Dekang Qi, Haoyun Liu, Tong Lin, Shuang Zeng, Junjin Xiao, Xinyuan Chang, Feng Xiong, Xing Wei, Zhiheng Ma, and Mu Xu. Abot-physworld: Interactive world foundation model for robotic manipulation with physics alignment, 2026. URLhttps://arxiv.org/abs/2603. 23376

  12. [12]

    Wow: Towards a world omniscient world model through embodied interaction.arXiv preprint arXiv:2509.22642, 2025

    Xiaowei Chi, Peidong Jia, Chun-Kai Fan, Xiaozhu Ju, Weishi Mi, Kevin Zhang, Zhiyuan Qin, Wanxin Tian, Kuangzhi Ge, Hao Li, et al. Wow: Towards a world omniscient world model through embodied interaction.arXiv preprint arXiv:2509.22642, 2025

  13. [13]

    Rethinking video generation model for the embodied world.arXiv preprint arXiv:2601.15282, 2026

    Yufan Deng, Zilin Pan, Hongyu Zhang, Xiaojie Li, Ruoqing Hu, Yufei Ding, Yiming Zou, Yan Zeng, and Daquan Zhou. Rethinking video generation model for the embodied world.arXiv preprint arXiv:2601.15282, 2026

  14. [14]

    Learning universal policies via text-guided video generation.Advances in Neural Information Processing Systems, 36:9156–9172, 2023

    Yilun Du, Sherry Yang, Bo Dai, et al. Learning universal policies via text-guided video generation.Advances in Neural Information Processing Systems, 36:9156–9172, 2023

  15. [15]

    Vidar: Embodied Video Diffusion Model for Generalist Manipulation

    Yao Feng, Hengkai Tan, Xinyi Mao, Chendong Xiang, Guodong Liu, Shuhe Huang, Hang Su, and Jun Zhu. Vidar: Embodied video diffusion model for generalist manipulation, 2025. URLhttps://arxiv.org/abs/2507.12898

  16. [16]

    Seedance 1.0: Exploring the Boundaries of Video Generation Models

    Yu Gao, Haoyuan Guo, Tuyen Hoang, Weilin Huang, Lu Jiang, Fangyuan Kong, Huixia Li, Jiashi Li, Liang Li, Xiaojie Li, et al. Seedance 1.0: Exploring the boundaries of video generation models.arXiv preprint arXiv:2506.09113, 2025

  17. [17]

    Veo-3 technical report

    Google DeepMind. Veo-3 technical report. 2025. URLhttps://storage.googleapis.com/deepmind-media/veo/ Veo-3-Tech-Report.pdf

  18. [18]

    Ctrl-World: A Controllable Generative World Model for Robot Manipulation

    Yanjiang Guo, Lucy Xiaoyang Shi, Jianyu Chen, and Chelsea Finn. Ctrl-world: A controllable generative world model for robot manipulation.arXiv preprint arXiv:2510.10125, 2025

  19. [19]

    LTX-Video: Realtime Video Latent Diffusion

    Yoav HaCohen, Nisan Chiprut, Benny Brazowski, Daniel Shalem, Dudu Moshe, Eitan Richardson, Eran Levin, Guy Shiran, Nir Zabari, Ori Gordon, Poriya Panet, Sapir Weissbuch, Victor Kulikov, Yaki Bitterman, Zeev Melumian, and Ofir Bibi. Ltx-video: Realtime video latent diffusion.arXiv preprint arXiv:2501.00103, 2024. 11

  20. [20]

    LTX-2: Efficient Joint Audio-Visual Foundation Model

    Yoav HaCohen, Benny Brazowski, Nisan Chiprut, Yaki Bitterman, Andrew Kvochko, Avishai Berkowitz, Daniel Shalem, Daphna Lifschitz, Dudu Moshe, Eitan Porat, et al. Ltx-2: Efficient joint audio-visual foundation model. arXiv preprint arXiv:2601.03233, 2026

  21. [21]

    Hailuo.Hailuo Lab, 2025

    Hailuo. Hailuo.Hailuo Lab, 2025. URLhttps://hailuoai.video/

  22. [22]

    Alltracker: Efficient dense point tracking at high resolution

    Adam W Harley, Yang You, Xinglong Sun, et al. Alltracker: Efficient dense point tracking at high resolution. 2025

  23. [23]

    Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations

    Yucheng Hu, Yanjiang Guo, Pengchao Wang, et al. Video prediction policy: A generalist robot policy with predictive visual representations.arXiv preprint arXiv:2412.14803, 2024

  24. [24]

    DreamGen: Unlocking Generalization in Robot Learning through Video World Models

    Joel Jang, Seonghyeon Ye, Zongyu Lin, Jiannan Xiang, Johan Bjorck, Yu Fang, Fengyuan Hu, Spencer Huang, Kaushil Kundalia, Yen-Chen Lin, Loic Magne, Ajay Mandlekar, Avnish Narayan, You Liang Tan, Guanzhi Wang, Jing Wang, Qi Wang, Yinzhen Xu, Xiaohui Zeng, Kaiyuan Zheng, Ruijie Zheng, Ming-Yu Liu, Luke Zettlemoyer, Dieter Fox, Jan Kautz, Scott Reed, Yuke Zh...

  25. [25]

    Cotracker3: Simpler and better point tracking by pseudo-labelling real videos, 2024

    Nikita Karaev, Iurii Makarov, Jianyuan Wang, Natalia Neverova, Andrea Vedaldi, and Christian Rupprecht. Cotracker3: Simpler and better point tracking by pseudo-labelling real videos, 2024. URLhttps://arxiv.org/ abs/2410.11831

  26. [26]

    Cotracker: It is better to track together, 2024

    Nikita Karaev, Ignacio Rocco, Benjamin Graham, Natalia Neverova, Andrea Vedaldi, and Christian Rupprecht. Cotracker: It is better to track together, 2024. URLhttps://arxiv.org/abs/2307.07635

  27. [27]

    Image to video elements feature, 2025

    Kling. Image to video elements feature, 2025. URLhttps://klingai.com/global/

  28. [28]

    HunyuanVideo: A Systematic Framework For Large Video Generative Models

    Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, Kathrina Wu, Qin Lin, Junkun Yuan, Yanxin Long, Aladdin Wang, Andong Wang, Changlin Li, Duojun Huang, Fang Yang, Hao Tan, Hongmei Wang, Jacob Song, Jiawang Bai, Jianbing Wu, Jinbao Xue, Joey Wang, Kai Wang, Mengyang Liu, Pengyu Li, Shuai Li, ...

  29. [29]

    Causal World Modeling for Robot Control

    Lin Li, Qihang Zhang, Yiming Luo, et al. Causal world modeling for robot control.arXiv preprint arXiv:2601.21998, 2026

  30. [30]

    Unified Video Action Model

    Shuang Li, Yihuai Gao, Dorsa Sadigh, and Shuran Song. Unified video action model.arXiv preprint arXiv:2503.00200, 2025

  31. [31]

    Video Generators are Robot Policies

    Junbang Liang, Pavel Tokmakov, Ruoshi Liu, et al. Video generators are robot policies.arXiv preprint arXiv:2508.00795, 2025

  32. [32]

    Genie Envisioner: A Unified World Foundation Platform for Robotic Manipulation

    Yue Liao, Pengfei Zhou, Siyuan Huang, et al. Genie envisioner: A unified world foundation platform for robotic manipulation.arXiv preprint arXiv:2508.05635, 2025

  33. [33]

    Open x-embodiment: Robotic learning datasets and rt-x models

    Abby O’Neill, Abdul Rehman, Abhiram Maddukuri, et al. Open x-embodiment: Robotic learning datasets and rt-x models. In2024 IEEE International Conference on Robotics and Automation (ICRA), pages 6892–6903, 2024

  34. [34]

    OpenAI. Sora. 2024. URLhttps://openai.com/sora/

  35. [35]

    Sora2, 2025

    OpenAI. Sora2, 2025. URLhttps://openai.com/zh-Hans-CN/index/sora-2/

  36. [36]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeffrey Wu, Xu Jiang, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022

  37. [37]

    Direct preference optimization: Your language model is secretly a reward model.Advances in Neural Information Processing Systems, 36, 2024

    Rafael Rafailov, Archit Sharma, Eric Mitchell, et al. Direct preference optimization: Your language model is secretly a reward model.Advances in Neural Information Processing Systems, 36, 2024

  38. [38]

    Seedance 1.5 pro: A Native Audio-Visual Joint Generation Foundation Model

    Team Seedance, Heyi Chen, Siyan Chen, Xin Chen, Yanfei Chen, Ying Chen, Zhuo Chen, Feng Cheng, Tianheng Cheng, Xinqi Cheng, et al. Seedance 1.5 pro: A native audio-visual joint generation foundation model.arXiv preprint arXiv:2512.13507, 2025

  39. [39]

    Roboscape: Physics-informed embodied world model, 2025

    Yu Shang, Xin Zhang, Yinzhou Tang, Lei Jin, Chen Gao, Wei Wu, and Yong Li. Roboscape: Physics-informed embodied world model, 2025. URLhttps://arxiv.org/abs/2506.23135

  40. [40]

    Worldarena: A unified benchmark for evaluating perception and functional utility of embodied world models.arXiv preprint arXiv:2602.08971, 2026

    Yu Shang, Zhuohang Li, Yiding Ma, Weikang Su, Xin Jin, Ziyou Wang, Lei Jin, Xin Zhang, Yinzhou Tang, Haisheng Su, et al. Worldarena: A unified benchmark for evaluating perception and functional utility of embodied world models.arXiv preprint arXiv:2602.08971, 2026. 12

  41. [41]

    DINOv3

    Oriane Siméoni, Huy V Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, et al. Dinov3.arXiv preprint arXiv:2508.10104, 2025

  42. [42]

    Gigaworld-0: World models as data engine to empower embodied ai.arXiv preprint arXiv:2511.19861, 2025

    GigaWorld Team, Angen Ye, Boyuan Wang, Chaojun Ni, Guan Huang, Guosheng Zhao, Haoyun Li, Jiagang Zhu, Kerui Li, Mengyuan Xu, et al. Gigaworld-0: World models as data engine to empower embodied ai.arXiv preprint arXiv:2511.19861, 2025

  43. [43]

    Longcat-video technical report, 2025

    Meituan LongCat Team, Xunliang Cai, Qilong Huang, Zhuoliang Kang, Hongyu Li, Shijun Liang, Liya Ma, Siyu Ren, Xiaoming Wei, Rixu Xie, and Tong Zhang. Longcat-video technical report, 2025. URL https: //arxiv.org/abs/2510.22200

  44. [44]

    Unifolm-wma-0: A world-model-action (wma) framework under unifolm family, 2025

    Unitree. Unifolm-wma-0: A world-model-action (wma) framework under unifolm family, 2025

  45. [45]

    Wan: Open and Advanced Large-Scale Video Generative Models

    Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, Jianyuan Zeng, Jiayu Wang, Jingfeng Zhang, Jingren Zhou, Jinkai Wang, Jixuan Chen, Kai Zhu, Kang Zhao, Keyu Yan, Lianghua Huang, Mengyang Feng, Ningyi Zhang, Pandeng Li, Pingyu Wu, Ruihang Chu, Ruili Feng, Shiwei Zhang, Siyang Sun, Tao Fang, T...

  46. [46]

    Robogen: Towards unleashing infinite data for automated robot learning via generative simulation.arXiv preprint arXiv:2311.01455, 2023

    Yufei Wang, Zhou Xian, Feng Chen, et al. Robogen: Towards unleashing infinite data for automated robot learning via generative simulation.arXiv preprint arXiv:2311.01455, 2023

  47. [47]

    HunyuanVideo 1.5 Technical Report

    Bing Wu, Chang Zou, Changlin Li, Duojun Huang, Fang Yang, Hao Tan, Jack Peng, Jianbing Wu, Jiangfeng Xiong, Jie Jiang, et al. Hunyuanvideo 1.5 technical report.arXiv preprint arXiv:2511.18870, 2025

  48. [48]

    Depth Anything V2

    Lihe Yang, Bingyi Kang, Zilong Huang, Zhen Zhao, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. Depth anything v2, 2024. URLhttps://arxiv.org/abs/2406.09414

  49. [49]

    Physics-driven data generation for contact-rich manipulation via trajectory optimization.arXiv preprint arXiv:2502.20382, 2025

    Lujie Yang, HJ Suh, Tong Zhao, et al. Physics-driven data generation for contact-rich manipulation via trajectory optimization.arXiv preprint arXiv:2502.20382, 2025

  50. [50]

    CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

    Zhuoyi Yang, Jiayan Teng, Wendi Zheng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer.arXiv preprint arXiv:2408.06072, 2024

  51. [51]

    World Action Models are Zero-shot Policies

    Seonghyeon Ye, Yunhao Ge, Kaiyuan Zheng, et al. World action models are zero-shot policies.arXiv preprint arXiv:2602.15922, 2026

  52. [52]

    Fast-WAM: Do World Action Models Need Test-time Future Imagination?

    Tianyuan Yuan, Zibin Dong, Yicheng Liu, and Hang Zhao. Fast-wam: Do world action models need test-time future imagination?arXiv preprint arXiv:2603.16666, 2026

  53. [53]

    Packing input frame contexts in next-frame prediction models for video generation.Arxiv, 2025

    Lvmin Zhang and Maneesh Agrawala. Packing input frame contexts in next-frame prediction models for video generation.Arxiv, 2025

  54. [54]

    MIND-V: Hierarchical World Model for Long-Horizon Robotic Manipulation with RL-based Physical Alignment

    Ruicheng Zhang, Mingyang Zhang, Jun Zhou, Zhangrui Guo, Zunnan Xu, Xiaofan Liu, Zhizhou Zhong, Puxin Yan, Haocheng Luo, and Xiu Li. Mind-v: Hierarchical world model for long-horizon robotic manipulation with rl-based physical alignment, 2026. URLhttps://arxiv.org/abs/2512.06628

  55. [55]

    Videorepa: Learning physics for video generation through relational alignment with foundation models.arXiv preprint arXiv:2505.23656, 2025

    Xiangdong Zhang, Jiaqi Liao, Shaofeng Zhang, Fanqing Meng, Xiangpeng Wan, Junchi Yan, and Yu Cheng. Videorepa: Learning physics for video generation through relational alignment with foundation models.arXiv preprint arXiv:2505.23656, 2025

  56. [56]

    Tesseract: Learning 4d embodied world models.arXiv preprint arXiv:2504.20995, 2025

    Haoyu Zhen, Qiao Sun, Hongxin Zhang, et al. Tesseract: Learning 4d embodied world models.arXiv preprint arXiv:2504.20995, 2025

  57. [57]

    Pai-bench: A comprehensive benchmark for physical ai.arXiv preprint arXiv:2512.01989, 2025

    Fengzhe Zhou, Jiannan Huang, Jialuo Li, Deva Ramanan, and Humphrey Shi. Pai-bench: A comprehensive benchmark for physical ai.arXiv preprint arXiv:2512.01989, 2025

  58. [58]

    RoboDreamer: Learning Compositional World Models for Robot Imagination

    Siyuan Zhou, Yilun Du, Jiaben Chen, et al. Robodreamer: Learning compositional world models for robot imagination.arXiv preprint arXiv:2404.12377, 2024. 13 PhysisForcing: Physics Reinforced World Simulator for Robotic Manipulation Appendix A More implementation details Thissectioncomplementsthemainpaperwiththeconcreteconfigurationoftheauxiliaryperceptionm...