PhysisForcing: Physics Reinforced World Simulator for Robotic Manipulation
Pith reviewed 2026-06-29 04:31 UTC · model grok-4.3
The pith
PhysisForcing improves physical consistency in video-based robot world simulators by aligning DiT features at pixel and semantic levels.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
PhysisForcing strengthens physical consistency in embodied video generation by jointly optimizing two losses on DiT features: a pixel-level trajectory alignment loss supervised by reference point trajectories, and a semantic-level relational alignment loss that matches inter-region relations extracted from a frozen video understanding encoder. The two losses target the failure modes the paper identifies: deformation of moving objects and implausible spatio-temporal correlations among interacting entities, particularly during contact.
What carries the argument
PhysisForcing framework that applies a pixel-level trajectory alignment loss and a semantic-level relational alignment loss to DiT features.
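As a rough illustration of how two such losses could be combined, the sketch below uses plain Python. It is not the paper's formulation: this summary does not state the exact loss forms, so the cosine-drift trajectory term, the similarity-matrix relational term, the data layout, and the weights `w_traj` and `w_rel` are all assumptions chosen for illustration.

```python
import math


def cosine(u, v, eps=1e-8):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + eps)


def trajectory_alignment_loss(feats, tracks):
    """Mean feature drift along tracked points (ASSUMED form, not from the paper).

    feats[t][r][c] is the feature vector at frame t, grid cell (r, c).
    tracks[t][p] is the (row, col) of reference point p at frame t.
    A point that keeps the same feature along its track contributes ~0.
    """
    drifts = []
    for t, points in enumerate(tracks):
        for p, (r, c) in enumerate(points):
            r0, c0 = tracks[0][p]  # location of the same point in frame 0
            drifts.append(1.0 - cosine(feats[t][r][c], feats[0][r0][c0]))
    return sum(drifts) / len(drifts)


def similarity_matrix(tokens):
    """Pairwise cosine similarities between region tokens."""
    return [[cosine(u, v) for v in tokens] for u in tokens]


def relational_alignment_loss(tokens, teacher_tokens):
    """Mean absolute gap between two similarity matrices (ASSUMED form).

    tokens come from the generator, teacher_tokens from a frozen encoder.
    Only the relations are compared, so the two feature widths may differ.
    """
    s = similarity_matrix(tokens)
    t = similarity_matrix(teacher_tokens)
    n = len(tokens)
    return sum(abs(s[i][j] - t[i][j]) for i in range(n) for j in range(n)) / (n * n)


def joint_alignment_loss(feats, tracks, tokens, teacher_tokens,
                         w_traj=1.0, w_rel=1.0):
    """Weighted sum of both terms; the weights are hypothetical hyperparameters."""
    return (w_traj * trajectory_alignment_loss(feats, tracks)
            + w_rel * relational_alignment_loss(tokens, teacher_tokens))
```

In an actual training loop these terms would be computed on intermediate DiT activations with an autodiff framework and added to the usual generation objective. Plain Python is used here only to keep the arithmetic visible.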
If this is right
- Higher scores on R-Bench, PAI-Bench and EZS-Bench than the base models and than vanilla fine-tuning.
- Closed-loop success rate under the WorldArena action-planner protocol rises from 16.0% to 24.0%.
- Downstream policy success improves when the physically aligned video model serves as the world model.
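To put the closed-loop figures in context, the short sketch below separates the absolute gain from the relative gain and adds a Wilson score interval for each rate. The two rates come from the summary above. The trial count is hypothetical, since the number of evaluation episodes is not reported here, and the helper names are illustrative.

```python
import math


def absolute_gain(before, after):
    """Difference in percentage points between two rates given in percent."""
    return after - before


def relative_gain(before, after):
    """Improvement expressed as a fraction of the starting rate."""
    return (after - before) / before


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    p = successes / trials
    denom = 1.0 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half, center + half


# Rates reported in the summary above (percent).
print(absolute_gain(16.0, 24.0))  # percentage points
print(relative_gain(16.0, 24.0))  # fraction of the baseline rate

# HYPOTHETICAL trial count: the summary does not say how many episodes were run.
trials = 50
for rate in (0.16, 0.24):
    print(rate, wilson_interval(round(rate * trials), trials))
```

The gain is 8 percentage points, which is 50% relative to the 16.0% baseline. Under the hypothetical count of 50 episodes the two intervals would overlap, so the number of evaluation episodes matters for judging how firm the improvement is.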
Where Pith is reading between the lines
- The same alignment approach could be tested on non-robotic video generation tasks to check whether the gains are specific to manipulation scenes.
- Longer-horizon rollouts might reveal whether the enforced relations remain stable beyond the training clip length.
- The method might allow smaller robot datasets to reach similar consistency levels by replacing some data volume with targeted feature supervision.
Load-bearing premise
Physical instability in the generated videos stems primarily from object deformation and implausible contact correlations that joint pixel and semantic alignment on DiT layers can correct without creating new artifacts.
What would settle it
Videos produced by the method on unseen manipulation sequences that still display discontinuous trajectories or inconsistent robot-object contacts would show the alignment losses do not suffice.
Original abstract
Video generation models have emerged as a promising paradigm for embodied world simulation. However, both general-domain video generators and robot-specific data fine-tuned models can still produce physically implausible manipulations, including discontinuous motion trajectories and inconsistent robot-object interactions, which limits their reliability as world simulators. Through extensive experiments, we find that such physical instability mainly arises from two factors: deformation of moving objects and implausible spatio-temporal correlations among interacting entities, particularly during contact. Building on this observation, we propose PhysisForcing, a scalable training framework that strengthens physical consistency by focusing supervision on physics-informative regions through joint optimization of pixel-level and semantic-level features. The framework consists of a pixel-level trajectory alignment loss, which supervises DiT features using reference point trajectories, and a semantic-level relational alignment loss, which aligns DiT features with inter-region relations extracted from a frozen video understanding encoder. Extensive experiments on R-Bench, PAI-Bench, and EZS-Bench show that PhysisForcing consistently improves embodied video generation over strong baselines, improving the Wan2.2-I2V-A14B and Cosmos3-Nano base models on R-Bench by 22.3% and 9.2% (7.1% and 3.7% over vanilla finetuning), with the Cosmos3-Nano variant attaining the best overall score. Beyond generation, as a world model under the WorldArena action-planner protocol it raises the closed-loop success rate from 16.0% to 24.0% and further improves downstream policy success, indicating that physically aligned video models yield stronger representations for robotic manipulation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that physical instability in video generation models for robotic manipulation stems primarily from object deformation and implausible spatio-temporal correlations during contact. It proposes PhysisForcing, which applies a pixel-level trajectory alignment loss (supervising DiT features with reference point trajectories) and a semantic-level relational alignment loss (aligning DiT features with inter-region relations from a frozen video understanding encoder) to enforce physical consistency. Experiments on R-Bench, PAI-Bench, and EZS-Bench report consistent improvements over strong baselines. On R-Bench, the method improves the Wan2.2-I2V-A14B and Cosmos3-Nano base models by 22.3% and 9.2% respectively (7.1% and 3.7% over vanilla finetuning), with the Cosmos3-Nano variant achieving the best overall score. As a world model under the WorldArena action-planner protocol, it also raises the closed-loop success rate from 16.0% to 24.0%.
Significance. If the empirical results hold under full scrutiny, the targeted feature-alignment approach offers a scalable route to more reliable video-based world simulators, directly addressing a bottleneck in embodied planning. The downstream improvement in closed-loop success and policy performance indicates that the learned representations transfer beyond generation quality, which is a concrete strength for robotics applications.
minor comments (3)
- Abstract and §4: the reported percentage gains on R-Bench are given without the underlying absolute scores, standard deviations, or number of runs; adding these would allow readers to judge whether the 3.7–7.1% margins over vanilla finetuning are statistically reliable.
- §3.2: the precise weighting between the pixel-level and semantic-level losses is not stated; a short ablation or sensitivity table would clarify whether the joint optimization is robust or requires careful tuning.
- Figure 3 and Table 2: axis labels and legend entries use abbreviations (e.g., “PFA”, “TRA”) that are defined only in the caption; spelling them out or adding a small glossary would improve readability.
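The weighting question raised in the second comment can be made concrete with a generic weighted objective. The form below is an assumption for illustration only: the symbols $\mathcal{L}_{\text{gen}}$ (the base generation loss), $\mathcal{L}_{\text{traj}}$, $\mathcal{L}_{\text{rel}}$, $\lambda_{\text{traj}}$ and $\lambda_{\text{rel}}$ are not taken from the paper, which may combine the terms differently.

```latex
\mathcal{L}_{\text{total}}
  = \mathcal{L}_{\text{gen}}
  + \lambda_{\text{traj}}\,\mathcal{L}_{\text{traj}}
  + \lambda_{\text{rel}}\,\mathcal{L}_{\text{rel}}
```

Under this reading, the requested sensitivity table would report benchmark scores over a small grid of $(\lambda_{\text{traj}}, \lambda_{\text{rel}})$ values, including the settings where one weight is zero.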
Simulated Author's Rebuttal
We thank the referee for the positive summary, significance assessment, and recommendation of minor revision. The report accurately reflects the core claims and results of the manuscript. As no major comments were provided, we have no specific points requiring rebuttal or clarification at this stage.
Circularity Check
No significant circularity
full rationale
The paper presents an empirical method: experiments diagnose two sources of physical failure in video generation, two new alignment losses on DiT features are defined and applied during fine-tuning, and benchmark gains (R-Bench, PAI-Bench, EZS-Bench, closed-loop success) are reported. No equations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. The derivation chain consists of observation → proposed losses → measured improvement; each step is externally falsifiable on held-out benchmarks and does not reduce to its own inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Physical instability in video generation mainly arises from object deformation and implausible spatio-temporal correlations during contact, and can be mitigated by supervising DiT features with trajectory and relational signals.
Appendix A: More implementation details
This section complements the main paper with the concrete configuration of the auxiliary perception m...