pith. machine review for the scientific record.

arxiv: 2604.04707 · v1 · submitted 2026-04-06 · 💻 cs.CV

Recognition: no theorem link

OpenWorldLib: A Unified Codebase and Definition of Advanced World Models

Bohan Zeng, Bozhou Li, Chengzhuo Tong, Daili Hua, DataFlow Team, Hao Liang, Hongcheng Gao, Huanyao Zhang, Jialong Wu, Jianbin Zhao, Junbo Niu, Kaixin Zhu, Meiyi Qiang, Mike Zheng Shou, Mingkun Chang, Minglei Shi, Pengfei Wan, Qinhan Yu, Ruichuan An, Runhao Zhao, Tianyi Bai, Tianyu Guo, Wentao Zhang, Xiaochen Ma, Xinlong Chen, Xintao Wang, Xinyi Huang, Yang Shi, Yifan Dai, Yifan Yang, Yiren Song, Yisheng Pan, Yiwen Tang, Yuanxing Zhang, Yue Ding, Yuran Wang, Zekun Wang, Zhengpin Li, Zhiyou Xiao, Zhou Liu, Zimo Meng

Pith reviewed 2026-05-10 19:32 UTC · model grok-4.3

classification 💻 cs.CV
keywords world models · unified framework · perception · interaction · long-term memory · codebase · inference · AI definition

The pith

A unified definition and codebase position world models as perception-centered systems equipped with interaction and long-term memory to understand and predict complex environments.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a standardized definition of advanced world models and releases OpenWorldLib as an integrated inference framework. It centers the definition on perception while requiring capabilities for interaction and memory to handle real-world complexity. By categorizing essential capabilities and merging models from separate tasks into one codebase, the work aims to support reuse and joint operation. A sympathetic reader would care because this approach could reduce duplication in AI research and allow models to share components when simulating dynamic scenes or environments.

Core claim

The paper defines a world model as a model or framework centered on perception, equipped with interaction and long-term memory capabilities, for understanding and predicting the complex world. It presents OpenWorldLib as the unified codebase that integrates models across tasks to enable efficient reuse and collaborative inference, while also providing a systematic categorization of required capabilities and reflections on future research directions.

What carries the argument

OpenWorldLib, the unified inference framework that merges perception-centered models with interaction and memory modules to support cross-task reuse and joint operation.
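
To make the shape of that machinery concrete, here is a minimal sketch of how a perception-centered model could be composed with interaction and long-term memory modules. The names (WorldModel, Perception.encode, Interaction.step, the memory's write/read) are illustrative assumptions made for this review, not interfaces taken from the OpenWorldLib codebase.

```python
# Illustrative sketch only: these protocols and classes are assumptions made for
# this review, not OpenWorldLib's actual API.
from dataclasses import dataclass, field
from typing import Any, List, Protocol


class Perception(Protocol):
    def encode(self, observation: Any) -> Any: ...      # raw observation -> latent state


class Interaction(Protocol):
    def step(self, state: Any, action: Any) -> Any: ...  # latent state + action -> next state


@dataclass
class ListMemory:
    """Trivial long-term memory: append every latent state and return the history."""
    states: List[Any] = field(default_factory=list)

    def write(self, state: Any) -> None:
        self.states.append(state)

    def read(self) -> List[Any]:
        return list(self.states)


@dataclass
class WorldModel:
    """Perception-centered composition: perceive, remember, then roll out under actions."""
    perception: Perception
    interaction: Interaction
    memory: ListMemory

    def observe(self, observation: Any) -> Any:
        state = self.perception.encode(observation)
        self.memory.write(state)
        return state

    def rollout(self, state: Any, actions: List[Any]) -> List[Any]:
        trajectory = []
        for action in actions:
            state = self.interaction.step(state, action)  # predict the next latent state
            self.memory.write(state)                      # keep it for long-horizon context
            trajectory.append(state)
        return trajectory
```

Any task-specific model that can expose an encode and a step in this shape could, in principle, be dropped into the same loop, which is the reuse property the paper claims.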

If this is right

  • Models developed for one world-model task become directly usable in others without major rewrites.
  • Perception, interaction, and memory components can operate together during a single inference pass.
  • Capability categorization provides a shared checklist for comparing and extending existing models.
  • Future extensions can add new modules while staying compatible with the existing structure.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The emphasis on long-term memory could shift design priorities toward architectures that maintain state over extended sequences rather than short-term predictions alone.
  • A shared codebase might surface hidden commonalities between vision-only and action-conditioned world models that separate implementations obscure.
  • Testing the framework on embodied robotics benchmarks could reveal whether the perception-first definition scales when real sensor noise and physical constraints are present.

Load-bearing premise

The proposed definition and unified framework will allow efficient reuse and collaborative inference across tasks without creating incompatibilities or reducing performance.

What would settle it

Implementing separate task models inside OpenWorldLib and measuring whether combined inference is slower or less accurate than running each model individually would settle it: measurable losses would falsify the claim.
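
One way to run that test, sketched under assumptions rather than taken from the paper: time each task head on its own (re-encoding the observation each time), then time a single combined pass that shares one perception encoding across all heads, and check task accuracy the same way. The harness below is hypothetical; encode and the task heads stand in for whatever models OpenWorldLib actually wraps.

```python
# Hypothetical falsification harness; not the paper's evaluation code.
import time
from typing import Any, Callable, Dict, List


def avg_seconds(fn: Callable[[], Any], repeats: int = 10) -> float:
    start = time.perf_counter()
    for _ in range(repeats):
        fn()
    return (time.perf_counter() - start) / repeats


def compare_inference(encode: Callable[[Any], Any],
                      heads: List[Callable[[Any], Any]],
                      observation: Any) -> Dict[str, float]:
    # Separate runs: each task re-encodes the observation before its own head.
    separate = sum(avg_seconds(lambda h=h: h(encode(observation))) for h in heads)

    # Combined run: encode once, then fan the shared state out to every task head.
    def combined_pass() -> List[Any]:
        state = encode(observation)
        return [head(state) for head in heads]

    combined = avg_seconds(combined_pass)
    return {
        "separate_total_s": separate,
        "combined_s": combined,
        "speedup": separate / combined if combined > 0 else float("inf"),
    }
```

A speedup below 1.0, or any drop in per-task accuracy relative to the standalone runs, would be the measurable loss that falsifies the premise.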

read the original abstract

World models have garnered significant attention as a promising research direction in artificial intelligence, yet a clear and unified definition remains lacking. In this paper, we introduce OpenWorldLib, a comprehensive and standardized inference framework for Advanced World Models. Drawing on the evolution of world models, we propose a clear definition: a world model is a model or framework centered on perception, equipped with interaction and long-term memory capabilities, for understanding and predicting the complex world. We further systematically categorize the essential capabilities of world models. Based on this definition, OpenWorldLib integrates models across different tasks within a unified framework, enabling efficient reuse and collaborative inference. Finally, we present additional reflections and analyses on potential future directions for world model research. Code link: https://github.com/OpenDCAI/OpenWorldLib

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 0 minor

Summary. The manuscript introduces OpenWorldLib, a unified codebase and inference framework for advanced world models. It proposes a definition of a world model as a perception-centered model or framework equipped with interaction and long-term memory capabilities for understanding and predicting the complex world. It systematically categorizes essential capabilities, integrates models across different tasks in a unified framework to enable efficient reuse and collaborative inference, and offers reflections on future research directions.

Significance. If the integration claim holds and OpenWorldLib successfully enables reuse and collaborative inference without introducing incompatibilities or performance losses, the work could help standardize terminology and infrastructure in the growing area of world models, facilitating community collaboration through an open codebase and capability categorization.

major comments (1)
  1. [Abstract] The central claim that OpenWorldLib 'integrates models across different tasks within a unified framework, enabling efficient reuse and collaborative inference' is load-bearing for the paper's value as a unified framework, yet the manuscript provides no interface specifications, adapter details, overhead measurements, cross-task performance retention numbers, or pseudocode for the unified inference path to support it.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive review and for highlighting the need to better substantiate the integration claims. We have revised the manuscript to address this by expanding the framework description with the requested details.

read point-by-point responses
  1. Referee: [Abstract] The central claim that OpenWorldLib 'integrates models across different tasks within a unified framework, enabling efficient reuse and collaborative inference' is load-bearing for the paper's value as a unified framework, yet the manuscript provides no interface specifications, adapter details, overhead measurements, cross-task performance retention numbers, or pseudocode for the unified inference path to support it.

    Authors: We agree that the abstract claim would be strengthened by explicit supporting material in the text. The original manuscript provided a high-level overview of the unified framework and pointed to the open codebase for implementation details. In the revised version, we have added a dedicated subsection on the integration architecture that specifies the core interfaces, describes the adapter mechanisms for task-specific models, and includes pseudocode for the collaborative inference pipeline. We have also incorporated empirical results from our evaluations showing low computational overhead and high cross-task performance retention, confirming that the unification does not introduce incompatibilities or significant losses. These additions appear in the updated Sections 3 and 4. Revision: yes.
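
For readers without access to the revised sections, the adapter-and-pipeline pattern the rebuttal describes usually looks something like the sketch below. This is an illustration of the general mechanism, assumed for this review; the names and signatures are not taken from the updated manuscript or the OpenWorldLib code.

```python
# Generic adapter/pipeline pattern, assumed for illustration; not OpenWorldLib's code.
from typing import Any, Callable, Dict, List


class TaskAdapter:
    """Wrap a heterogeneous task model so the pipeline can call it uniformly."""

    def __init__(self, name: str, model: Callable[[Any], Any],
                 to_inputs: Callable[[Any], Any],
                 from_outputs: Callable[[Any], Any]) -> None:
        self.name = name
        self.model = model
        self.to_inputs = to_inputs        # shared state -> model-specific inputs
        self.from_outputs = from_outputs  # model-specific outputs -> shared format

    def run(self, shared_state: Any) -> Any:
        return self.from_outputs(self.model(self.to_inputs(shared_state)))


def collaborative_inference(shared_state: Any,
                            adapters: List[TaskAdapter]) -> Dict[str, Any]:
    """One pass over a shared state, collecting every task's output by adapter name."""
    return {adapter.name: adapter.run(shared_state) for adapter in adapters}
```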

Circularity Check

0 steps flagged

No circularity: definition proposed directly and framework presented as engineering integration

full rationale

The paper states a definition of world models drawn from field evolution and describes OpenWorldLib as a codebase that integrates models based on that definition. No mathematical derivation chain, equations, fitted parameters, predictions, or self-citations are used to justify core claims. The integration assertion is a design statement rather than a result that reduces to its own inputs by construction. This matches the expected non-circular outcome for a definitional and engineering paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper rests on the assumption that a single definition can unify diverse world-model approaches and that a shared codebase will improve reuse; no free parameters, new entities, or non-standard axioms are introduced.

pith-pipeline@v0.9.0 · 5592 in / 1045 out tokens · 51453 ms · 2026-05-10T19:32:15.301056+00:00 · methodology

discussion (0)


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. TorchUMM: A Unified Multimodal Model Codebase for Evaluation, Analysis, and Post-training

    cs.AI · 2026-04 · unverdicted · novelty 4.0

    TorchUMM is the first unified codebase and benchmark suite for standardized evaluation of diverse unified multimodal models on understanding, generation, and editing tasks.

Reference graph

Works this paper leans on

168 extracted references · 132 canonical work pages · cited by 1 Pith paper · 35 internal anchors

  1. [1]

    Cosmos World Foundation Model Platform for Physical AI

    Niket Agarwal, Arslan Ali, Maciej Bala, Yogesh Balaji, Erik Barker, Tiffany Cai, Prithvijit Chattopadhyay, Yongxin Chen, Yin Cui, Yifan Ding, et al. Cosmos world foundation model platform for physical ai.arXiv preprint arXiv:2501.03575, 2025

  2. [2]

    World Simulation with Video Foundation Models for Physical AI

    Arslan Ali, Junjie Bai, Maciej Bala, Yogesh Balaji, Aaron Blakeman, Tiffany Cai, Jiaxin Cao, Tianshi Cao, Elizabeth Cha, Yu-Wei Chao, et al. World simulation with video foundation models for physical ai.arXiv preprint arXiv:2511.00062, 2025

  3. [3]

    UniCTokens: Boosting personalized understanding and generation via unified concept tokens

    Ruichuan An, Sihan Yang, Renrui Zhang, Zijun Shen, Ming Lu, Gaole Dai, Hao Liang, Ziyu Guo, Shilin Yan, Yulin Luo, et al. Unictokens: Boosting personalized understanding and generation via unified concept tokens. arXiv preprint arXiv:2505.14671, 2025

  4. [4]

    Genius: Generative fluid intelligence evaluation suite

    Ruichuan An, Sihan Yang, Ziyu Guo, Wei Dai, Zijun Shen, Haodong Li, Renrui Zhang, Xinyu Wei, Guopeng Li, Wenshan Wu, et al. Genius: Generative fluid intelligence evaluation suite. arXiv preprint arXiv:2602.11144, 2026

  5. [5]

    V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

    Mido Assran, Adrien Bardes, David Fan, Quentin Garrido, Russell Howes, Matthew Muckley, Ammar Rizvi, Claire Roberts, Koustuv Sinha, Artem Zholus, et al. V-jepa 2: Self-supervised video models enable understanding, prediction and planning.arXiv preprint arXiv:2506.09985, 2025

  6. [6]

    Recammaster: Camera-controlled generative rendering from a single video

    Jianhong Bai, Menghan Xia, Xiao Fu, Xintao Wang, Lianrui Mu, Jinwen Cao, Zuozhu Liu, Haoji Hu, Xiang Bai, Pengfei Wan, et al. Recammaster: Camera-controlled generative rendering from a single video. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 14834–14844, 2025

  7. [7]

    Qwen Technical Report

    Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report.arXiv preprint arXiv:2309.16609, 2023

  8. [8]

    A survey of multimodal large language model from a data-centric perspective

    Tianyi Bai, Hao Liang, Binwang Wan, Yanran Xu, Xi Li, Shiyu Li, Ling Yang, Bozhou Li, Yifan Wang, Bin Cui, et al. A survey of multimodal large language model from a data-centric perspective. arXiv preprint arXiv:2405.16640, 2024

  9. [9]

    Multi-step visual reasoning with visual tokens scaling and verification

    Tianyi Bai, Zengjie Hu, Fupeng Sun, Jiantao Qiu, Yizhen Jiang, Guangxin He, Bohan Zeng, Conghui He, Binhang Yuan, and Wentao Zhang. Multi-step visual reasoning with visual tokens scaling and verification. arXiv preprint arXiv:2506.07235, 2025

  10. [10]

    Positional encoding field

    Yunpeng Bai, Haoxiang Li, and Qixing Huang. Positional encoding field. arXiv preprint arXiv:2510.20385, 2025

  11. [11]

    The safety challenge of world models for embodied ai agents: a review

    Lorenzo Baraldi, Zifan Zeng, Chongzhe Zhang, Aradhana Nayak, Hongbo Zhu, Feng Liu, Qunli Zhang, Peng Wang, Shiming Liu, Zheng Hu, et al. The safety challenge of world models for embodied ai agents: a review. arXiv preprint arXiv:2510.05865, 2025

  12. [12]

    V-jepa: Latent video prediction for visual representation learning

    Adrien Bardes, Quentin Garrido, Jean Ponce, Xinlei Chen, Michael Rabbat, Yann LeCun, Mido Assran, and Nicolas Ballas. V-jepa: Latent video prediction for visual representation learning. 2023

  13. [13]

    $\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

    Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. π0: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164, 2024

  14. [14]

    Lovr: A benchmark for long video retrieval in multimodal contexts

    Qifeng Cai, Hao Liang, Hejun Dong, Meiyi Qiang, Ruichuan An, Zhaoyang Han, Zhengzhou Zhu, Bin Cui, and Wentao Zhang. Lovr: A benchmark for long video retrieval in multimodal contexts. arXiv preprint arXiv:2505.13928, 2025

  15. [15]

    Text2sql-flow: A robust sql-aware data augmentation framework for text-to-sql

    Qifeng Cai, Hao Liang, Chang Xu, Tao Xie, Wentao Zhang, and Bin Cui. Text2sql-flow: A robust sql-aware data augmentation framework for text-to-sql. arXiv preprint arXiv:2511.10192, 2025

  16. [16]

    WorldVLA: Towards Autoregressive Action World Model

    Jun Cen, Chaohui Yu, Hangjie Yuan, Yuming Jiang, Siteng Huang, Jiayan Guo, Xin Li, Yibing Song, Hao Luo, Fan Wang, et al. Worldvla: Towards autoregressive action world model.arXiv preprint arXiv:2506.21539, 2025

  17. [17]

    Spatialvlm: Endowing vision-language models with spatial reasoning capabilities

    Boyuan Chen, Zhuo Xu, Sean Kirmani, Brian Ichter, Dorsa Sadigh, Leonidas Guibas, and Fei Xia. Spatialvlm: Endowing vision-language models with spatial reasoning capabilities. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14455–14465, 2024

  18. [18]

    Diadem: Advancing dialogue descriptions in audiovisual video captioning for multimodal large language models

    Xinlong Chen, Weihong Lin, Jingyun Hua, Linli Yao, Yue Ding, Bozhou Li, Bohan Zeng, Yang Shi, Qiang Liu, Yuanxing Zhang, et al. Diadem: Advancing dialogue descriptions in audiovisual video captioning for multimodal large language models. arXiv preprint arXiv:2601.19267, 2026

  19. [19]

    Think with 3d: Geometric imagination grounded spatial reasoning from limited views

    Zhangquan Chen, Manyuan Zhang, Xinlei Yu, Xufang Luo, Mingze Sun, Zihao Pan, Yan Feng, Peng Pei, Xunliang Cai, and Ruqi Huang. Think with 3d: Geometric imagination grounded spatial reasoning from limited views. arXiv preprint arXiv:2510.18632, 2025

  20. [20]

    Wow: Towards a world omniscient world model through embodied interaction

    Xiaowei Chi, Peidong Jia, Chun-Kai Fan, Xiaozhu Ju, Weishi Mi, Kevin Zhang, Zhiyuan Qin, Wanxin Tian, Kuangzhi Ge, Hao Li, et al. Wow: Towards a world omniscient world model through embodied interaction. arXiv preprint arXiv:2509.22642, 2025

  21. [21]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261, 2025

  22. [22]

    Cwm: An open-weights llm for research on code generation with world models

    Jade Copet, Quentin Carbonneaux, Gal Cohen, Jonas Gehring, Jacob Kahn, Jannik Kossen, Felix Kreuk, Emily McMilin, Michel Meyer, Yuxiang Wei, et al. Cwm: An open-weights llm for research on code generation with world models. arXiv preprint arXiv:2510.02387, 2025

  23. [23]

    Emu3.5: Native multimodal models are world learners

    Yufeng Cui, Honghao Chen, Haoge Deng, Xu Huang, Xinghang Li, Jirong Liu, Yang Liu, Zhuoyan Luo, Jinsheng Wang, Wenxuan Wang, et al. Emu3.5: Native multimodal models are world learners. arXiv preprint arXiv:2510.26583, 2025

  24. [24]

    Molmo and pixmo: Open weights and open data for state-of-the-art vision-language models

    Matt Deitke, Christopher Clark, Sangho Lee, Rohun Tripathi, Yue Yang, Jae Sung Park, Mohammadreza Salehi, Niklas Muennighoff, Kyle Lo, Luca Soldaini, et al. Molmo and pixmo: Open weights and open data for state-of-the-art vision-language models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 91–104, 2025

  25. [25]

    Bert: Pre-training of deep bidirectional transformers for language understanding

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers), pages 4171–4186, 2019

  26. [26]

    Understanding world or predicting future? a comprehensive survey of world models

    Jingtao Ding, Yunke Zhang, Yu Shang, Yuheng Zhang, Zefang Zong, Jie Feng, Yuan Yuan, Hongyuan Su, Nian Li, Nicholas Sukiennik, et al. Understanding world or predicting future? a comprehensive survey of world models. ACM Computing Surveys, 58(3):1–38, 2025

  27. [27]

    Omnisift: Modality-asymmetric token compression for efficient omni-modal large language models

    Yue Ding, Yiyan Ji, Jungang Li, Xuyang Liu, Xinlong Chen, Junfei Wu, Bozhou Li, Bohan Zeng, Yang Shi, Yushuo Guan, et al. Omnisift: Modality-asymmetric token compression for efficient omni-modal large language models. arXiv preprint arXiv:2602.04804, 2026

  28. [28]

    Mineru-diffusion: Rethinking document ocr as inverse rendering via diffusion decoding

    Hejun Dong, Junbo Niu, Bin Wang, Weijun Zeng, Wentao Zhang, and Conghui He. Mineru-diffusion: Rethinking document ocr as inverse rendering via diffusion decoding. arXiv preprint arXiv:2603.22458, 2026

  29. [29]

    Web world models

    Jichen Feng, Yifan Zhang, Chenggong Zhang, Yifu Lu, Shilong Liu, and Mengdi Wang. Web world models. arXiv preprint arXiv:2512.23676, 2025

  30. [30]

    A survey of world models for autonomous driving

    Tuo Feng, Wenguan Wang, and Yi Yang. A survey of world models for autonomous driving. arXiv preprint arXiv:2501.11260, 2025

  31. [31]

    Causalvqa: A physically grounded causal reasoning benchmark for video models

    Aaron Foss, Chloe Evans, Sasha Mitts, Koustuv Sinha, Ammar Rizvi, and Justine T Kao. Causalvqa: A physically grounded causal reasoning benchmark for video models. arXiv preprint arXiv:2506.09943, 2025

  32. [32]

    Embodied ai agents: Modeling the world

    Pascale Fung, Yoram Bachrach, Asli Celikyilmaz, Kamalika Chaudhuri, Delong Chen, Willy Chung, Emmanuel Dupoux, Hongyu Gong, Hervé Jégou, Alessandro Lazaric, et al. Embodied ai agents: Modeling the world. arXiv preprint arXiv:2506.22355, 2025

  33. [33]

    Spatial reasoning with vision-language models in ego-centric multi-view scenes

    Mohsen Gholami, Ahmad Rezaei, Zhou Weimin, Sitong Mao, Shunbo Zhou, Yong Zhang, and Mohammad Akbari. Spatial reasoning with vision-language models in ego-centric multi-view scenes. arXiv preprint arXiv:2509.06266, 2025

  34. [34]

    Osvi-wm: One-shot visual imitation for unseen tasks using world-model-guided trajectory generation

    Raktim Gautam Goswami, Prashanth Krishnamurthy, Yann LeCun, and Farshad Khorrami. Osvi-wm: One-shot visual imitation for unseen tasks using world-model-guided trajectory generation. arXiv preprint arXiv:2505.20425, 2025

  35. [35]

    DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence

    Daya Guo, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai Dong, Wentao Zhang, Guanting Chen, Xiao Bi, Yu Wu, YK Li, et al. Deepseek-coder: When the large language model meets programming–the rise of code intelligence. arXiv preprint arXiv:2401.14196, 2024

  36. [36]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

  37. [37]

    Seed1.5-VL Technical Report

    Dong Guo, Faming Wu, Feida Zhu, Fuxing Leng, Guang Shi, Haobin Chen, Haoqi Fan, Jian Wang, Jianyu Jiang, Jiawei Wang, et al. Seed1.5-vl technical report. arXiv preprint arXiv:2505.07062, 2025

  38. [38]

    Liveportrait: Efficient portrait animation with stitching and retargeting control

    Jianzhu Guo, Dingyun Zhang, Xiaoqiang Liu, Zhizhou Zhong, Yuan Zhang, Pengfei Wan, and Di Zhang. Liveportrait: Efficient portrait animation with stitching and retargeting control. arXiv preprint arXiv:2407.03168, 2024

  39. [39]

    Brace: A benchmark for robust audio caption quality evaluation

    Tianyu Guo, Hongyu Chen, Hao Liang, Meiyi Qiang, Bohan Zeng, Linzhuang Sun, Bin Cui, and Wentao Zhang. Brace: A benchmark for robust audio caption quality evaluation. arXiv preprint arXiv:2512.10403, 2025

  40. [40]

    World Models

    David Ha and Jürgen Schmidhuber. World models.arXiv preprint arXiv:1803.10122, 2(3), 2018

  41. [41]

    Dream to Control: Learning Behaviors by Latent Imagination

    Danijar Hafner, Timothy Lillicrap, Jimmy Ba, and Mohammad Norouzi. Dream to control: Learning behaviors by latent imagination.arXiv preprint arXiv:1912.01603, 2019

  42. [42]

    Mastering Diverse Domains through World Models

    Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse domains through world models. arXiv preprint arXiv:2301.04104, 2023

  43. [43]

    Matrix-game 2.0: An open-source real-time and streaming interactive world model

    Xianglong He, Chunli Peng, Zexiang Liu, Boyang Wang, Yifan Zhang, Qi Cui, Fei Kang, Biao Jiang, Mengyin An, Yangyang Ren, et al. Matrix-game 2.0: An open-source real-time and streaming interactive world model. arXiv preprint arXiv:2508.13009, 2025

  44. [44]

    Imagen Video: High Definition Video Generation with Diffusion Models

    Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P Kingma, Ben Poole, Mohammad Norouzi, David J Fleet, et al. Imagen video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303, 2022

  45. [45]

    GAIA-1: A Generative World Model for Autonomous Driving

    Anthony Hu, Lloyd Russell, Hudson Yeo, Zak Murez, George Fedoseev, Alex Kendall, Jamie Shotton, and Gianluca Corrado. Gaia-1: A generative world model for autonomous driving.arXiv preprint arXiv:2309.17080, 2023

  46. [46]

    Simulating the real world: A unified survey of multimodal generative models

    Yuqi Hu, Longguang Wang, Xian Liu, Ling-Hao Chen, Yuwei Guo, Yukai Shi, Ce Liu, Anyi Rao, Zeyu Wang, and Hui Xiong. Simulating the real world: A unified survey of multimodal generative models. arXiv preprint arXiv:2503.04641, 2025

  47. [47]

    Awesome-world-models

    Siqiao Huang and Awesome-World-Models Contributors. Awesome-world-models, 2025. URL https://github.com/knightnemo/Awesome-World-Models

  48. [48]

    Vid2world: Crafting video diffusion models to interactive world models

    Siqiao Huang, Jialong Wu, Qixing Zhou, Shangchen Miao, and Mingsheng Long. Vid2world: Crafting video diffusion models to interactive world models. In The Fourteenth International Conference on Learning Representations, 2026

  49. [49]

    Mobilevla-r1: Reinforcing vision-language-action for mobile robots

    Ting Huang, Dongjian Li, Rui Yang, Zeyu Zhang, Zida Yang, and Hao Tang. Mobilevla-r1: Reinforcing vision-language-action for mobile robots. arXiv preprint arXiv:2511.17889, 2025

  50. [50]

    Mllms need 3d-aware representation supervision for scene understanding

    Xiaohu Huang, Jingjing Wu, Qunyi Xie, and Kai Han. Mllms need 3d-aware representation supervision for scene understanding. arXiv e-prints, pages arXiv–2506, 2025

  51. [51]

    Live Avatar: Streaming Real-time Audio-Driven Avatar Generation with Infinite Length

    Yubo Huang, Hailong Guo, Fangtai Wu, Shifeng Zhang, Shijie Huang, Qijun Gan, Lin Liu, Sirui Zhao, Enhong Chen, Jiaming Liu, et al. Live avatar: Streaming real-time audio-driven avatar generation with infinite length. arXiv preprint arXiv:2512.04677, 2025

  52. [52]

    Hunyuan3d 2.1: From images to high-fidelity 3d assets with production-ready pbr material

    Team Hunyuan3D, Shuhui Yang, Mingxin Yang, Yifei Feng, Xin Huang, Sheng Zhang, Zebin He, Di Luo, Haolin Liu, Yunfei Zhao, et al. Hunyuan3d 2.1: From images to high-fidelity 3d assets with production-ready pbr material. arXiv preprint arXiv:2506.15442, 2025

  53. [53]

    $\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization

    Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, et al. π0.5: a vision-language-action model with open-world generalization. arXiv preprint arXiv:2504.16054, 2025

  54. [54]

    Lvsm: A large view synthesis model with minimal 3d inductive bias

    Haian Jin, Hanwen Jiang, Hao Tan, Kai Zhang, Sai Bi, Tianyuan Zhang, Fujun Luan, Noah Snavely, and Zexiang Xu. Lvsm: A large view synthesis model with minimal 3d inductive bias. arXiv preprint arXiv:2410.17242, 2024

  55. [55]

    MapAnything: Universal Feed-Forward Metric 3D Reconstruction

    Nikhil Keetha, Norman Müller, Johannes Schönberger, Lorenzo Porzi, Yuchen Zhang, Tobias Fischer, Arno Knapitsch, Duncan Zauss, Ethan Weber, Nelson Antunes, et al. Mapanything: Universal feed-forward metric 3d reconstruction. arXiv preprint arXiv:2509.13414, 2025

  56. [56]

    Causal reasoning and large language models: Opening a new frontier for causality

    Emre Kiciman, Robert Ness, Amit Sharma, and Chenhao Tan. Causal reasoning and large language models: Opening a new frontier for causality. Transactions on Machine Learning Research, 2023

  57. [57]

    AI2-THOR: An Interactive 3D Environment for Visual AI

    Eric Kolve, Roozbeh Mottaghi, Winson Han, Eli VanderBilt, Luca Weihs, Alvaro Herrasti, Matt Deitke, Kiana Ehsani, Daniel Gordon, Yuke Zhu, et al. Ai2-thor: An interactive 3d environment for visual ai.arXiv preprint arXiv:1712.05474, 2017

  58. [58]

    3d and 4d world modeling: A survey

    Lingdong Kong, Wesley Yang, Jianbiao Mei, Youquan Liu, Ao Liang, Dekai Zhu, Dongyue Lu, Wei Yin, Xiaotao Hu, Mingkai Jia, et al. 3d and 4d world modeling: A survey. arXiv preprint arXiv:2509.07996, 2025

  59. [59]

    A path towards autonomous machine intelligence version 0.9

    Yann LeCun et al. A path towards autonomous machine intelligence version 0.9.2, 2022-06-27. Open Review, 62(1):1–62, 2022

  60. [60]

    Omninwm: Omniscient driving navigation world models

    Bohan Li, Zhuang Ma, Dalong Du, Baorui Peng, Zhujin Liang, Zhenqiang Liu, Chao Ma, Yueming Jin, Hao Zhao, Wenjun Zeng, et al. Omninwm: Omniscient driving navigation world models. arXiv preprint arXiv:2510.18313, 2025

  61. [61]

    Semantic routing: Exploring multi-layer llm feature weighting for diffusion transformers

    Bozhou Li, Yushuo Guan, Haolin Li, Bohan Zeng, Yiyan Ji, Yue Ding, Pengfei Wan, Kun Gai, Yuanxing Zhang, and Wentao Zhang. Semantic routing: Exploring multi-layer llm feature weighting for diffusion transformers. arXiv preprint arXiv:2602.03510, 2026

  62. [62]

    Momagen: Generating demonstrations under soft and hard constraints for multi-step bimanual mobile manipulation

    Chengshu Li, Mengdi Xu, Arpit Bahety, Hang Yin, Yunfan Jiang, Huang Huang, Josiah Wong, Sujay Garlanka, Cem Gokmen, Ruohan Zhang, et al. Momagen: Generating demonstrations under soft and hard constraints for multi-step bimanual mobile manipulation. arXiv preprint arXiv:2510.18316, 2025

  63. [63]

    DA$^2$: Depth anything in any direction

    Haodong Li, Wangguangdong Zheng, Jing He, Yuhao Liu, Xin Lin, Xin Yang, Ying-Cong Chen, and Chunchao Guo. DA$^2$: Depth anything in any direction. arXiv preprint arXiv:2509.26618, 2025

  64. [64]

    Spatialladder: Progressive training for spatial reasoning in vision-language models

    Hongxing Li, Dingming Li, Zixuan Wang, Yuchen Yan, Hang Wu, Wenqi Zhang, Yongliang Shen, Weiming Lu, Jun Xiao, and Yueting Zhuang. Spatialladder: Progressive training for spatial reasoning in vision-language models. arXiv preprint arXiv:2510.08531, 2025

  65. [65]

    Hunyuan-gamecraft: High-dynamic interactive game video generation with hybrid history condition

    Jiaqi Li, Junshu Tang, Zhiyong Xu, Longhuang Wu, Yuan Zhou, Shuai Shao, Tianbao Yu, Zhiguo Cao, and Qinglin Lu. Hunyuan-gamecraft: High-dynamic interactive game video generation with hybrid history condition. arXiv preprint arXiv:2506.17201, 2025

  66. [66]

    Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International conference on machine learning, pages 19730–19742, 2023

  67. [67]

    Causal World Modeling for Robot Control

    Lin Li, Qihang Zhang, Yiming Luo, Shuai Yang, Ruilin Wang, Fei Han, Mingrui Yu, Zelin Gao, Nan Xue, Xing Zhu, et al. Causal world modeling for robot control.arXiv preprint arXiv:2601.21998, 2026

  68. [68]

    Worldgrow: Generating infinite 3d world

    Sikuang Li, Chen Yang, Jiemin Fang, Taoran Yi, Jia Lu, Jiazhong Cen, Lingxi Xie, Wei Shen, and Qi Tian. Worldgrow: Generating infinite 3d world. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 6433–6441, 2026

  69. [69]

    Drivingdiffusion: Layout-guided multi-view driving scenarios video generation with latent diffusion model

    Xiaofan Li, Yifu Zhang, and Xiaoqing Ye. Drivingdiffusion: Layout-guided multi-view driving scenarios video generation with latent diffusion model. In European Conference on Computer Vision, pages 469–485. Springer, 2024

  70. [70]

    A comprehensive survey on world models for embodied AI

    Xinqing Li, Xin He, Le Zhang, Min Wu, Xiaoli Li, and Yun Liu. A comprehensive survey on world models for embodied ai. arXiv preprint arXiv:2510.16732, 2025

  71. [71]

    Flashworld: High-quality 3d scene generation within seconds

    Xinyang Li, Tengfei Wang, Zixiao Gu, Shengchuan Zhang, Chunchao Guo, and Liujuan Cao. Flashworld: High-quality 3d scene generation within seconds. arXiv preprint arXiv:2510.13678, 2025

  72. [72]

    Evqascore: Efficient video question answering data evaluation

    Hao Liang, Zirong Chen, and Wentao Zhang. Evqascore: Efficient video question answering data evaluation. arXiv preprint arXiv:2411.06908, 2024

  73. [73]

    Keyvideollm: Towards large-scale video keyframe selection

    Hao Liang, Jiapeng Li, Tianyi Bai, Xijie Huang, Linzhuang Sun, Zhengren Wang, Conghui He, Bin Cui, Chong Chen, and Wentao Zhang. Keyvideollm: Towards large-scale video keyframe selection. arXiv preprint arXiv:2407.03104, 2024

  74. [74]

    Synth-empathy: Towards high-quality synthetic empathy data

    Hao Liang, Linzhuang Sun, Jingxuan Wei, Xijie Huang, Linkun Sun, Bihui Yu, Conghui He, and Wentao Zhang. Synth-empathy: Towards high-quality synthetic empathy data. arXiv preprint arXiv:2407.21669, 2024

  75. [75]

    Dataflow: An llm-driven framework for unified data preparation and workflow automation in the era of data-centric ai

    Hao Liang, Xiaochen Ma, Zhou Liu, Zhen Hao Wong, Zhengyang Zhao, Zimo Meng, Runming He, Chengyu Shen, Qifeng Cai, Zhaoyang Han, et al. Dataflow: An llm-driven framework for unified data preparation and workflow automation in the era of data-centric ai. arXiv preprint arXiv:2512.16676, 2025

  76. [76]

    Mathclean: A benchmark for synthetic mathematical data cleaning

    Hao Liang, Meiyi Qiang, Yuying Li, Zefeng He, Yongzhen Guo, Zhengzhou Zhu, Wentao Zhang, and Bin Cui. Mathclean: A benchmark for synthetic mathematical data cleaning. arXiv preprint arXiv:2502.19058, 2025

  77. [77]

    Multimodal reasoning for science: Technical report and 1st place solution to the icml 2025 seephys challenge

    Hao Liang, Ruitao Wu, Bohan Zeng, Junbo Niu, Wentao Zhang, and Bin Dong. Multimodal reasoning for science: Technical report and 1st place solution to the icml 2025 seephys challenge. arXiv preprint arXiv:2509.06079, 2025

  78. [78]

    Data preparation for large language models

    Hao Liang, Zhen Hao Wong, Ruitong Liu, Yuhan Wang, Meiyi Qiang, Zhengyang Zhao, Chengyu Shen, Conghui He, Wentao Zhang, and Bin Cui. Data preparation for large language models. Journal of Computer Science and Technology, 2026. doi: 10.1007/s11390-026-5948-8

  79. [79]

    Towards next-generation llm training: From the data-centric perspective

    Hao Liang, Zhengyang Zhao, Zhaoyang Han, Meiyi Qiang, Xiaochen Ma, Bohan Zeng, Qifeng Cai, Zhiyu Li, Linpeng Tang, Wentao Zhang, et al. Towards next-generation llm training: From the data-centric perspective. arXiv preprint arXiv:2603.14712, 2026

  80. [80]

    Dataflex: A unified framework for data-centric dynamic training of large language models

    Hao Liang, Zhengyang Zhao, Meiyi Qiang, Mingrui Chen, Lu Ma, Rongyi Yu, Hengyi Feng, Shixuan Sun, Zimo Meng, Xiaochen Ma, et al. Dataflex: A unified framework for data-centric dynamic training of large language models. arXiv preprint arXiv:2603.26164, 2026

Showing first 80 references.