pith. machine review for the scientific record.

arxiv: 2604.04707 · v1 · submitted 2026-04-06 · 💻 cs.CV

Recognition: no theorem link

OpenWorldLib: A Unified Codebase and Definition of Advanced World Models

Bohan Zeng, Bozhou Li, Chengzhuo Tong, Daili Hua, DataFlow Team, Hao Liang, Hongcheng Gao, Huanyao Zhang, Jialong Wu, Jianbin Zhao, Junbo Niu, Kaixin Zhu, Meiyi Qiang, Mike Zheng Shou, Mingkun Chang, Minglei Shi, Pengfei Wan, Qinhan Yu, Ruichuan An, Runhao Zhao, Tianyi Bai, Tianyu Guo, Wentao Zhang, Xiaochen Ma, Xinlong Chen, Xintao Wang, Xinyi Huang, Yang Shi, Yifan Dai, Yifan Yang, Yiren Song, Yisheng Pan, Yiwen Tang, Yuanxing Zhang, Yue Ding, Yuran Wang, Zekun Wang, Zhengpin Li, Zhiyou Xiao, Zhou Liu, Zimo Meng

Pith reviewed 2026-05-10 19:32 UTC · model grok-4.3

classification 💻 cs.CV
keywords world models · unified framework · perception · interaction · long-term memory · codebase · inference · AI definition

The pith

A unified definition and codebase position world models as perception-centered systems equipped with interaction and long-term memory to understand and predict complex environments.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a standardized definition of advanced world models and releases OpenWorldLib as an integrated inference framework. It centers the definition on perception while requiring capabilities for interaction and memory to handle real-world complexity. By categorizing essential capabilities and merging models from separate tasks into one codebase, the work aims to support reuse and joint operation. A sympathetic reader would care because this approach could reduce duplication in AI research and allow models to share components when simulating dynamic scenes or environments.

Core claim

The paper defines a world model as a model or framework centered on perception, equipped with interaction and long-term memory capabilities, for understanding and predicting the complex world. It presents OpenWorldLib as the unified codebase that integrates models across tasks to enable efficient reuse and collaborative inference, while also providing a systematic categorization of required capabilities and reflections on future research directions.

What carries the argument

OpenWorldLib, the unified inference framework that merges perception-centered models with interaction and memory modules to support cross-task reuse and joint operation.
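
To make the shape of that machinery concrete, here is a minimal sketch of how a perception-centered model could be composed with interaction and long-term memory modules. The names (WorldModel, Perception.encode, Interaction.step, the memory's write/read) are illustrative assumptions made for this review, not interfaces taken from the OpenWorldLib codebase.

```python
# Illustrative sketch only: these protocols and classes are assumptions made for
# this review, not OpenWorldLib's actual API.
from dataclasses import dataclass, field
from typing import Any, List, Protocol


class Perception(Protocol):
    def encode(self, observation: Any) -> Any: ...      # raw observation -> latent state


class Interaction(Protocol):
    def step(self, state: Any, action: Any) -> Any: ...  # latent state + action -> next state


@dataclass
class ListMemory:
    """Trivial long-term memory: append every latent state and return the history."""
    states: List[Any] = field(default_factory=list)

    def write(self, state: Any) -> None:
        self.states.append(state)

    def read(self) -> List[Any]:
        return list(self.states)


@dataclass
class WorldModel:
    """Perception-centered composition: perceive, remember, then roll out under actions."""
    perception: Perception
    interaction: Interaction
    memory: ListMemory

    def observe(self, observation: Any) -> Any:
        state = self.perception.encode(observation)
        self.memory.write(state)
        return state

    def rollout(self, state: Any, actions: List[Any]) -> List[Any]:
        trajectory = []
        for action in actions:
            state = self.interaction.step(state, action)  # predict the next latent state
            self.memory.write(state)                      # keep it for long-horizon context
            trajectory.append(state)
        return trajectory
```

Any task-specific model that can expose an encode and a step in this shape could, in principle, be dropped into the same loop, which is the reuse property the paper claims.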

If this is right

  • Models developed for one world-model task become directly usable in others without major rewrites.
  • Perception, interaction, and memory components can operate together during a single inference pass.
  • Capability categorization provides a shared checklist for comparing and extending existing models.
  • Future extensions can add new modules while staying compatible with the existing structure.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The emphasis on long-term memory could shift design priorities toward architectures that maintain state over extended sequences rather than short-term predictions alone.
  • A shared codebase might surface hidden commonalities between vision-only and action-conditioned world models that separate implementations obscure.
  • Testing the framework on embodied robotics benchmarks could reveal whether the perception-first definition scales when real sensor noise and physical constraints are present.

Load-bearing premise

The proposed definition and unified framework will allow efficient reuse and collaborative inference across tasks without creating incompatibilities or reducing performance.

What would settle it

Implementing separate task models inside OpenWorldLib and measuring whether combined inference is slower or less accurate than running each model individually would settle it: measurable losses would falsify the claim.
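
One way to run that test, sketched under assumptions rather than taken from the paper: time each task head on its own (re-encoding the observation each time), then time a single combined pass that shares one perception encoding across all heads, and check task accuracy the same way. The harness below is hypothetical; encode and the task heads stand in for whatever models OpenWorldLib actually wraps.

```python
# Hypothetical falsification harness; not the paper's evaluation code.
import time
from typing import Any, Callable, Dict, List


def avg_seconds(fn: Callable[[], Any], repeats: int = 10) -> float:
    start = time.perf_counter()
    for _ in range(repeats):
        fn()
    return (time.perf_counter() - start) / repeats


def compare_inference(encode: Callable[[Any], Any],
                      heads: List[Callable[[Any], Any]],
                      observation: Any) -> Dict[str, float]:
    # Separate runs: each task re-encodes the observation before its own head.
    separate = sum(avg_seconds(lambda h=h: h(encode(observation))) for h in heads)

    # Combined run: encode once, then fan the shared state out to every task head.
    def combined_pass() -> List[Any]:
        state = encode(observation)
        return [head(state) for head in heads]

    combined = avg_seconds(combined_pass)
    return {
        "separate_total_s": separate,
        "combined_s": combined,
        "speedup": separate / combined if combined > 0 else float("inf"),
    }
```

A speedup below 1.0, or any drop in per-task accuracy relative to the standalone runs, would be the measurable loss that falsifies the premise.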

read the original abstract

World models have garnered significant attention as a promising research direction in artificial intelligence, yet a clear and unified definition remains lacking. In this paper, we introduce OpenWorldLib, a comprehensive and standardized inference framework for Advanced World Models. Drawing on the evolution of world models, we propose a clear definition: a world model is a model or framework centered on perception, equipped with interaction and long-term memory capabilities, for understanding and predicting the complex world. We further systematically categorize the essential capabilities of world models. Based on this definition, OpenWorldLib integrates models across different tasks within a unified framework, enabling efficient reuse and collaborative inference. Finally, we present additional reflections and analyses on potential future directions for world model research. Code link: https://github.com/OpenDCAI/OpenWorldLib

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 0 minor

Summary. The manuscript introduces OpenWorldLib, a unified codebase and inference framework for advanced world models. It proposes a definition of a world model as a perception-centered model or framework equipped with interaction and long-term memory capabilities for understanding and predicting the complex world. It systematically categorizes essential capabilities, integrates models across different tasks in a unified framework to enable efficient reuse and collaborative inference, and offers reflections on future research directions.

Significance. If the integration claim holds and OpenWorldLib successfully enables reuse and collaborative inference without introducing incompatibilities or performance losses, the work could help standardize terminology and infrastructure in the growing area of world models, facilitating community collaboration through an open codebase and capability categorization.

major comments (1)
  1. [Abstract] The central claim that OpenWorldLib 'integrates models across different tasks within a unified framework, enabling efficient reuse and collaborative inference' is load-bearing for the paper's value as a unified framework, yet the manuscript provides no interface specifications, adapter details, overhead measurements, cross-task performance retention numbers, or pseudocode for the unified inference path to support it.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive review and for highlighting the need to better substantiate the integration claims. We have revised the manuscript to address this by expanding the framework description with the requested details.

read point-by-point responses
  1. Referee: [Abstract] The central claim that OpenWorldLib 'integrates models across different tasks within a unified framework, enabling efficient reuse and collaborative inference' is load-bearing for the paper's value as a unified framework, yet the manuscript provides no interface specifications, adapter details, overhead measurements, cross-task performance retention numbers, or pseudocode for the unified inference path to support it.

    Authors: We agree that the abstract claim would be strengthened by explicit supporting material in the text. The original manuscript provided a high-level overview of the unified framework and pointed to the open codebase for implementation details. In the revised version, we have added a dedicated subsection on the integration architecture that specifies the core interfaces, describes the adapter mechanisms for task-specific models, and includes pseudocode for the collaborative inference pipeline. We have also incorporated empirical results from our evaluations showing low computational overhead and high cross-task performance retention, confirming that the unification does not introduce incompatibilities or significant losses. These additions appear in the updated Sections 3 and 4. Revision: yes.
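
For readers without access to the revised sections, the adapter-and-pipeline pattern the rebuttal describes usually looks something like the sketch below. This is an illustration of the general mechanism, assumed for this review; the names and signatures are not taken from the updated manuscript or the OpenWorldLib code.

```python
# Generic adapter/pipeline pattern, assumed for illustration; not OpenWorldLib's code.
from typing import Any, Callable, Dict, List


class TaskAdapter:
    """Wrap a heterogeneous task model so the pipeline can call it uniformly."""

    def __init__(self, name: str, model: Callable[[Any], Any],
                 to_inputs: Callable[[Any], Any],
                 from_outputs: Callable[[Any], Any]) -> None:
        self.name = name
        self.model = model
        self.to_inputs = to_inputs        # shared state -> model-specific inputs
        self.from_outputs = from_outputs  # model-specific outputs -> shared format

    def run(self, shared_state: Any) -> Any:
        return self.from_outputs(self.model(self.to_inputs(shared_state)))


def collaborative_inference(shared_state: Any,
                            adapters: List[TaskAdapter]) -> Dict[str, Any]:
    """One pass over a shared state, collecting every task's output by adapter name."""
    return {adapter.name: adapter.run(shared_state) for adapter in adapters}
```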

Circularity Check

0 steps flagged

No circularity: definition proposed directly and framework presented as engineering integration

full rationale

The paper states a definition of world models drawn from field evolution and describes OpenWorldLib as a codebase that integrates models based on that definition. No mathematical derivation chain, equations, fitted parameters, predictions, or self-citations are used to justify core claims. The integration assertion is a design statement rather than a result that reduces to its own inputs by construction. This matches the expected non-circular outcome for a definitional and engineering paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper rests on the assumption that a single definition can unify diverse world-model approaches and that a shared codebase will improve reuse; no free parameters, new entities, or non-standard axioms are introduced.

pith-pipeline@v0.9.0 · 5592 in / 1045 out tokens · 51453 ms · 2026-05-10T19:32:15.301056+00:00 · methodology

discussion (0)


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. TorchUMM: A Unified Multimodal Model Codebase for Evaluation, Analysis, and Post-training

    cs.AI · 2026-04 · unverdicted · novelty 4.0

    TorchUMM is the first unified codebase and benchmark suite for standardized evaluation of diverse unified multimodal models on understanding, generation, and editing tasks.

Reference graph

Works this paper leans on

168 extracted references · 132 canonical work pages · cited by 1 Pith paper · 35 internal anchors

  1. [1]

    Cosmos World Foundation Model Platform for Physical AI

    Niket Agarwal, Arslan Ali, Maciej Bala, Yogesh Balaji, Erik Barker, Tiffany Cai, Prithvijit Chattopadhyay, Yongxin Chen, Yin Cui, Yifan Ding, et al. Cosmos world foundation model platform for physical ai.arXiv preprint arXiv:2501.03575, 2025

  2. [2]

    World Simulation with Video Foundation Models for Physical AI

    Arslan Ali, Junjie Bai, Maciej Bala, Yogesh Balaji, Aaron Blakeman, Tiffany Cai, Jiaxin Cao, Tianshi Cao, Elizabeth Cha, Yu-Wei Chao, et al. World simulation with video foundation models for physical ai.arXiv preprint arXiv:2511.00062, 2025

  3. [3]

    UniCTokens: Boosting personalized understanding and generation via unified concept tokens

    Ruichuan An, Sihan Yang, Renrui Zhang, Zijun Shen, Ming Lu, Gaole Dai, Hao Liang, Ziyu Guo, Shilin Yan, Yulin Luo, et al. Unictokens: Boosting personalized understanding and generation via unified concept tokens. arXiv preprint arXiv:2505.14671, 2025

  4. [4]

    Genius: Generative fluid intelligence evaluation suite

    Ruichuan An, Sihan Yang, Ziyu Guo, Wei Dai, Zijun Shen, Haodong Li, Renrui Zhang, Xinyu Wei, Guopeng Li, Wenshan Wu, et al. Genius: Generative fluid intelligence evaluation suite. arXiv preprint arXiv:2602.11144, 2026

  5. [5]

    V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

    Mido Assran, Adrien Bardes, David Fan, Quentin Garrido, Russell Howes, Matthew Muckley, Ammar Rizvi, Claire Roberts, Koustuv Sinha, Artem Zholus, et al. V-jepa 2: Self-supervised video models enable understanding, prediction and planning.arXiv preprint arXiv:2506.09985, 2025

  6. [6]

    Recammaster: Camera-controlled generative rendering from a single video

    Jianhong Bai, Menghan Xia, Xiao Fu, Xintao Wang, Lianrui Mu, Jinwen Cao, Zuozhu Liu, Haoji Hu, Xiang Bai, Pengfei Wan, et al. Recammaster: Camera-controlled generative rendering from a single video. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 14834–14844, 2025

  7. [7]

    Qwen Technical Report

    Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report.arXiv preprint arXiv:2309.16609, 2023

  8. [8]

    A survey of multimodal large language model from a data-centric perspective

    Tianyi Bai, Hao Liang, Binwang Wan, Yanran Xu, Xi Li, Shiyu Li, Ling Yang, Bozhou Li, Yifan Wang, Bin Cui, et al. A survey of multimodal large language model from a data-centric perspective. arXiv preprint arXiv:2405.16640, 2024

  9. [9]

    Multi-step visual reasoning with visual tokens scaling and verification

    Tianyi Bai, Zengjie Hu, Fupeng Sun, Jiantao Qiu, Yizhen Jiang, Guangxin He, Bohan Zeng, Conghui He, Binhang Yuan, and Wentao Zhang. Multi-step visual reasoning with visual tokens scaling and verification. arXiv preprint arXiv:2506.07235, 2025

  10. [10]

    Positional encoding field

    Yunpeng Bai, Haoxiang Li, and Qixing Huang. Positional encoding field. arXiv preprint arXiv:2510.20385, 2025

  11. [11]

    The safety challenge of world models for embodied ai agents: a review

    Lorenzo Baraldi, Zifan Zeng, Chongzhe Zhang, Aradhana Nayak, Hongbo Zhu, Feng Liu, Qunli Zhang, Peng Wang, Shiming Liu, Zheng Hu, et al. The safety challenge of world models for embodied ai agents: a review. arXiv preprint arXiv:2510.05865, 2025

  12. [12]

    V-jepa: Latent video prediction for visual representation learning

    Adrien Bardes, Quentin Garrido, Jean Ponce, Xinlei Chen, Michael Rabbat, Yann LeCun, Mido Assran, and Nicolas Ballas. V-jepa: Latent video prediction for visual representation learning. 2023

  13. [13]

    $\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

    Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. π0: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164, 2024

  14. [14]

    Lovr: A benchmark for long video retrieval in multimodal contexts

    Qifeng Cai, Hao Liang, Hejun Dong, Meiyi Qiang, Ruichuan An, Zhaoyang Han, Zhengzhou Zhu, Bin Cui, and Wentao Zhang. Lovr: A benchmark for long video retrieval in multimodal contexts. arXiv preprint arXiv:2505.13928, 2025

  15. [15]

    Text2sql-flow: A robust sql-aware data augmentation framework for text-to-sql

    Qifeng Cai, Hao Liang, Chang Xu, Tao Xie, Wentao Zhang, and Bin Cui. Text2sql-flow: A robust sql-aware data augmentation framework for text-to-sql. arXiv preprint arXiv:2511.10192, 2025

  16. [16]

    WorldVLA: Towards Autoregressive Action World Model

    Jun Cen, Chaohui Yu, Hangjie Yuan, Yuming Jiang, Siteng Huang, Jiayan Guo, Xin Li, Yibing Song, Hao Luo, Fan Wang, et al. Worldvla: Towards autoregressive action world model.arXiv preprint arXiv:2506.21539, 2025

  17. [17]

    Spatialvlm: Endowing vision-language models with spatial reasoning capabilities

    Boyuan Chen, Zhuo Xu, Sean Kirmani, Brian Ichter, Dorsa Sadigh, Leonidas Guibas, and Fei Xia. Spatialvlm: Endowing vision-language models with spatial reasoning capabilities. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14455–14465, 2024

  18. [18]

    Diadem: Advancing dialogue descriptions in audiovisual video captioning for multimodal large language models

    Xinlong Chen, Weihong Lin, Jingyun Hua, Linli Yao, Yue Ding, Bozhou Li, Bohan Zeng, Yang Shi, Qiang Liu, Yuanxing Zhang, et al. Diadem: Advancing dialogue descriptions in audiovisual video captioning for multimodal large language models. arXiv preprint arXiv:2601.19267, 2026

  19. [19]

    Think with 3d: Geometric imagination grounded spatial reasoning from limited views

    Zhangquan Chen, Manyuan Zhang, Xinlei Yu, Xufang Luo, Mingze Sun, Zihao Pan, Yan Feng, Peng Pei, Xunliang Cai, and Ruqi Huang. Think with 3d: Geometric imagination grounded spatial reasoning from limited views. arXiv preprint arXiv:2510.18632, 2025

  20. [20]

    Wow: Towards a world omniscient world model through embodied interaction

    Xiaowei Chi, Peidong Jia, Chun-Kai Fan, Xiaozhu Ju, Weishi Mi, Kevin Zhang, Zhiyuan Qin, Wanxin Tian, Kuangzhi Ge, Hao Li, et al. Wow: Towards a world omniscient world model through embodied interaction. arXiv preprint arXiv:2509.22642, 2025

  21. [21]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261, 2025

  22. [22]

    Cwm: An open-weights llm for research on code generation with world models

    Jade Copet, Quentin Carbonneaux, Gal Cohen, Jonas Gehring, Jacob Kahn, Jannik Kossen, Felix Kreuk, Emily McMilin, Michel Meyer, Yuxiang Wei, et al. Cwm: An open-weights llm for research on code generation with world models. arXiv preprint arXiv:2510.02387, 2025

  23. [23]

    Emu3.5: Native multimodal models are world learners

    Yufeng Cui, Honghao Chen, Haoge Deng, Xu Huang, Xinghang Li, Jirong Liu, Yang Liu, Zhuoyan Luo, Jinsheng Wang, Wenxuan Wang, et al. Emu3.5: Native multimodal models are world learners. arXiv preprint arXiv:2510.26583, 2025

  24. [24]

    Molmo and pixmo: Open weights and open data for state-of-the-art vision-language models

    Matt Deitke, Christopher Clark, Sangho Lee, Rohun Tripathi, Yue Yang, Jae Sung Park, Mohammadreza Salehi, Niklas Muennighoff, Kyle Lo, Luca Soldaini, et al. Molmo and pixmo: Open weights and open data for state-of-the-art vision-language models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 91–104, 2025

  25. [25]

    Bert: Pre-training of deep bidirectional transformers for language understanding

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers), pages 4171–4186, 2019

  26. [26]

    Understanding world or predicting future? a comprehensive survey of world models

    Jingtao Ding, Yunke Zhang, Yu Shang, Yuheng Zhang, Zefang Zong, Jie Feng, Yuan Yuan, Hongyuan Su, Nian Li, Nicholas Sukiennik, et al. Understanding world or predicting future? a comprehensive survey of world models. ACM Computing Surveys, 58(3):1–38, 2025

  27. [27]

    Omnisift: Modality-asymmetric token compression for efficient omni-modal large language models

    Yue Ding, Yiyan Ji, Jungang Li, Xuyang Liu, Xinlong Chen, Junfei Wu, Bozhou Li, Bohan Zeng, Yang Shi, Yushuo Guan, et al. Omnisift: Modality-asymmetric token compression for efficient omni-modal large language models. arXiv preprint arXiv:2602.04804, 2026

  28. [28]

    Mineru-diffusion: Rethinking document ocr as inverse rendering via diffusion decoding

    Hejun Dong, Junbo Niu, Bin Wang, Weijun Zeng, Wentao Zhang, and Conghui He. Mineru-diffusion: Rethinking document ocr as inverse rendering via diffusion decoding. arXiv preprint arXiv:2603.22458, 2026

  29. [29]

    Web world models

    Jichen Feng, Yifan Zhang, Chenggong Zhang, Yifu Lu, Shilong Liu, and Mengdi Wang. Web world models. arXiv preprint arXiv:2512.23676, 2025

  30. [30]

    A survey of world models for autonomous driving

    Tuo Feng, Wenguan Wang, and Yi Yang. A survey of world models for autonomous driving. arXiv preprint arXiv:2501.11260, 2025

  31. [31]

    Causalvqa: A physically grounded causal reasoning benchmark for video models

    Aaron Foss, Chloe Evans, Sasha Mitts, Koustuv Sinha, Ammar Rizvi, and Justine T Kao. Causalvqa: A physically grounded causal reasoning benchmark for video models. arXiv preprint arXiv:2506.09943, 2025

  32. [32]

    Embodied ai agents: Modeling the world

    Pascale Fung, Yoram Bachrach, Asli Celikyilmaz, Kamalika Chaudhuri, Delong Chen, Willy Chung, Emmanuel Dupoux, Hongyu Gong, Hervé Jégou, Alessandro Lazaric, et al. Embodied ai agents: Modeling the world. arXiv preprint arXiv:2506.22355, 2025

  33. [33]

    Spatial reasoning with vision-language models in ego-centric multi-view scenes

    Mohsen Gholami, Ahmad Rezaei, Zhou Weimin, Sitong Mao, Shunbo Zhou, Yong Zhang, and Mohammad Akbari. Spatial reasoning with vision-language models in ego-centric multi-view scenes. arXiv preprint arXiv:2509.06266, 2025

  34. [34]

    Osvi-wm: One-shot visual imitation for unseen tasks using world-model-guided trajectory generation

    Raktim Gautam Goswami, Prashanth Krishnamurthy, Yann LeCun, and Farshad Khorrami. Osvi-wm: One-shot visual imitation for unseen tasks using world-model-guided trajectory generation. arXiv preprint arXiv:2505.20425, 2025

  35. [35]

    DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence

    Daya Guo, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai Dong, Wentao Zhang, Guanting Chen, Xiao Bi, Yu Wu, YK Li, et al. Deepseek-coder: When the large language model meets programming–the rise of code intelligence. arXiv preprint arXiv:2401.14196, 2024

  36. [36]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

  37. [37]

    Seed1.5-VL Technical Report

    Dong Guo, Faming Wu, Feida Zhu, Fuxing Leng, Guang Shi, Haobin Chen, Haoqi Fan, Jian Wang, Jianyu Jiang, Jiawei Wang, et al. Seed1.5-vl technical report. arXiv preprint arXiv:2505.07062, 2025

  38. [38]

    Liveportrait: Efficient portrait animation with stitching and retargeting control

    Jianzhu Guo, Dingyun Zhang, Xiaoqiang Liu, Zhizhou Zhong, Yuan Zhang, Pengfei Wan, and Di Zhang. Liveportrait: Efficient portrait animation with stitching and retargeting control. arXiv preprint arXiv:2407.03168, 2024

  39. [39]

    Brace: A benchmark for robust audio caption quality evaluation

    Tianyu Guo, Hongyu Chen, Hao Liang, Meiyi Qiang, Bohan Zeng, Linzhuang Sun, Bin Cui, and Wentao Zhang. Brace: A benchmark for robust audio caption quality evaluation. arXiv preprint arXiv:2512.10403, 2025

  40. [40]

    World Models

    David Ha and Jürgen Schmidhuber. World models.arXiv preprint arXiv:1803.10122, 2(3), 2018

  41. [41]

    Dream to Control: Learning Behaviors by Latent Imagination

    Danijar Hafner, Timothy Lillicrap, Jimmy Ba, and Mohammad Norouzi. Dream to control: Learning behaviors by latent imagination.arXiv preprint arXiv:1912.01603, 2019

  42. [42]

    Mastering Diverse Domains through World Models

    Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse domains through world models. arXiv preprint arXiv:2301.04104, 2023

  43. [43]

    Matrix-game 2.0: An open-source real-time and streaming interactive world model

    Xianglong He, Chunli Peng, Zexiang Liu, Boyang Wang, Yifan Zhang, Qi Cui, Fei Kang, Biao Jiang, Mengyin An, Yangyang Ren, et al. Matrix-game 2.0: An open-source real-time and streaming interactive world model. arXiv preprint arXiv:2508.13009, 2025

  44. [44]

    Imagen Video: High Definition Video Generation with Diffusion Models

    Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P Kingma, Ben Poole, Mohammad Norouzi, David J Fleet, et al. Imagen video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303, 2022

  45. [45]

    GAIA-1: A Generative World Model for Autonomous Driving

    Anthony Hu, Lloyd Russell, Hudson Yeo, Zak Murez, George Fedoseev, Alex Kendall, Jamie Shotton, and Gianluca Corrado. Gaia-1: A generative world model for autonomous driving.arXiv preprint arXiv:2309.17080, 2023

  46. [46]

    Simulating the real world: A unified survey of multimodal generative models

    Yuqi Hu, Longguang Wang, Xian Liu, Ling-Hao Chen, Yuwei Guo, Yukai Shi, Ce Liu, Anyi Rao, Zeyu Wang, and Hui Xiong. Simulating the real world: A unified survey of multimodal generative models. arXiv preprint arXiv:2503.04641, 2025

  47. [47]

    Awesome-world-models

    Siqiao Huang and Awesome-World-Models Contributors. Awesome-world-models, 2025. URL https://github.com/knightnemo/Awesome-World-Models

  48. [48]

    Vid2world: Crafting video diffusion models to interactive world models

    Siqiao Huang, Jialong Wu, Qixing Zhou, Shangchen Miao, and Mingsheng Long. Vid2world: Crafting video diffusion models to interactive world models. In The Fourteenth International Conference on Learning Representations, 2026

  49. [49]

    Mobilevla-r1: Reinforcing vision-language-action for mobile robots

    Ting Huang, Dongjian Li, Rui Yang, Zeyu Zhang, Zida Yang, and Hao Tang. Mobilevla-r1: Reinforcing vision-language-action for mobile robots. arXiv preprint arXiv:2511.17889, 2025

  50. [50]

    Mllms need 3d-aware representation supervision for scene understanding

    Xiaohu Huang, Jingjing Wu, Qunyi Xie, and Kai Han. Mllms need 3d-aware representation supervision for scene understanding. arXiv e-prints, pages arXiv–2506, 2025

  51. [51]

    Live Avatar: Streaming Real-time Audio-Driven Avatar Generation with Infinite Length

    Yubo Huang, Hailong Guo, Fangtai Wu, Shifeng Zhang, Shijie Huang, Qijun Gan, Lin Liu, Sirui Zhao, Enhong Chen, Jiaming Liu, et al. Live avatar: Streaming real-time audio-driven avatar generation with infinite length. arXiv preprint arXiv:2512.04677, 2025

  52. [52]

    Hunyuan3d 2.1: From images to high-fidelity 3d assets with production-ready pbr material

    Team Hunyuan3D, Shuhui Yang, Mingxin Yang, Yifei Feng, Xin Huang, Sheng Zhang, Zebin He, Di Luo, Haolin Liu, Yunfei Zhao, et al. Hunyuan3d 2.1: From images to high-fidelity 3d assets with production-ready pbr material. arXiv preprint arXiv:2506.15442, 2025

  53. [53]

    $\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization

    Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, et al. π0.5: a vision-language-action model with open-world generalization. arXiv preprint arXiv:2504.16054, 2025

  54. [54]

    Lvsm: A large view synthesis model with minimal 3d inductive bias

    Haian Jin, Hanwen Jiang, Hao Tan, Kai Zhang, Sai Bi, Tianyuan Zhang, Fujun Luan, Noah Snavely, and Zexiang Xu. Lvsm: A large view synthesis model with minimal 3d inductive bias. arXiv preprint arXiv:2410.17242, 2024

  55. [55]

    MapAnything: Universal Feed-Forward Metric 3D Reconstruction

    Nikhil Keetha, Norman Müller, Johannes Schönberger, Lorenzo Porzi, Yuchen Zhang, Tobias Fischer, Arno Knapitsch, Duncan Zauss, Ethan Weber, Nelson Antunes, et al. Mapanything: Universal feed-forward metric 3d reconstruction. arXiv preprint arXiv:2509.13414, 2025

  56. [56]

    Causal reasoning and large language models: Opening a new frontier for causality

    Emre Kiciman, Robert Ness, Amit Sharma, and Chenhao Tan. Causal reasoning and large language models: Opening a new frontier for causality. Transactions on Machine Learning Research, 2023

  57. [57]

    AI2-THOR: An Interactive 3D Environment for Visual AI

    Eric Kolve, Roozbeh Mottaghi, Winson Han, Eli VanderBilt, Luca Weihs, Alvaro Herrasti, Matt Deitke, Kiana Ehsani, Daniel Gordon, Yuke Zhu, et al. Ai2-thor: An interactive 3d environment for visual ai.arXiv preprint arXiv:1712.05474, 2017

  58. [58]

    3d and 4d world modeling: A survey

    Lingdong Kong, Wesley Yang, Jianbiao Mei, Youquan Liu, Ao Liang, Dekai Zhu, Dongyue Lu, Wei Yin, Xiaotao Hu, Mingkai Jia, et al. 3d and 4d world modeling: A survey. arXiv preprint arXiv:2509.07996, 2025

  59. [59]

    A path towards autonomous machine intelligence version 0.9

    Yann LeCun et al. A path towards autonomous machine intelligence version 0.9.2, 2022-06-27. Open Review, 62(1):1–62, 2022

  60. [60]

    Omninwm: Omniscient driving navigation world models

    Bohan Li, Zhuang Ma, Dalong Du, Baorui Peng, Zhujin Liang, Zhenqiang Liu, Chao Ma, Yueming Jin, Hao Zhao, Wenjun Zeng, et al. Omninwm: Omniscient driving navigation world models. arXiv preprint arXiv:2510.18313, 2025

  61. [61]

    Semantic routing: Exploring multi-layer llm feature weighting for diffusion transformers

    Bozhou Li, Yushuo Guan, Haolin Li, Bohan Zeng, Yiyan Ji, Yue Ding, Pengfei Wan, Kun Gai, Yuanxing Zhang, and Wentao Zhang. Semantic routing: Exploring multi-layer llm feature weighting for diffusion transformers. arXiv preprint arXiv:2602.03510, 2026

  62. [62]

    Momagen: Generating demonstrations under soft and hard constraints for multi-step bimanual mobile manipulation

    Chengshu Li, Mengdi Xu, Arpit Bahety, Hang Yin, Yunfan Jiang, Huang Huang, Josiah Wong, Sujay Garlanka, Cem Gokmen, Ruohan Zhang, et al. Momagen: Generating demonstrations under soft and hard constraints for multi-step bimanual mobile manipulation. arXiv preprint arXiv:2510.18316, 2025

  63. [63]

    DA$^2$: Depth anything in any direction

    Haodong Li, Wangguangdong Zheng, Jing He, Yuhao Liu, Xin Lin, Xin Yang, Ying-Cong Chen, and Chunchao Guo. DA$^2$: Depth anything in any direction. arXiv preprint arXiv:2509.26618, 2025

  64. [64]

    Spatialladder: Progressive training for spatial reasoning in vision-language models

    Hongxing Li, Dingming Li, Zixuan Wang, Yuchen Yan, Hang Wu, Wenqi Zhang, Yongliang Shen, Weiming Lu, Jun Xiao, and Yueting Zhuang. Spatialladder: Progressive training for spatial reasoning in vision-language models. arXiv preprint arXiv:2510.08531, 2025

  65. [65]

    Hunyuan-gamecraft: High-dynamic interactive game video generation with hybrid history condition

    Jiaqi Li, Junshu Tang, Zhiyong Xu, Longhuang Wu, Yuan Zhou, Shuai Shao, Tianbao Yu, Zhiguo Cao, and Qinglin Lu. Hunyuan-gamecraft: High-dynamic interactive game video generation with hybrid history condition. arXiv preprint arXiv:2506.17201, 2025

  66. [66]

    Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International conference on machine learning, pages 19730–19742, 2023

  67. [67]

    Causal World Modeling for Robot Control

    Lin Li, Qihang Zhang, Yiming Luo, Shuai Yang, Ruilin Wang, Fei Han, Mingrui Yu, Zelin Gao, Nan Xue, Xing Zhu, et al. Causal world modeling for robot control.arXiv preprint arXiv:2601.21998, 2026

  68. [68]

    Worldgrow: Generating infinite 3d world

    Sikuang Li, Chen Yang, Jiemin Fang, Taoran Yi, Jia Lu, Jiazhong Cen, Lingxi Xie, Wei Shen, and Qi Tian. Worldgrow: Generating infinite 3d world. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 6433–6441, 2026

  69. [69]

    Drivingdiffusion: Layout-guided multi-view driving scenarios video generation with latent diffusion model

    Xiaofan Li, Yifu Zhang, and Xiaoqing Ye. Drivingdiffusion: Layout-guided multi-view driving scenarios video generation with latent diffusion model. In European Conference on Computer Vision, pages 469–485. Springer, 2024

  70. [70]

    A comprehensive survey on world models for embodied AI

    Xinqing Li, Xin He, Le Zhang, Min Wu, Xiaoli Li, and Yun Liu. A comprehensive survey on world models for embodied ai. arXiv preprint arXiv:2510.16732, 2025

  71. [71]

    Flashworld: High-quality 3d scene generation within seconds

    Xinyang Li, Tengfei Wang, Zixiao Gu, Shengchuan Zhang, Chunchao Guo, and Liujuan Cao. Flashworld: High-quality 3d scene generation within seconds. arXiv preprint arXiv:2510.13678, 2025

  72. [72]

    Evqascore: Efficient video question answering data evaluation

    Hao Liang, Zirong Chen, and Wentao Zhang. Evqascore: Efficient video question answering data evaluation. arXiv preprint arXiv:2411.06908, 2024

  73. [73]

    Keyvideollm: Towards large-scale video keyframe selection

    Hao Liang, Jiapeng Li, Tianyi Bai, Xijie Huang, Linzhuang Sun, Zhengren Wang, Conghui He, Bin Cui, Chong Chen, and Wentao Zhang. Keyvideollm: Towards large-scale video keyframe selection. arXiv preprint arXiv:2407.03104, 2024

  74. [74]

    Synth-empathy: Towards high-quality synthetic empathy data

    Hao Liang, Linzhuang Sun, Jingxuan Wei, Xijie Huang, Linkun Sun, Bihui Yu, Conghui He, and Wentao Zhang. Synth-empathy: Towards high-quality synthetic empathy data. arXiv preprint arXiv:2407.21669, 2024

  75. [75]

    Dataflow: An llm-driven framework for unified data preparation and workflow automation in the era of data-centric ai

    Hao Liang, Xiaochen Ma, Zhou Liu, Zhen Hao Wong, Zhengyang Zhao, Zimo Meng, Runming He, Chengyu Shen, Qifeng Cai, Zhaoyang Han, et al. Dataflow: An llm-driven framework for unified data preparation and workflow automation in the era of data-centric ai. arXiv preprint arXiv:2512.16676, 2025

  76. [76]

    Mathclean: A benchmark for synthetic mathematical data cleaning

    Hao Liang, Meiyi Qiang, Yuying Li, Zefeng He, Yongzhen Guo, Zhengzhou Zhu, Wentao Zhang, and Bin Cui. Mathclean: A benchmark for synthetic mathematical data cleaning. arXiv preprint arXiv:2502.19058, 2025

  77. [77]

    Multimodal reasoning for science: Technical report and 1st place solution to the icml 2025 seephys challenge

    Hao Liang, Ruitao Wu, Bohan Zeng, Junbo Niu, Wentao Zhang, and Bin Dong. Multimodal reasoning for science: Technical report and 1st place solution to the icml 2025 seephys challenge. arXiv preprint arXiv:2509.06079, 2025

  78. [78]

    Data preparation for large language models

    Hao Liang, Zhen Hao Wong, Ruitong Liu, Yuhan Wang, Meiyi Qiang, Zhengyang Zhao, Chengyu Shen, Conghui He, Wentao Zhang, and Bin Cui. Data preparation for large language models. Journal of Computer Science and Technology, 2026. doi: 10.1007/s11390-026-5948-8

  79. [79]

    Towards next-generation llm training: From the data-centric perspective

    Hao Liang, Zhengyang Zhao, Zhaoyang Han, Meiyi Qiang, Xiaochen Ma, Bohan Zeng, Qifeng Cai, Zhiyu Li, Linpeng Tang, Wentao Zhang, et al. Towards next-generation llm training: From the data-centric perspective. arXiv preprint arXiv:2603.14712, 2026

  80. [80]

    Dataflex: A unified framework for data-centric dynamic training of large language models

    Hao Liang, Zhengyang Zhao, Meiyi Qiang, Mingrui Chen, Lu Ma, Rongyi Yu, Hengyi Feng, Shixuan Sun, Zimo Meng, Xiaochen Ma, et al. Dataflex: A unified framework for data-centric dynamic training of large language models. arXiv preprint arXiv:2603.26164, 2026

Showing first 80 references.