arxiv: 2605.17912 · v1 · pith:F7VFIDNE · submitted 2026-05-18 · 💻 cs.RO · cs.CV

WorldArena 2.0: Extending Embodied World Model Benchmarking on Modality, Functionality and Platform

Pith reviewed 2026-05-20 10:56 UTC · model grok-4.3

classification 💻 cs.RO cs.CV
keywords embodied world models · benchmarking · visuotactile perception · reinforcement learning environments · robotic platforms · multimodal evaluation · cross-platform performance

The pith

WorldArena 2.0 broadens embodied world model testing to touch sensing, interactive policy training, and real robot platforms.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents WorldArena 2.0 as an expanded benchmark for world models that let agents predict action outcomes and environmental dynamics. Prior benchmarks stayed limited to vision-only forecasts, offline tasks, and pure simulation, leaving gaps as models grow more capable. The update adds tactile input alongside vision, turns models into active training environments for reinforcement learning policies, and shifts evaluation onto both simulated and physical robots with varied bodies. A single protocol then measures how well models perceive, support interaction, and transfer across these settings. This creates a shared way to monitor advances in embodied intelligence.

Core claim

WorldArena 2.0 extends embodied world model evaluation along three axes: modality from vision-only to visuotactile, functionality from policy evaluation and planning to use as interactive RL environments for policy optimization, and platform from simulator-only to a suite of simulated and real-world robotic settings across multiple embodiments, all under a standardized protocol that assesses perceptual quality, interactive utility, and cross-platform performance.

What carries the argument

The WorldArena 2.0 benchmark, which broadens evaluation along modality, functionality, and platform dimensions under one standardized protocol.

If this is right

  • World models become testable for their ability to incorporate tactile signals into future predictions.
  • The same model can now be evaluated both as a predictor and as a live environment that improves robot policies through trial-and-error interaction.
  • Performance gaps between simulation and physical robots can be quantified directly for each model.
  • Progress in embodied world models can be tracked consistently across vision, touch, planning, and real hardware.
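
The simulation-to-hardware gap named in the third point can be made concrete with a minimal sketch. It assumes each model has one success rate measured in simulation and one measured on a physical robot for the same task set. The class name, field names, model names, and numbers are hypothetical placeholders, not values or APIs from the paper.

```python
from dataclasses import dataclass


@dataclass
class PlatformResult:
    """Success rates for one world model (hypothetical record layout)."""
    model: str
    sim_success: float   # success rate in simulation, in [0, 1]
    real_success: float  # success rate on the physical robot, in [0, 1]


def sim_to_real_gap(result: PlatformResult) -> float:
    """Absolute drop in success rate when moving from simulation to hardware."""
    return result.sim_success - result.real_success


# Placeholder numbers for illustration only.
results = [
    PlatformResult("model_a", sim_success=0.80, real_success=0.60),
    PlatformResult("model_b", sim_success=0.70, real_success=0.65),
]

# Round to two decimals to hide floating-point noise in the subtraction.
gaps = {r.model: round(sim_to_real_gap(r), 2) for r in results}
```

A positive gap means the model looks better in simulation than on hardware; comparing gaps across models is one simple way to read the per-model differences the benchmark is meant to expose.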

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • A shared benchmark of this form could reduce duplication of effort when different research groups test new world-model architectures.
  • Extending the protocol later to include audio or proprioception would follow the same logic already used for touch.
  • Models that rank high here may still require separate checks for long-horizon safety before deployment on physical systems.

Load-bearing premise

The chosen additions for modality, functionality, and platform together with the fixed protocol are enough to judge increasingly capable world models without adding new evaluation biases or missing important gaps.

What would settle it

If top-scoring models on WorldArena 2.0 show no corresponding gains in real-world task success rates outside the benchmark suite, the claim that it provides a sufficient testbed would be weakened.
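
One way to run this check is a rank correlation between benchmark scores and task success measured outside the benchmark suite. The sketch below is an illustration under stated assumptions: it uses Spearman's rank correlation, which is this example's choice rather than a metric the paper is known to prescribe, and all scores are made-up placeholders.

```python
def ranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = avg_rank
        i = j + 1
    return out


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    return pearson(ranks(x), ranks(y))


# Placeholder data: one entry per model (hypothetical values).
benchmark_scores = [0.62, 0.71, 0.55, 0.80]
external_success = [0.40, 0.45, 0.30, 0.42]
rho = spearman(benchmark_scores, external_success)
```

A value of `rho` near 1 would mean benchmark rankings track outside-the-suite success; a value near 0 would support the weakening scenario described above.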

Figures

Figures reproduced from arXiv: 2605.17912 by Chen Gao, Dhruv Shah, Gordon Wetzstein, Haisheng Su, Haoyi Duan, Jun Zhu, Lei Jin, Tat-Seng Chua, Weikang Su, Wei Wu, Weizhen He, Wenwu Zhu, Xihui Liu, Xin Jin, Xin Zhang, Yiding Ma, Yinzhou Tang, Yonghong Tian, Yong Li, Yu Shang, Zhaolu Wang, Zhaoxiang Zhang, Zhibo Chen, Zhuohang Li, Ziyou Wang.

Figure 1: Overview of the extension from WorldArena to WorldArena 2.0 along three dimensions: …
Figure 2: The standardized visuotactile world model architecture design (a) and the visuotactile …
Figure 3: WorldArena 2.0 framework for leveraging world models as RL environments, comprising …
Figure 4: Overview of three test platforms in WorldArena 2.0: RoboTwin 2.0, LIBERO, and a …
Figure 5: Relationship between environment training steps and the policy success rate in Click Bell …
Figure 6: Cross-platform task success rate correlation between RoboTwin, LIBERO and the real …
Figure 7: Cross-platform video quality correlation between RoboTwin and LIBERO.
Figure 8: Cross-platform video quality correlation between RoboTwin and the real-world robotic …
Figure 9: Cross-platform video quality correlation between LIBERO and the real-world robotic data.

Original abstract

World models have emerged as a central paradigm for embodied intelligence, enabling agents to predict action-conditioned future and reason about environmental dynamics. However, existing embodied world model benchmarks are still largely confined to vision-only prediction, offline embodied applications, and simulator-based evaluation, making them insufficient for assessing increasingly comprehensive world models. In this work, we introduce WorldArena 2.0, an expanded benchmark that systematically broadens embodied world model evaluation along three dimensions: modality, functionality, and platform. Along the modality dimension, WorldArena 2.0 extends evaluation from vision-only to visuotactile modalities, enabling assessment of multimodal perception and prediction. Along the functionality dimension, it extends beyond policy evaluation and planning to assess world models as interactive RL environments for policy optimization. Along the platform dimension, it moves beyond simulator-only evaluation to a diverse suite of simulated and real-world robotic settings across multiple embodiments. Under a standardized protocol, WorldArena 2.0 comprehensively evaluates perceptual quality, interactive utility, and cross-platform performance, providing a comprehensive testbed for tracking progress toward embodied world models. The benchmark is available at: https://world-arena.ai.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces WorldArena 2.0 as an expanded benchmark for embodied world models. It extends evaluation along three axes: modality (adding visuotactile to vision-only), functionality (adding use as interactive RL environments for policy optimization beyond planning and offline evaluation), and platform (adding diverse simulated and real robotic embodiments beyond simulators). A standardized protocol is proposed to assess perceptual quality, interactive utility, and cross-platform performance, with the benchmark released at https://world-arena.ai.

Significance. If the extensions are accompanied by concrete validation data and closed-loop transfer results, the benchmark could serve as a useful standardized testbed for tracking progress on more comprehensive embodied world models, filling gaps left by existing vision-only, offline, and simulator-centric suites.

major comments (2)
  1. [Functionality dimension] Functionality dimension (as described in the abstract and § on extensions): the claim that world models can be assessed as interactive RL environments requires closed-loop policy transfer experiments. Policies optimized inside the world model must be transferred to the target simulated or real platforms and compared against direct baselines; without such results, in-model success rates risk inflation from compounding inaccuracies and do not substantiate the interactive utility dimension.
  2. [Abstract] Abstract and overall evaluation claims: the manuscript describes the intended extensions and standardized protocol but supplies no concrete results, validation data, error analysis, or quantitative tables. This absence prevents verification that the chosen modality, functionality, and platform extensions actually deliver comprehensive coverage without new biases or gaps.
minor comments (1)
  1. [Benchmark release] The benchmark availability statement could include explicit details on protocol documentation, dataset access, and reproducibility instructions beyond the URL.
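
The compounding-inaccuracy concern in major comment 1 can be illustrated with a toy calculation. The sketch assumes each world-model step is "accurate enough" with a fixed probability and that steps fail independently. Both are simplifying assumptions made for this example, and the function name is hypothetical.

```python
def rollout_fidelity(per_step_accuracy: float, horizon: int) -> float:
    """Probability that every step of a rollout is accurate.

    Assumes independent per-step errors with a fixed accuracy, which is a
    deliberate simplification of how real world-model errors accumulate.
    """
    if not 0.0 <= per_step_accuracy <= 1.0:
        raise ValueError("per_step_accuracy must lie in [0, 1]")
    if horizon < 0:
        raise ValueError("horizon must be non-negative")
    return per_step_accuracy ** horizon


# Fidelity of a rollout at several horizons for a 99%-accurate step model.
fidelity_by_horizon = {h: rollout_fidelity(0.99, h) for h in (10, 50, 100)}
```

Under this toy model, 99% per-step accuracy leaves roughly a 37% chance that a 100-step rollout is accurate throughout, which is why in-model success rates alone say little about how a policy behaves after transfer.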

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript introducing WorldArena 2.0. We address the major comments point by point below, agreeing where revisions are needed to strengthen the validation of the benchmark extensions.

point-by-point responses
  1. Referee: [Functionality dimension] Functionality dimension (as described in the abstract and § on extensions): the claim that world models can be assessed as interactive RL environments requires closed-loop policy transfer experiments. Policies optimized inside the world model must be transferred to the target simulated or real platforms and compared against direct baselines; without such results, in-model success rates risk inflation from compounding inaccuracies and do not substantiate the interactive utility dimension.

    Authors: We agree that demonstrating closed-loop policy transfer is important for fully substantiating the interactive utility dimension. The manuscript currently focuses on establishing the standardized protocol for using world models as RL environments and includes preliminary in-model optimization results along with some cross-platform consistency checks. However, we acknowledge that more extensive transfer experiments comparing policies trained in the world model against direct baselines on both simulated and real platforms would provide stronger evidence against compounding errors. We will add these closed-loop transfer results and analyses in the revised manuscript.

    revision: yes

  2. Referee: [Abstract] Abstract and overall evaluation claims: the manuscript describes the intended extensions and standardized protocol but supplies no concrete results, validation data, error analysis, or quantitative tables. This absence prevents verification that the chosen modality, functionality, and platform extensions actually deliver comprehensive coverage without new biases or gaps.

    Authors: We appreciate this point. The current manuscript emphasizes the design of the three-dimensional extensions and the unified evaluation protocol, supported by illustrative examples rather than exhaustive quantitative benchmarks. We recognize that including concrete validation data, error analyses, and quantitative tables would better demonstrate the coverage and lack of introduced biases. We will expand the evaluation sections with additional results, tables, and analyses in the revision to address this.

    revision: yes

Circularity Check

0 steps flagged

No circularity: benchmark definition is self-contained with no derived predictions or load-bearing self-citations

full rationale

The paper introduces WorldArena 2.0 as an expanded benchmark suite that broadens evaluation along modality, functionality, and platform dimensions under a standardized protocol. No equations, fitted parameters, or predictions are described that could reduce to the paper's own inputs by construction. The central claims concern the creation and application of this independent evaluation testbed rather than deriving results from prior self-citations or ansatzes. The functionality extension to interactive RL environments is presented as a direct assessment protocol without any reduction to fitted quantities or uniqueness theorems imported from the authors' prior work. This is the most common honest finding for benchmark papers that define new evaluation protocols without claiming to derive quantitative results from their own definitions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The benchmark relies on domain assumptions about what constitutes fair multimodal and interactive evaluation rather than introducing new fitted parameters or invented entities.

axioms (1)
  • domain assumption: A standardized protocol across modalities and platforms can produce comparable and meaningful assessments of world model quality.
    Invoked when claiming the benchmark provides comprehensive tracking of progress.

pith-pipeline@v0.9.0 · 5831 in / 1216 out tokens · 69429 ms · 2026-05-20T10:56:09.906647+00:00 · methodology
