WorldArena 2.0: Extending Embodied World Model Benchmarking on Modality, Functionality and Platform

Chen Gao; Dhruv Shah; Gordon Wetzstein; Haisheng Su; Haoyi Duan; Jun Zhu; Lei Jin; Tat-Seng Chua; Weikang Su; Wei Wu

arXiv: 2605.17912 · v1 · pith:F7VFIDNE · submitted 2026-05-18 · 💻 cs.RO · cs.CV

Yu Shang, Yinzhou Tang, Yiding Ma, Zhuohang Li, Lei Jin, Weikang Su, Xin Jin, Zhaolu Wang, Ziyou Wang, Xin Zhang, Haisheng Su, Weizhen He, Wei Wu, Haoyi Duan, Gordon Wetzstein, Xihui Liu, Dhruv Shah, Zhaoxiang Zhang, Zhibo Chen, Jun Zhu, Yonghong Tian, Tat-Seng Chua, Wenwu Zhu, Chen Gao, Yong Li


Pith reviewed 2026-05-20 10:56 UTC · model grok-4.3

classification 💻 cs.RO cs.CV

keywords: embodied world models · benchmarking · visuotactile perception · reinforcement learning environments · robotic platforms · multimodal evaluation · cross-platform performance


The pith

WorldArena 2.0 broadens embodied world model testing to touch sensing, interactive policy training, and real robot platforms.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents WorldArena 2.0 as an expanded benchmark for world models that let agents predict action outcomes and environmental dynamics. Prior benchmarks stayed limited to vision-only forecasts, offline tasks, and pure simulation, leaving gaps as models grow more capable. The update adds tactile input alongside vision, turns models into active training environments for reinforcement learning policies, and shifts evaluation onto both simulated and physical robots with varied bodies. A single protocol then measures how well models perceive, support interaction, and transfer across these settings. This creates a shared way to monitor advances in embodied intelligence.

Core claim

WorldArena 2.0 extends embodied world model evaluation along three axes: modality from vision-only to visuotactile, functionality from policy evaluation and planning to use as interactive RL environments for policy optimization, and platform from simulator-only to a suite of simulated and real-world robotic settings across multiple embodiments, all under a standardized protocol that assesses perceptual quality, interactive utility, and cross-platform performance.

What carries the argument

The WorldArena 2.0 benchmark, which broadens evaluation along modality, functionality, and platform dimensions under one standardized protocol.

If this is right

World models become testable for their ability to incorporate tactile signals into future predictions.
The same model can now be evaluated both as a predictor and as a live environment that improves robot policies through trial-and-error interaction.
Performance gaps between simulation and physical robots can be quantified directly for each model.
Progress in embodied world models can be tracked consistently across vision, touch, planning, and real hardware.
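
The third point above, quantifying the gap between simulation and physical robots, can be sketched as a per-model difference in task success rate. This is a minimal illustration only: the function name, model names, and numbers are hypothetical and are not taken from the paper or its protocol.

```python
from typing import Dict


def sim_to_real_gap(
    sim_success: Dict[str, float], real_success: Dict[str, float]
) -> Dict[str, float]:
    """Per-model gap: simulated success rate minus real-robot success rate.

    Only models evaluated on both platforms are included.
    """
    shared = sorted(set(sim_success) & set(real_success))
    return {name: sim_success[name] - real_success[name] for name in shared}


# Hypothetical success rates, for illustration only.
sim = {"model_a": 0.80, "model_b": 0.65, "model_c": 0.70}
real = {"model_a": 0.55, "model_b": 0.60}

gaps = sim_to_real_gap(sim, real)
for name, gap in gaps.items():
    print(f"{name}: sim-to-real gap = {gap:+.2f}")
```

A positive gap means the model looks better in simulation than on hardware; `model_c` is skipped because it has no real-robot measurement in this made-up data.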

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

A shared benchmark of this form could reduce duplication of effort when different research groups test new world-model architectures.
Extending the protocol later to include audio or proprioception would follow the same logic already used for touch.
Models that rank high here may still require separate checks for long-horizon safety before deployment on physical systems.

Load-bearing premise

The chosen additions for modality, functionality, and platform together with the fixed protocol are enough to judge increasingly capable world models without adding new evaluation biases or missing important gaps.

What would settle it

If top-scoring models on WorldArena 2.0 show no corresponding gains in real-world task success rates outside the benchmark suite, the claim that it provides a sufficient testbed would be weakened.
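
One way to run such a check is a rank correlation between benchmark scores and task success measured outside the suite. The sketch below computes Spearman's rho in plain Python; the score lists are hypothetical placeholders, and the paper does not say it uses this statistic.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical data: one entry per model.
benchmark_scores = [0.9, 0.7, 0.5, 0.3]
external_success = [0.2, 0.6, 0.4, 0.5]
rho = spearman(benchmark_scores, external_success)
print(f"Spearman rho = {rho:.2f}")
```

A rho near zero or below, as in this invented example, is the pattern that would weaken the sufficiency claim; a rho near 1 would support it.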

Figures

Figures reproduced from arXiv: 2605.17912 by Chen Gao, Dhruv Shah, Gordon Wetzstein, Haisheng Su, Haoyi Duan, Jun Zhu, Lei Jin, Tat-Seng Chua, Weikang Su, Wei Wu, Weizhen He, Wenwu Zhu, Xihui Liu, Xin Jin, Xin Zhang, Yiding Ma, Yinzhou Tang, Yonghong Tian, Yong Li, Yu Shang, Zhaolu Wang, Zhaoxiang Zhang, Zhibo Chen, Zhuohang Li, Ziyou Wang.

**Figure 2.** The standardized visuotactile world model architecture design (a) and the visuotactile ...

**Figure 3.** WorldArena 2.0 framework for leveraging world models as RL environments, comprising ...

**Figure 4.** Overview of three test platforms in WorldArena 2.0: RoboTwin 2.0, LIBERO, and a ...

**Figure 5.** Relationship between environment training steps and the policy success rate in Click Bell ...

**Figure 6.** Cross-platform task success rate correlation between RoboTwin, LIBERO and the real ...

**Figure 7.** Cross-platform video quality correlation between RoboTwin and LIBERO.

**Figure 8.** Cross-platform video quality correlation between RoboTwin and the real-world robotic ...

**Figure 9.** Cross-platform video quality correlation between LIBERO and the real-world robotic data.

Original abstract

World models have emerged as a central paradigm for embodied intelligence, enabling agents to predict action-conditioned future and reason about environmental dynamics. However, existing embodied world model benchmarks are still largely confined to vision-only prediction, offline embodied applications, and simulator-based evaluation, making them insufficient for assessing increasingly comprehensive world models. In this work, we introduce WorldArena 2.0, an expanded benchmark that systematically broadens embodied world model evaluation along three dimensions: modality, functionality, and platform. Along the modality dimension, WorldArena 2.0 extends evaluation from vision-only to visuotactile modalities, enabling assessment of multimodal perception and prediction. Along the functionality dimension, it extends beyond policy evaluation and planning to assess world models as interactive RL environments for policy optimization. Along the platform dimension, it moves beyond simulator-only evaluation to a diverse suite of simulated and real-world robotic settings across multiple embodiments. Under a standardized protocol, WorldArena 2.0 comprehensively evaluates perceptual quality, interactive utility, and cross-platform performance, providing a comprehensive testbed for tracking progress toward embodied world models. The benchmark is available at: https://world-arena.ai.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

WorldArena 2.0 extends embodied world model benchmarking to visuotactile, interactive RL, and real platforms, but the interactive part needs transfer validation to be convincing.


WorldArena 2.0 adds three things to the earlier benchmark: visuotactile modalities for perception, the ability to use the model as an interactive RL environment for policy optimization, and testing across simulated and real robotic platforms. The work does well in laying out these extensions explicitly and providing a standardized evaluation protocol. Making the benchmark available online at world-arena.ai is a practical move that could help standardize comparisons in the field.

The soft spot is the functionality dimension. The claim that this assesses interactive utility rests on world models serving as RL environments. That only works if policies trained in the model are transferred back to the platforms and evaluated there against direct baselines. The abstract gives no indication that such transfer experiments or comparisons are included. If they are absent, then success rates inside the model could look better than they are because model inaccuracies grow under closed-loop control. The modality and platform parts look more direct, though real-world testing always brings its own challenges with reproducibility.

This paper is for robotics researchers focused on world models and embodied intelligence. A reader interested in standardized benchmarks for multimodal and interactive capabilities would get value from the protocol and the released suite. It deserves a serious referee to examine the actual results, any error analysis, and whether the protocol closes the gaps mentioned. I would recommend sending it to peer review.
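
The worry about inaccuracies growing over a rollout can be made concrete with a toy example. The sketch below assumes a scalar linear system whose learned model has a small error in its dynamics coefficient; this setup is a stand-in chosen for illustration, not the paper's world models or tasks.

```python
from typing import List


def rollout_errors(
    a_true: float, a_model: float, x0: float, horizon: int
) -> List[float]:
    """Absolute state error of an autoregressive model rollout at each step.

    True dynamics:  x[t+1] = a_true  * x[t]
    Model dynamics: x[t+1] = a_model * x[t], fed with its own predictions.
    """
    x_true = x_model = x0
    errors = []
    for _ in range(horizon):
        x_true = a_true * x_true
        x_model = a_model * x_model  # the model never sees the true state again
        errors.append(abs(x_model - x_true))
    return errors


# Hypothetical values: a 2% error in the coefficient, rolled out for 50 steps.
errors = rollout_errors(a_true=1.0, a_model=1.02, x0=1.0, horizon=50)
print(f"error after 1 step:   {errors[0]:.3f}")
print(f"error after 50 steps: {errors[-1]:.3f}")
```

In this toy case the error grows faster than linearly with the horizon, which is why a policy scored only inside the model can look better than it would on the target platform.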

Referee Report

2 major / 1 minor

Summary. The paper introduces WorldArena 2.0 as an expanded benchmark for embodied world models. It extends evaluation along three axes: modality (adding visuotactile to vision-only), functionality (adding use as interactive RL environments for policy optimization beyond planning and offline evaluation), and platform (adding diverse simulated and real robotic embodiments beyond simulators). A standardized protocol is proposed to assess perceptual quality, interactive utility, and cross-platform performance, with the benchmark released at https://world-arena.ai.

Significance. If the extensions are accompanied by concrete validation data and closed-loop transfer results, the benchmark could serve as a useful standardized testbed for tracking progress on more comprehensive embodied world models, filling gaps left by existing vision-only, offline, and simulator-centric suites.

major comments (2)

[Functionality dimension] As described in the abstract and the section on extensions, the claim that world models can be assessed as interactive RL environments requires closed-loop policy transfer experiments. Policies optimized inside the world model must be transferred to the target simulated or real platforms and compared against direct baselines; without such results, in-model success rates risk inflation from compounding inaccuracies and do not substantiate the interactive utility dimension.
[Abstract] Abstract and overall evaluation claims: the manuscript describes the intended extensions and standardized protocol but supplies no concrete results, validation data, error analysis, or quantitative tables. This absence prevents verification that the chosen modality, functionality, and platform extensions actually deliver comprehensive coverage without new biases or gaps.

minor comments (1)

[Benchmark release] The benchmark availability statement could include explicit details on protocol documentation, dataset access, and reproducibility instructions beyond the URL.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript introducing WorldArena 2.0. We address the major comments point by point below, agreeing where revisions are needed to strengthen the validation of the benchmark extensions.


Referee: [Functionality dimension] As described in the abstract and the section on extensions, the claim that world models can be assessed as interactive RL environments requires closed-loop policy transfer experiments. Policies optimized inside the world model must be transferred to the target simulated or real platforms and compared against direct baselines; without such results, in-model success rates risk inflation from compounding inaccuracies and do not substantiate the interactive utility dimension.

Authors: We agree that demonstrating closed-loop policy transfer is important for fully substantiating the interactive utility dimension. The manuscript currently focuses on establishing the standardized protocol for using world models as RL environments and includes preliminary in-model optimization results along with some cross-platform consistency checks. However, we acknowledge that more extensive transfer experiments comparing policies trained in the world model against direct baselines on both simulated and real platforms would provide stronger evidence against compounding errors. We will add these closed-loop transfer results and analyses in the revised manuscript.

revision: yes

Referee: [Abstract] Abstract and overall evaluation claims: the manuscript describes the intended extensions and standardized protocol but supplies no concrete results, validation data, error analysis, or quantitative tables. This absence prevents verification that the chosen modality, functionality, and platform extensions actually deliver comprehensive coverage without new biases or gaps.

Authors: We appreciate this point. The current manuscript emphasizes the design of the three-dimensional extensions and the unified evaluation protocol, supported by illustrative examples rather than exhaustive quantitative benchmarks. We recognize that including concrete validation data, error analyses, and quantitative tables would better demonstrate the coverage and lack of introduced biases. We will expand the evaluation sections with additional results, tables, and analyses in the revision to address this.

revision: yes
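
The transfer comparison discussed in this exchange could be reported with confidence intervals, so that a gap between in-model and on-platform success rates is not read off a handful of trials. The sketch below uses the Wilson score interval for a binomial proportion; the trial counts are hypothetical, and nothing here is taken from the paper's own analysis.

```python
import math
from typing import Tuple


def wilson_interval(
    successes: int, trials: int, z: float = 1.96
) -> Tuple[float, float]:
    """Wilson score interval for a success rate (z=1.96 gives roughly 95%)."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need trials > 0 and 0 <= successes <= trials")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def intervals_overlap(a: Tuple[float, float], b: Tuple[float, float]) -> bool:
    """True if two closed intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Hypothetical counts: the same policy scored inside the world model
# and after transfer to the target platform.
in_model = wilson_interval(45, 50)
transferred = wilson_interval(30, 50)

print(f"in-model:    {in_model[0]:.2f} to {in_model[1]:.2f}")
print(f"transferred: {transferred[0]:.2f} to {transferred[1]:.2f}")
print("intervals overlap:", intervals_overlap(in_model, transferred))
```

Non-overlapping intervals, as in this invented example, would suggest that the in-model rate overstates on-platform performance; overlap alone is a rough screen rather than a formal test.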

Circularity Check

0 steps flagged

No circularity: benchmark definition is self-contained with no derived predictions or load-bearing self-citations


The paper introduces WorldArena 2.0 as an expanded benchmark suite that broadens evaluation along modality, functionality, and platform dimensions under a standardized protocol. No equations, fitted parameters, or predictions are described that could reduce to the paper's own inputs by construction. The central claims concern the creation and application of this independent evaluation testbed rather than deriving results from prior self-citations or ansatzes. The functionality extension to interactive RL environments is presented as a direct assessment protocol without any reduction to fitted quantities or uniqueness theorems imported from the authors' prior work. This is the most common honest finding for benchmark papers that define new evaluation protocols without claiming to derive quantitative results from their own definitions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The benchmark relies on domain assumptions about what constitutes fair multimodal and interactive evaluation rather than introducing new fitted parameters or invented entities.

axioms (1)

Domain assumption: A standardized protocol across modalities and platforms can produce comparable and meaningful assessments of world model quality. Invoked when claiming the benchmark provides comprehensive tracking of progress.



Reference graph

Works this paper leans on

51 extracted references · 51 canonical work pages · 16 internal anchors

[1]

Understanding world or predicting future? a comprehensive survey of world models.ACM Computing Surveys, 58(3):1–38, 2025

Jingtao Ding, Yunke Zhang, Yu Shang, Yuheng Zhang, Zefang Zong, Jie Feng, Yuan Yuan, Hongyuan Su, Nian Li, Nicholas Sukiennik, et al. Understanding world or predicting future? a comprehensive survey of world models.ACM Computing Surveys, 58(3):1–38, 2025

work page 2025
[2]

A survey of embodied world models

Yu Shang, Yinzhou Tang, Xin Zhang, Shengyuan Wang, Yuwei Yan, Honglin Zhang, Zhiheng Zheng, Jie Zhao, Jie Feng, Chen Gao, et al. A survey of embodied world models. 2026

work page 2026
[3]

A survey: Learning embodied intelligence from physical simulators and world models.arXiv preprint arXiv:2507.00917, 2025

Xiaoxiao Long, Qingrui Zhao, Kaiwen Zhang, Zihao Zhang, Dingrui Wang, Yumeng Liu, Zhengjie Shu, Yi Lu, Shouzheng Wang, Xinzhe Wei, et al. A survey: Learning embodied intelligence from physical simulators and world models.arXiv preprint arXiv:2507.00917, 2025

work page arXiv 2025
[4]

Worldmodelbench: Judging video generation models as world models.arXiv preprint arXiv:2502.20694, 2025

Dacheng Li, Yunhao Fang, Yukang Chen, Shuo Yang, Shiyi Cao, Justin Wong, Michael Luo, Xiaolong Wang, Hongxu Yin, Joseph E Gonzalez, et al. Worldmodelbench: Judging video generation models as world models.arXiv preprint arXiv:2502.20694, 2025

work page arXiv 2025
[5]

Worldscore: A unified evaluation benchmark for world generation.arXiv preprint arXiv:2504.00983, 2025

Haoyi Duan, Hong-Xing Yu, Sirui Chen, Li Fei-Fei, and Jiajun Wu. Worldscore: A unified evaluation benchmark for world generation.arXiv preprint arXiv:2504.00983, 2025

work page arXiv 2025
[6]

Ewmbench: Evaluating scene, motion, and semantic quality in embodied world models.arXiv preprint arXiv:2505.09694, 2025

Hu Yue, Siyuan Huang, Yue Liao, Shengcong Chen, Pengfei Zhou, Liliang Chen, Maoqing Yao, and Guanghui Ren. Ewmbench: Evaluating scene, motion, and semantic quality in embodied world models.arXiv preprint arXiv:2505.09694, 2025

work page arXiv 2025
[7]

Worldsimbench: Towards video generation models as world simulators.arXiv preprint arXiv:2410.18072, 2024

Yiran Qin, Zhelun Shi, Jiwen Yu, Xijun Wang, Enshen Zhou, Lijun Li, Zhenfei Yin, Xihui Liu, Lu Sheng, Jing Shao, et al. Worldsimbench: Towards video generation models as world simulators.arXiv preprint arXiv:2410.18072, 2024

work page arXiv 2024
[8]

Worldeval: World model as real-world robot policies evaluator.arXiv preprint arXiv:2505.19017, 2025

Yaxuan Li, Yichen Zhu, Junjie Wen, Chaomin Shen, and Yi Xu. Worldeval: World model as real-world robot policies evaluator.arXiv preprint arXiv:2505.19017, 2025

work page arXiv 2025
[9]

Wow, wo, val! a comprehensive embodied world model evaluation turing test.arXiv preprint arXiv:2601.04137, 2026

Chun-Kai Fan, Xiaowei Chi, Xiaozhu Ju, Hao Li, Yong Bao, Yu-Kai Wang, Lizhang Chen, Zhiyuan Jiang, Kuangzhi Ge, Ying Li, et al. Wow, wo, val! a comprehensive embodied world model evaluation turing test.arXiv preprint arXiv:2601.04137, 2026

work page arXiv 2026
[10]

Worldarena: A unified benchmark for evaluating perception and functional utility of embodied world models.arXiv preprint arXiv:2602.08971, 2026

Yu Shang, Zhuohang Li, Yiding Ma, Weikang Su, Xin Jin, Ziyou Wang, Lei Jin, Xin Zhang, Yinzhou Tang, Haisheng Su, et al. Worldarena: A unified benchmark for evaluating perception and functional utility of embodied world models.arXiv preprint arXiv:2602.08971, 2026

work page arXiv 2026
[11]

World-in-world: World models in a closed-loop world.arXiv preprint arXiv:2510.18135, 2025

Jiahan Zhang, Muqing Jiang, Nanru Dai, Taiming Lu, Arda Uzunoglu, Shunchi Zhang, Yana Wei, Jiahao Wang, Vishal M Patel, Paul Pu Liang, et al. World-in-world: World models in a closed-loop world.arXiv preprint arXiv:2510.18135, 2025

work page arXiv 2025
[12]

Univtac: A unified simulation platform for visuo-tactile manipulation data generation, learning, and benchmarking.arXiv preprint arXiv:2602.10093, 2026

Baijun Chen, Weijie Wan, Tianxing Chen, Xianda Guo, Congsheng Xu, Yuanyang Qi, Haojie Zhang, Longyan Wu, Tianling Xu, Zixuan Li, et al. Univtac: A unified simulation platform for visuo-tactile manipulation data generation, learning, and benchmarking.arXiv preprint arXiv:2602.10093, 2026

work page arXiv 2026
[13]

A comprehensive survey on world models for embodied ai.arXiv preprint arXiv:2510.16732, 2025

Xinqing Li, Xin He, Le Zhang, Min Wu, Xiaoli Li, and Yun Liu. A comprehensive survey on world models for embodied ai.arXiv preprint arXiv:2510.16732, 2025. 10

work page arXiv 2025
[14]

Mastering atari with discrete world models

Danijar Hafner, Timothy P Lillicrap, Mohammad Norouzi, and Jimmy Ba. Mastering atari with discrete world models. InInternational Conference on Learning Representations

work page
[15]

Mastering diverse control tasks through world models.Nature, pages 1–7, 2025

Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse control tasks through world models.Nature, pages 1–7, 2025

work page 2025
[16]

Wan: Open and Advanced Large-Scale Video Generative Models

Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[17]

CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer.arXiv preprint arXiv:2408.06072, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[18]

generated vs. non-generated

Zheng Zhu, Xiaofeng Wang, Wangbo Zhao, Chen Min, Bohan Li, Nianchen Deng, Min Dou, Yuqi Wang, Botian Shi, Kai Wang, et al. Is sora a world simulator? a comprehensive survey on general world models and beyond.arXiv preprint arXiv:2405.03520, 2024

work page arXiv 2024
[19]

Cosmos world foundation models for physical ai

Jinwei Gu. Cosmos world foundation models for physical ai. InProceedings of the 3rd International Workshop on Rich Media With Generative AI, pages 39–39, 2025

work page 2025
[20]

Roboscape: Physics-informed embodied world model

Yu Shang, Xin Zhang, Yinzhou Tang, Lei Jin, Chen Gao, Wei Wu, and Yong Li. Roboscape: Physics-informed embodied world model.arXiv preprint arXiv:2506.23135, 2025

work page arXiv 2025
[21]

Ctrl-World: A Controllable Generative World Model for Robot Manipulation

Yanjiang Guo, Lucy Xiaoyang Shi, Jianyu Chen, and Chelsea Finn. Ctrl-world: A controllable generative world model for robot manipulation.arXiv preprint arXiv:2510.10125, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[22]

Abot-physworld: Interactive world foundation model for robotic manipulation with physics alignment.arXiv preprint arXiv:2603.23376, 2026

Yuzhi Chen, Ronghan Chen, Dongjie Huo, Yandan Yang, Dekang Qi, Haoyun Liu, Tong Lin, Shuang Zeng, Junjin Xiao, Xinyuan Chang, et al. Abot-physworld: Interactive world foundation model for robotic manipulation with physics alignment.arXiv preprint arXiv:2603.23376, 2026

work page arXiv 2026
[23]

Wow: Towards a world omniscient world model through embodied interaction.arXiv preprint arXiv:2509.22642, 2025

Xiaowei Chi, Peidong Jia, Chun-Kai Fan, Xiaozhu Ju, Weishi Mi, Kevin Zhang, Zhiyuan Qin, Wanxin Tian, Kuangzhi Ge, Hao Li, et al. Wow: Towards a world omniscient world model through embodied interaction.arXiv preprint arXiv:2509.22642, 2025

work page arXiv 2025
[24]

Vidar: Embodied Video Diffusion Model for Generalist Manipulation

Yao Feng, Hengkai Tan, Xinyi Mao, Chendong Xiang, Guodong Liu, Shuhe Huang, Hang Su, and Jun Zhu. Vidar: Embodied video diffusion model for generalist manipulation.arXiv preprint arXiv:2507.12898, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[25]

Irasim: Learning interactive real-robot action simulators.arXiv preprint arXiv:2406.14540, 2024

Fangqi Zhu, Hongtao Wu, Song Guo, Yuxiao Liu, Chilam Cheang, and Tao Kong. Irasim: Learning interactive real-robot action simulators.arXiv preprint arXiv:2406.14540, 2024

work page arXiv 2024
[26]

ivideogpt: Interactive videogpts are scalable world models.Advances in Neural Information Processing Systems, 37:68082–68119, 2024

Jialong Wu, Shaofeng Yin, Ningya Feng, Xu He, Dong Li, Jianye Hao, and Mingsheng Long. ivideogpt: Interactive videogpts are scalable world models.Advances in Neural Information Processing Systems, 37:68082–68119, 2024

work page 2024
[27]

Motus: A Unified Latent Action World Model

Hongzhe Bi, Hengkai Tan, Shenghao Xie, Zeyuan Wang, Shuhe Huang, Haitian Liu, Ruowen Zhao, Yao Feng, Chendong Xiang, Yinze Rong, et al. Motus: A unified latent action world model.arXiv preprint arXiv:2512.13030, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[29]

Causal World Modeling for Robot Control

Lin Li, Qihang Zhang, Yiming Luo, Shuai Yang, Ruilin Wang, Fei Han, Mingrui Yu, Zelin Gao, Nan Xue, Xing Zhu, et al. Causal world modeling for robot control.arXiv preprint arXiv:2601.21998, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[30]

Fast-WAM: Do World Action Models Need Test-time Future Imagination?

Tianyuan Yuan, Zibin Dong, Yicheng Liu, and Hang Zhao. Fast-wam: Do world action models need test-time future imagination?arXiv preprint arXiv:2603.16666, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[31]

4dworldbench: A comprehensive evaluation framework for 3d/4d world generation models.arXiv preprint arXiv:2511.19836, 2025

Yiting Lu, Wei Luo, Peiyan Tu, Haoran Li, Hanxin Zhu, Zihao Yu, Xingrui Wang, Xinyi Chen, Xinge Peng, Xin Li, et al. 4dworldbench: A comprehensive evaluation framework for 3d/4d world generation models.arXiv preprint arXiv:2511.19836, 2025. 11

work page arXiv 2025
[32]

Genie Envisioner: A Unified World Foundation Platform for Robotic Manipulation

Yue Liao, Pengfei Zhou, Siyuan Huang, Donglin Yang, Shengcong Chen, Yuxin Jiang, Yue Hu, Jingbin Cai, Si Liu, Jianlan Luo, et al. Genie envisioner: A unified world foundation platform for robotic manipulation.arXiv preprint arXiv:2508.05635, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[33]

Gigaworld-0: World models as data engine to empower embodied ai.arXiv preprint arXiv:2511.19861, 2025

GigaWorld Team, Angen Ye, Boyuan Wang, Chaojun Ni, Guan Huang, Guosheng Zhao, Haoyun Li, Jiagang Zhu, Kerui Li, Mengyuan Xu, et al. Gigaworld-0: World models as data engine to empower embodied ai.arXiv preprint arXiv:2511.19861, 2025

work page arXiv 2025
[34]

Vtam: Video-tactile-action models for complex physical interaction beyond vlas.arXiv preprint arXiv:2603.23481, 2026

Haoran Yuan, Weigang Yi, Zhenyu Zhang, Wendi Chen, Yuchen Mo, Jiashi Yin, Xinzhuo Li, Xiangyu Zeng, Chuan Wen, Cewu Lu, Katherine Driggs-Campbell, and Ismini Lourentzou. Vtam: Video-tactile-action models for complex physical interaction beyond vlas.arXiv preprint arXiv:2603.23481, 2026

work page arXiv 2026
[35]

Visuo-tactile world models.arXiv preprint arXiv:2602.06001, 2026

Carolina Higuera, Sergio Arnaud, Byron Boots, Mustafa Mukadam, Francois Robert Hogan, and Franziska Meier. Visuo-tactile world models.arXiv preprint arXiv:2602.06001, 2026

work page arXiv 2026
[36]

Omnivta: Visuo- tactile world modeling for contact-rich robotic manipulation.arXiv preprint arXiv:2603.19201, 2026

Yuhang Zheng, Songen Gu, Weize Li, Yupeng Zheng, Yujie Zang, Shuai Tian, Xiang Li, Ce Hao, Chen Gao, Si Liu, Haoran Li, Yilun Chen, Shuicheng Yan, and Wenchao Ding. Omnivta: Visuo- tactile world modeling for contact-rich robotic manipulation.arXiv preprint arXiv:2603.19201, 2026

work page arXiv 2026
[37]

Vbench: Comprehensive benchmark suite for video generative models

Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, et al. Vbench: Comprehensive benchmark suite for video generative models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21807–21818, 2024

work page 2024
[38]

World Action Models are Zero-shot Policies

Seonghyeon Ye, Yunhao Ge, Kaiyuan Zheng, Shenyuan Gao, Sihyun Yu, George Kurian, Suneel Indupuru, You Liang Tan, Chuning Zhu, Jiannan Xiang, Ayaan Malik, Kyungmin Lee, William Liang, Nadun Ranawaka, Jiasheng Gu, Yinzhen Xu, Guanzhi Wang, Fengyuan Hu, Avnish Narayan, Johan Bjorck, Jing Wang, Gwanghyun Kim, Dantong Niu, Ruijie Zheng, Yuqi Xie, Jimmy Wu, Qi ...

work page internal anchor Pith review Pith/arXiv arXiv 2026
[39]

World-Env: Leveraging World Model as a Virtual Environment for VLA Post-Training

Junjin Xiao, Yandan Yang, Xinyuan Chang, Ronghan Chen, Feng Xiong, Mu Xu, Wei-Shi Zheng, and Qing Zhang. World-env: Leveraging world model as a virtual environment for vla post-training.arXiv preprint arXiv:2509.24948, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[40]

Roboscape-r: Unified reward-observation world models for generalizable robotics training via rl.arXiv preprint arXiv:2512.03556, 2025

Yinzhou Tang, Yu Shang, Yinuo Chen, Bingwen Wei, Xin Zhang, Shu’ang Yu, Liangzhi Shi, Chao Yu, Chen Gao, Wei Wu, et al. Roboscape-r: Unified reward-observation world models for generalizable robotics training via rl.arXiv preprint arXiv:2512.03556, 2025

work page arXiv 2025
[41]

WMPO: world model- based policy optimization for vision-language-action models.arXiv preprint arXiv:2511.09515, 2025

Fangqi Zhu, Zhengyang Yan, Zicong Hong, Quanxin Shou, Xiao Ma, and Song Guo. Wmpo: World model-based policy optimization for vision-language-action models.arXiv preprint arXiv:2511.09515, 2025

work page arXiv 2025
[42]

Zhao, Vikash Kumar, Sergey Levine, and Chelsea Finn

Tony Z. Zhao, Vikash Kumar, Sergey Levine, and Chelsea Finn. Learning fine-grained bimanual manipulation with low-cost hardware. InRobotics: Science and Systems (RSS), Daegu, Republic of Korea, July 2023

work page 2023
[43]

Jiang, S

Zhennan Jiang, Shangqing Zhou, Yutong Jiang, Zefang Huang, Mingjie Wei, Yuhui Chen, Tianxing Zhou, Zhen Guo, Hao Lin, Quanlu Zhang, et al. Wovr: World models as reliable simulators for post-training vla policies with rl.arXiv preprint arXiv:2602.13977, 2026

work page arXiv 2026
[44]

Open-Sora: Democratizing Efficient Video Production for All

Zangwei Zheng, Xiangyu Peng, Tianji Yang, Chenhui Shen, Shenggui Li, Hongxin Liu, Yukun Zhou, Tianyi Li, and Yang You. Open-sora: Democratizing efficient video production for all. arXiv preprint arXiv:2412.20404, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[45]

Irasim: A fine-grained world model for robot manipulation

Fangqi Zhu, Hongtao Wu, Song Guo, Yuxiao Liu, Chilam Cheang, and Tao Kong. Irasim: A fine-grained world model for robot manipulation. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 9834–9844, 2025. 12

work page 2025
[46]

World Simulation with Video Foundation Models for Physical AI

Arslan Ali, Junjie Bai, Maciej Bala, Yogesh Balaji, Aaron Blakeman, Tiffany Cai, Jiaxin Cao, Tianshi Cao, Elizabeth Cha, Yu-Wei Chao, et al. World simulation with video foundation models for physical ai.arXiv preprint arXiv:2511.00062, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[47]

Qwen Team. Qwen3. 5-omni technical report.arXiv preprint arXiv:2604.15804, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[48]

Rlinf-vla: A unified and efficient framework for vla+ rl training.arXiv preprint arXiv:2510.06710, 2025

Hongzhi Zang, Mingjie Wei, Si Xu, Yongji Wu, Zhen Guo, Yuanqing Wang, Hao Lin, Liangzhi Shi, Yuqing Xie, Zhexuan Xu, et al. Rlinf-vla: A unified and efficient framework for vla+ rl training.arXiv preprint arXiv:2510.06710, 2025

work page arXiv 2025
[49]

RoboTwin 2.0: A Scalable Data Generator and Benchmark with Strong Domain Randomization for Robust Bimanual Robotic Manipulation

Tianxing Chen, Zanxin Chen, Baijun Chen, Zijian Cai, Yibin Liu, Zixuan Li, Qiwei Liang, Xianliang Lin, Yiheng Ge, Zhenyu Gu, et al. Robotwin 2.0: A scalable data generator and benchmark with strong domain randomization for robust bimanual robotic manipulation.arXiv preprint arXiv:2506.18088, 2025

[50] Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Manuel Y. Galliker, Dibya Ghosh, Lachy Groom, Karol Hausman, Brian Ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Devin LeBlanc, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pertsc... 2025.

[51] Tenny Yin, Zhiting Mei, Zhonghe Zheng, Miyu Yamane, David Wang, Jade Sceats, Samuel M Bateman, Lihan Zha, Apurva Badithela, Ola Shorinwa, et al. Playworld: Learning robot world models from autonomous play. arXiv preprint arXiv:2603.09030, 2026.

[52] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.

A Platform Introduction

RoboTwin 2.0 is a scalable bimanual simulation environment comprising 731 objects ac...


[1] Jingtao Ding, Yunke Zhang, Yu Shang, Yuheng Zhang, Zefang Zong, Jie Feng, Yuan Yuan, Hongyuan Su, Nian Li, Nicholas Sukiennik, et al. Understanding world or predicting future? a comprehensive survey of world models. ACM Computing Surveys, 58(3):1–38, 2025.

[2] Yu Shang, Yinzhou Tang, Xin Zhang, Shengyuan Wang, Yuwei Yan, Honglin Zhang, Zhiheng Zheng, Jie Zhao, Jie Feng, Chen Gao, et al. A survey of embodied world models. 2026.

[3] Xiaoxiao Long, Qingrui Zhao, Kaiwen Zhang, Zihao Zhang, Dingrui Wang, Yumeng Liu, Zhengjie Shu, Yi Lu, Shouzheng Wang, Xinzhe Wei, et al. A survey: Learning embodied intelligence from physical simulators and world models. arXiv preprint arXiv:2507.00917, 2025.

[4] Dacheng Li, Yunhao Fang, Yukang Chen, Shuo Yang, Shiyi Cao, Justin Wong, Michael Luo, Xiaolong Wang, Hongxu Yin, Joseph E Gonzalez, et al. Worldmodelbench: Judging video generation models as world models. arXiv preprint arXiv:2502.20694, 2025.

[5] Haoyi Duan, Hong-Xing Yu, Sirui Chen, Li Fei-Fei, and Jiajun Wu. Worldscore: A unified evaluation benchmark for world generation. arXiv preprint arXiv:2504.00983, 2025.

[6] Hu Yue, Siyuan Huang, Yue Liao, Shengcong Chen, Pengfei Zhou, Liliang Chen, Maoqing Yao, and Guanghui Ren. Ewmbench: Evaluating scene, motion, and semantic quality in embodied world models. arXiv preprint arXiv:2505.09694, 2025.

[7] Yiran Qin, Zhelun Shi, Jiwen Yu, Xijun Wang, Enshen Zhou, Lijun Li, Zhenfei Yin, Xihui Liu, Lu Sheng, Jing Shao, et al. Worldsimbench: Towards video generation models as world simulators. arXiv preprint arXiv:2410.18072, 2024.

[8] Yaxuan Li, Yichen Zhu, Junjie Wen, Chaomin Shen, and Yi Xu. Worldeval: World model as real-world robot policies evaluator. arXiv preprint arXiv:2505.19017, 2025.

[9] Chun-Kai Fan, Xiaowei Chi, Xiaozhu Ju, Hao Li, Yong Bao, Yu-Kai Wang, Lizhang Chen, Zhiyuan Jiang, Kuangzhi Ge, Ying Li, et al. Wow, wo, val! a comprehensive embodied world model evaluation turing test. arXiv preprint arXiv:2601.04137, 2026.

[10] Yu Shang, Zhuohang Li, Yiding Ma, Weikang Su, Xin Jin, Ziyou Wang, Lei Jin, Xin Zhang, Yinzhou Tang, Haisheng Su, et al. Worldarena: A unified benchmark for evaluating perception and functional utility of embodied world models. arXiv preprint arXiv:2602.08971, 2026.

[11] Jiahan Zhang, Muqing Jiang, Nanru Dai, Taiming Lu, Arda Uzunoglu, Shunchi Zhang, Yana Wei, Jiahao Wang, Vishal M Patel, Paul Pu Liang, et al. World-in-world: World models in a closed-loop world. arXiv preprint arXiv:2510.18135, 2025.

[12] Baijun Chen, Weijie Wan, Tianxing Chen, Xianda Guo, Congsheng Xu, Yuanyang Qi, Haojie Zhang, Longyan Wu, Tianling Xu, Zixuan Li, et al. Univtac: A unified simulation platform for visuo-tactile manipulation data generation, learning, and benchmarking. arXiv preprint arXiv:2602.10093, 2026.

[13] Xinqing Li, Xin He, Le Zhang, Min Wu, Xiaoli Li, and Yun Liu. A comprehensive survey on world models for embodied ai. arXiv preprint arXiv:2510.16732, 2025.

[14] Danijar Hafner, Timothy P Lillicrap, Mohammad Norouzi, and Jimmy Ba. Mastering atari with discrete world models. In International Conference on Learning Representations.

[15] Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse control tasks through world models. Nature, pages 1–7, 2025.

[16] Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025.

[17] Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072, 2024.

[18] Zheng Zhu, Xiaofeng Wang, Wangbo Zhao, Chen Min, Bohan Li, Nianchen Deng, Min Dou, Yuqi Wang, Botian Shi, Kai Wang, et al. Is sora a world simulator? a comprehensive survey on general world models and beyond. arXiv preprint arXiv:2405.03520, 2024.

[19] Jinwei Gu. Cosmos world foundation models for physical ai. In Proceedings of the 3rd International Workshop on Rich Media With Generative AI, pages 39–39, 2025.

[20] Yu Shang, Xin Zhang, Yinzhou Tang, Lei Jin, Chen Gao, Wei Wu, and Yong Li. Roboscape: Physics-informed embodied world model. arXiv preprint arXiv:2506.23135, 2025.

[21] Yanjiang Guo, Lucy Xiaoyang Shi, Jianyu Chen, and Chelsea Finn. Ctrl-world: A controllable generative world model for robot manipulation. arXiv preprint arXiv:2510.10125, 2025.

[22] Yuzhi Chen, Ronghan Chen, Dongjie Huo, Yandan Yang, Dekang Qi, Haoyun Liu, Tong Lin, Shuang Zeng, Junjin Xiao, Xinyuan Chang, et al. Abot-physworld: Interactive world foundation model for robotic manipulation with physics alignment. arXiv preprint arXiv:2603.23376, 2026.

[23] Xiaowei Chi, Peidong Jia, Chun-Kai Fan, Xiaozhu Ju, Weishi Mi, Kevin Zhang, Zhiyuan Qin, Wanxin Tian, Kuangzhi Ge, Hao Li, et al. Wow: Towards a world omniscient world model through embodied interaction. arXiv preprint arXiv:2509.22642, 2025.

[24] Yao Feng, Hengkai Tan, Xinyi Mao, Chendong Xiang, Guodong Liu, Shuhe Huang, Hang Su, and Jun Zhu. Vidar: Embodied video diffusion model for generalist manipulation. arXiv preprint arXiv:2507.12898, 2025.

[25] Fangqi Zhu, Hongtao Wu, Song Guo, Yuxiao Liu, Chilam Cheang, and Tao Kong. Irasim: Learning interactive real-robot action simulators. arXiv preprint arXiv:2406.14540, 2024.

[26] Jialong Wu, Shaofeng Yin, Ningya Feng, Xu He, Dong Li, Jianye Hao, and Mingsheng Long. ivideogpt: Interactive videogpts are scalable world models. Advances in Neural Information Processing Systems, 37:68082–68119, 2024.

[27] Hongzhe Bi, Hengkai Tan, Shenghao Xie, Zeyuan Wang, Shuhe Huang, Haitian Liu, Ruowen Zhao, Yao Feng, Chendong Xiang, Yinze Rong, et al. Motus: A unified latent action world model. arXiv preprint arXiv:2512.13030, 2025.

[29] Lin Li, Qihang Zhang, Yiming Luo, Shuai Yang, Ruilin Wang, Fei Han, Mingrui Yu, Zelin Gao, Nan Xue, Xing Zhu, et al. Causal world modeling for robot control. arXiv preprint arXiv:2601.21998, 2026.

[30] Tianyuan Yuan, Zibin Dong, Yicheng Liu, and Hang Zhao. Fast-wam: Do world action models need test-time future imagination? arXiv preprint arXiv:2603.16666, 2026.

[31] Yiting Lu, Wei Luo, Peiyan Tu, Haoran Li, Hanxin Zhu, Zihao Yu, Xingrui Wang, Xinyi Chen, Xinge Peng, Xin Li, et al. 4dworldbench: A comprehensive evaluation framework for 3d/4d world generation models. arXiv preprint arXiv:2511.19836, 2025.

[32] Yue Liao, Pengfei Zhou, Siyuan Huang, Donglin Yang, Shengcong Chen, Yuxin Jiang, Yue Hu, Jingbin Cai, Si Liu, Jianlan Luo, et al. Genie envisioner: A unified world foundation platform for robotic manipulation. arXiv preprint arXiv:2508.05635, 2025.

[33] GigaWorld Team, Angen Ye, Boyuan Wang, Chaojun Ni, Guan Huang, Guosheng Zhao, Haoyun Li, Jiagang Zhu, Kerui Li, Mengyuan Xu, et al. Gigaworld-0: World models as data engine to empower embodied ai. arXiv preprint arXiv:2511.19861, 2025.

[34] Haoran Yuan, Weigang Yi, Zhenyu Zhang, Wendi Chen, Yuchen Mo, Jiashi Yin, Xinzhuo Li, Xiangyu Zeng, Chuan Wen, Cewu Lu, Katherine Driggs-Campbell, and Ismini Lourentzou. Vtam: Video-tactile-action models for complex physical interaction beyond vlas. arXiv preprint arXiv:2603.23481, 2026.

[35] Carolina Higuera, Sergio Arnaud, Byron Boots, Mustafa Mukadam, Francois Robert Hogan, and Franziska Meier. Visuo-tactile world models. arXiv preprint arXiv:2602.06001, 2026.

[36] Yuhang Zheng, Songen Gu, Weize Li, Yupeng Zheng, Yujie Zang, Shuai Tian, Xiang Li, Ce Hao, Chen Gao, Si Liu, Haoran Li, Yilun Chen, Shuicheng Yan, and Wenchao Ding. Omnivta: Visuo-tactile world modeling for contact-rich robotic manipulation. arXiv preprint arXiv:2603.19201, 2026.

[37] Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, et al. Vbench: Comprehensive benchmark suite for video generative models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21807–21818, 2024.

[38] Seonghyeon Ye, Yunhao Ge, Kaiyuan Zheng, Shenyuan Gao, Sihyun Yu, George Kurian, Suneel Indupuru, You Liang Tan, Chuning Zhu, Jiannan Xiang, Ayaan Malik, Kyungmin Lee, William Liang, Nadun Ranawaka, Jiasheng Gu, Yinzhen Xu, Guanzhi Wang, Fengyuan Hu, Avnish Narayan, Johan Bjorck, Jing Wang, Gwanghyun Kim, Dantong Niu, Ruijie Zheng, Yuqi Xie, Jimmy Wu, Qi ... World Action Models are Zero-shot Policies. arXiv, 2026.

[39] Junjin Xiao, Yandan Yang, Xinyuan Chang, Ronghan Chen, Feng Xiong, Mu Xu, Wei-Shi Zheng, and Qing Zhang. World-env: Leveraging world model as a virtual environment for vla post-training. arXiv preprint arXiv:2509.24948, 2025.

[40] Yinzhou Tang, Yu Shang, Yinuo Chen, Bingwen Wei, Xin Zhang, Shu’ang Yu, Liangzhi Shi, Chao Yu, Chen Gao, Wei Wu, et al. Roboscape-r: Unified reward-observation world models for generalizable robotics training via rl. arXiv preprint arXiv:2512.03556, 2025.

[41] Fangqi Zhu, Zhengyang Yan, Zicong Hong, Quanxin Shou, Xiao Ma, and Song Guo. Wmpo: World model-based policy optimization for vision-language-action models. arXiv preprint arXiv:2511.09515, 2025.

[42] Tony Z. Zhao, Vikash Kumar, Sergey Levine, and Chelsea Finn. Learning fine-grained bimanual manipulation with low-cost hardware. In Robotics: Science and Systems (RSS), Daegu, Republic of Korea, July 2023.
