WorldOlympiad: Can Your World Model Survive a Triathlon?

Yuke Zhao, Wangbo Zhao, Weijie Wang, Zeyu Zhang, Dakai An, Akide Liu, Yinghao Yu, Jiasheng Tang, Fan Wang, Wei Wang, Bohan Zhuang

arxiv: 2606.11129 · v2 · pith:YZIX6QSEnew · submitted 2026-06-09 · 💻 cs.CV

Pith reviewed 2026-06-27 13:03 UTC · model grok-4.3

classification 💻 cs.CV

keywords: world models · video generation · benchmark · physical reasoning · geometric consistency · interaction fidelity · generative evaluation

The pith

WorldOlympiad is a benchmark that splits world-model evaluation into physical rules, 3D geometry, and long-horizon interaction tracks to expose failures missed by visual-quality tests.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents WorldOlympiad as a three-track benchmark that measures whether generated videos obey mechanics and material rules, maintain consistent 3D structure across views, and follow complex action sequences over extended rollouts. Existing tests focus on appearance or short clips and therefore miss whether models have built usable internal simulations of the world. By applying the benchmark to current models across gaming, robotics, and open-world video, the work shows clear shortfalls in all three areas and argues that future progress requires evaluation protocols built around these specific properties rather than generic video metrics.

Core claim

WorldOlympiad decomposes evaluation into a physical track that uses segmentation and MLLM judgments on mechanics, heat, and materials; a geometry track that reconstructs scenes with Gaussian splatting to check cross-view and camera consistency; and an interaction track that scores adherence to action prompts across consecutive video segments. The benchmark spans gaming, robotics, and real-world scenarios. Experiments on state-of-the-art models find substantial gaps in physical faithfulness, geometric coherence, and sustained interaction control.

What carries the argument

WorldOlympiad benchmark with its physical track (object segmentation plus MLLM judge), geometry track (Gaussian splatting reconstruction), and interaction track (action-prompt following across chunks).

If this is right

World-model development should optimize directly for the three measured properties rather than proxy metrics such as visual fidelity alone.
Evaluation protocols for generative models will need to include explicit physical, geometric, and long-horizon interaction tests to be considered complete.
Downstream applications in robotics and gaming will continue to show control failures until models close the gaps identified on the interaction track.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

A model that passes the geometry track may still fail downstream 3D tasks if the Gaussian reconstruction only captures surface appearance rather than underlying structure.
The benchmark could be extended by adding quantitative physics simulators as ground truth to reduce reliance on MLLM judgment.
If the three tracks prove additive, training objectives that jointly optimize all three could produce more general world models than single-objective approaches.

Load-bearing premise

The measurements produced by MLLM judges and Gaussian splatting reconstructions accurately reflect the intended physical and geometric properties without introducing their own systematic biases.
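
One way to probe this premise for the physical track is to compare the judge's verdicts with human labels on the same clips and report chance-corrected agreement. The sketch below computes Cohen's kappa in plain Python. The function name, the "pass"/"fail" label set, and the example verdicts are illustrative assumptions, not data or code from the paper.

```python
from collections import Counter


def cohens_kappa(judge, human):
    """Chance-corrected agreement between two equal-length label sequences."""
    if len(judge) != len(human) or not judge:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(judge)
    # Fraction of clips where the two raters give the same label.
    observed = sum(j == h for j, h in zip(judge, human)) / n
    # Agreement expected by chance from each rater's label frequencies.
    judge_freq = Counter(judge)
    human_freq = Counter(human)
    expected = sum(judge_freq[c] * human_freq[c] for c in judge_freq) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical verdicts on eight clips (not results from the paper).
judge_verdicts = ["pass", "pass", "fail", "fail", "pass", "fail", "pass", "pass"]
human_verdicts = ["pass", "fail", "fail", "fail", "pass", "fail", "pass", "pass"]
kappa = cohens_kappa(judge_verdicts, human_verdicts)
print(f"Cohen's kappa: {kappa:.2f}")
```

A kappa near 1 would suggest the judge tracks human assessments closely; a value near 0 would mean its agreement with humans is no better than chance, which would undercut the physical-track scores.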

What would settle it

A follow-up study that applies the same three tracks to the same models and obtains high scores across all dimensions with no additional training would falsify the reported gaps.

Original abstract

We introduce WorldOlympiad, a benchmark for diagnosing video-based world models across physical faithfulness, geometric consistency, and interaction fidelity. While existing benchmarks often focus on visual quality, semantic alignment, or short-term temporal coherence, they provide limited insight into whether generated videos obey physical rules, preserve coherent 3D structure, and sustain controllable interactions over long horizons. To address this gap, WorldOlympiad decomposes world-model evaluation into three complementary dimensions. The physical track uses object segmentation and MLLM-as-judge to assess whether generated videos follow interpretable rules in mechanics, thermal phenomena, and material properties. The geometry track reconstructs generated videos with Gaussian splatting and evaluates structural consistency, cross-view coherence, and camera-trajectory alignment. The interaction track assesses whether generated rollouts follow complex action prompts and maintain smooth, coherent transitions across consecutive video chunks. WorldOlympiad further covers three major downstream scenarios, including gaming, robotics, and general real-world videos, capturing diverse challenges from interactive control and embodied manipulation to open-domain motion and camera dynamics. Together, these tracks and scenarios form a scalable and interpretable evaluation suite that exposes failure modes beyond generic video quality. Experiments on state-of-the-art models reveal substantial gaps in physical reasoning, 3D consistency, and long-horizon interaction, underscoring the need for more structured evaluation protocols for generative world models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

WorldOlympiad proposes a three-track benchmark for physical, geometric, and interaction fidelity in video world models, but the abstract supplies no metrics, validation, or results to back the claimed gaps.

The core idea is a benchmark that splits world-model testing into physical faithfulness via segmentation and MLLM judgment, geometric consistency via Gaussian splatting reconstruction, and interaction fidelity via action-prompt following across gaming, robotics, and real-world clips. This decomposition targets failure modes that standard visual-quality metrics overlook, and the triathlon framing is a clean way to organize the evaluation.

The paper does a reasonable job naming the limitations of existing benchmarks and sketching how each track could expose specific weaknesses. The downstream scenarios are sensible choices for robotics and simulation users.

The main weakness is the complete absence of numbers, implementation details, or calibration. The abstract states that experiments on SOTA models show substantial gaps, yet it reports nothing on how the MLLM prompts were written, whether they were checked against human labels, or how Gaussian splatting handles dynamic scenes. The concern that both proxies (the MLLM judge and the Gaussian splatting reconstruction) carry unvalidated biases is therefore hard to dismiss; without that evidence the reported gaps remain unconvincing. Nor does any error analysis or comparison to simpler baselines appear.

This is aimed at groups already working on generative world models who need more diagnostic tools than FID or short-clip coherence scores. A reader focused on embodied AI or simulation might borrow the track structure even if they redesign the actual scorers.

It should go to peer review so the methods and any actual data can be examined; the benchmark concept is worth developing but currently rests on unshown execution.

Referee Report

3 major / 2 minor

Summary. The paper introduces WorldOlympiad, a benchmark decomposing world-model evaluation into physical faithfulness (object segmentation + MLLM-as-judge on mechanics/thermal/material rules), geometric consistency (Gaussian splatting reconstruction for structural/cross-view/camera alignment), and interaction fidelity (action-prompt following and chunk transitions). It targets three scenarios (gaming, robotics, real-world videos) and claims that experiments on SOTA models expose substantial gaps in physical reasoning, 3D consistency, and long-horizon interaction, motivating more structured protocols beyond visual or semantic quality metrics.

Significance. If the proposed tracks prove reliable after validation, the benchmark could usefully expose failure modes in generative world models that current short-term coherence or visual-quality tests miss. The decomposition into interpretable dimensions and coverage of downstream scenarios is a constructive step, but the absence of any reported metrics, calibration data, or error analysis in the manuscript limits its current contribution.

major comments (3)

[Physical track] Physical track description (abstract and methods): the MLLM-as-judge proxy for mechanics, thermal, and material rules is presented without any calibration against human labels or synthetic ground-truth violations; this is load-bearing because the central claim of 'substantial gaps in physical reasoning' rests directly on these scores.
[Geometry track] Geometry track description (abstract and methods): Gaussian splatting reconstruction is used to measure structural/cross-view/camera consistency with no reported analysis of its known failure modes on dynamic scenes or low-coherence video; this directly affects the reliability of the '3D consistency' gap claim.
[Experiments] Experiments section (abstract): the manuscript states that 'experiments on state-of-the-art models reveal substantial gaps' yet supplies no quantitative metrics, model identifiers, error bars, or comparison tables, preventing assessment of whether the observed differences exceed metric noise.
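
The geometry-track concern can be made concrete with a reconstruction-quality gate: before trusting a consistency score, check how well the reconstruction reproduces a held-out frame. The sketch below computes PSNR over flattened pixel lists and filters out videos whose reconstruction falls below a threshold. The function names, the 8-bit pixel range, and the 25 dB threshold are assumptions chosen for illustration, not values from the paper.

```python
import math


def psnr(reference, rendered, max_value=255.0):
    """Peak signal-to-noise ratio (dB) between two flattened pixel sequences."""
    if len(reference) != len(rendered) or not reference:
        raise ValueError("need two non-empty sequences of equal length")
    mse = sum((a - b) ** 2 for a, b in zip(reference, rendered)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def reliable_videos(psnr_by_video, threshold_db=25.0):
    """Keep only videos whose held-out-view PSNR meets the (assumed) threshold."""
    return [name for name, value in psnr_by_video.items() if value >= threshold_db]


# Hypothetical held-out frame and its rendering from the reconstruction.
held_out = [100, 150, 200, 250]
rendering = [102, 148, 203, 247]
quality = psnr(held_out, rendering)
kept = reliable_videos({"clip_a": quality, "clip_b": 18.0})
print(f"PSNR: {quality:.2f} dB, kept: {kept}")
```

Stratifying such a quality measure by scene dynamics would show whether low geometry scores reflect the generated video or a reconstruction that failed for unrelated reasons.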

minor comments (2)

[Abstract] The abstract and introduction would benefit from explicit definitions of the three downstream scenarios and how the tracks map onto them.
[Introduction] Notation for the three tracks is introduced without a summary table; adding one would improve readability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback, which identifies key areas for strengthening the reliability of the proposed benchmark. We agree that additional validation and reporting are needed to support the claims regarding gaps in world models. The revised manuscript will incorporate calibration studies, failure-mode analyses, and expanded experimental details as outlined below.

Referee: [Physical track] Physical track description (abstract and methods): the MLLM-as-judge proxy for mechanics, thermal, and material rules is presented without any calibration against human labels or synthetic ground-truth violations; this is load-bearing because the central claim of 'substantial gaps in physical reasoning' rests directly on these scores.

Authors: We agree that calibration is necessary for the MLLM-as-judge to robustly support claims of physical reasoning gaps. The manuscript presents the MLLM judge as a scalable proxy drawing on prior VLM evaluation practices, but we will revise to add a validation subsection. This will report agreement with human labels on a held-out set of 200 clips (with inter-annotator agreement) and performance on synthetic videos containing controlled rule violations (e.g., inverted gravity or melting objects). Results will appear in the methods and experiments sections of the revision. revision: yes

Referee: [Geometry track] Geometry track description (abstract and methods): Gaussian splatting reconstruction is used to measure structural/cross-view/camera consistency with no reported analysis of its known failure modes on dynamic scenes or low-coherence video; this directly affects the reliability of the '3D consistency' gap claim.

Authors: We acknowledge that Gaussian splatting can exhibit reconstruction artifacts on dynamic or low-coherence videos, potentially affecting geometry track reliability. The revision will add an explicit analysis of these failure modes, including per-video reconstruction quality metrics (e.g., novel-view PSNR) stratified by scene dynamics, discussion of impact on consistency scores, and mitigation approaches such as quality thresholding or supplementary 2D metrics. These will be reported in the geometry track description and experiments.revision: yes

Referee: [Experiments] Experiments section (abstract): the manuscript states that 'experiments on state-of-the-art models reveal substantial gaps' yet supplies no quantitative metrics, model identifiers, error bars, or comparison tables, preventing assessment of whether the observed differences exceed metric noise.

Authors: The full manuscript contains experimental results across models and scenarios, but we agree the presentation requires greater transparency to evaluate the gap claims. The revision will expand the experiments section with explicit model identifiers and versions, complete comparison tables, error bars (standard deviation over multiple seeds or generations), and statistical significance tests. This will enable readers to assess whether differences exceed metric variability. revision: yes
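
The reporting promised here (error bars over seeds plus a significance check) can be sketched with the standard library alone. The example below summarizes per-seed scores as mean and sample standard deviation and computes Welch's t statistic for the gap between two models. The model scores are made-up placeholders and the function names are hypothetical; the paper does not specify which test it will use.

```python
import math
import statistics


def summarize(scores):
    """Return (mean, sample standard deviation) for per-seed scores."""
    return statistics.mean(scores), statistics.stdev(scores)


def welch_t(a, b):
    """Welch's t statistic for the difference in means of two score lists."""
    mean_a, sd_a = summarize(a)
    mean_b, sd_b = summarize(b)
    standard_error = math.sqrt(sd_a ** 2 / len(a) + sd_b ** 2 / len(b))
    if standard_error == 0:
        raise ValueError("both samples have zero variance")
    return (mean_a - mean_b) / standard_error


# Hypothetical track scores over five seeds per model (not from the paper).
model_a = [0.62, 0.60, 0.64, 0.61, 0.63]
model_b = [0.50, 0.52, 0.48, 0.51, 0.49]

mean_a, sd_a = summarize(model_a)
mean_b, sd_b = summarize(model_b)
t_stat = welch_t(model_a, model_b)
print(f"A: {mean_a:.3f} ± {sd_a:.3f}  B: {mean_b:.3f} ± {sd_b:.3f}  t = {t_stat:.1f}")
```

Welch's test is used in this sketch because it does not assume equal variances across models; a bootstrap or permutation test would be a reasonable alternative when only a few seeds are available.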

Circularity Check

0 steps flagged

No circularity: benchmark is an external evaluation protocol

The paper defines three evaluation tracks (physical via segmentation+MLLM judge, geometry via Gaussian splatting, interaction via action prompts) and applies them empirically to SOTA models. No equations, fitted parameters, self-citations, or derivations are used to generate the reported gaps; the results are direct measurements on held-out generated videos. This is a standard benchmark paper with no load-bearing self-referential steps.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The abstract provides no explicit free parameters, axioms, or invented entities; evaluation relies on external components (MLLM judges, Gaussian splatting) whose internal assumptions are unstated.

axioms (2)

domain assumption: MLLM-as-judge accurately assesses physical faithfulness without bias. Invoked for the physical track evaluation.
domain assumption: Gaussian splatting reconstruction yields reliable measures of structural consistency. Invoked for the geometry track.

Reference graph

Works this paper leans on

55 extracted references · 24 linked inside Pith

[1]

Cosmos world foundation model platform for physical ai

Niket Agarwal, Arslan Ali, Maciej Bala, Yogesh Balaji, Erik Barker, Tiffany Cai, Prithvijit Chattopadhyay, Yongxin Chen, Yin Cui, Yifan Ding, et al. Cosmos world foundation model platform for physical ai. arXiv preprint arXiv:2501.03575, 2025

Pith/arXiv arXiv 2025
[2]

World simulation with video foundation models for physical ai

Arslan Ali, Junjie Bai, Maciej Bala, Yogesh Balaji, Aaron Blakeman, Tiffany Cai, Jiaxin Cao, Tianshi Cao, Elizabeth Cha, Yu-Wei Chao, et al. World simulation with video foundation models for physical ai. arXiv preprint arXiv:2511.00062, 2025

Pith/arXiv arXiv 2025
[3]

Stable video diffusion: Scaling latent video diffusion models to large datasets.arXiv preprint arXiv:2311.15127, 2023

Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets.arXiv preprint arXiv:2311.15127, 2023

Pith/arXiv arXiv 2023
[4]

Video generation models as world simulators.OpenAI Blog, 1(8):1, 2024

Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Leo Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, et al. Video generation models as world simulators.OpenAI Blog, 1(8):1, 2024

2024
[5]

Sam 3: Segment anything with concepts.arXiv preprint arXiv:2511.16719, 2025

Nicolas Carion, Laura Gustafson, Yuan-Ting Hu, Shoubhik Debnath, Ronghang Hu, Didac Suris, Chaitanya Ryali, Kalyan Vasudev Alwala, Haitham Khedr, Andrew Huang, et al. Sam 3: Segment anything with concepts.arXiv preprint arXiv:2511.16719, 2025

Pith/arXiv arXiv 2025
[6]

Gamegen-x: Interactive open-world game video generation

Haoxuan Che, Xuanhua He, Quande Liu, Cheng Jin, and Hao Chen. Gamegen-x: Interactive open-world game video generation. InInternational Conference on Learning Representations, volume 2025, pages 37546–37593, 2025

2025
[7]

Wow: Towards a world omniscient world model through embodied interaction.arXiv preprint arXiv:2509.22642, 2025

Xiaowei Chi, Peidong Jia, Chun-Kai Fan, Xiaozhu Ju, Weishi Mi, Kevin Zhang, Zhiyuan Qin, Wanxin Tian, Kuangzhi Ge, Hao Li, et al. Wow: Towards a world omniscient world model through embodied interaction.arXiv preprint arXiv:2509.22642, 2025

arXiv 2025
[8]

Gemini 3 pro model card, 2025

Google DeepMind. Gemini 3 pro model card, 2025

2025
[9]

Rethinking video generation model for the embodied world.arXiv preprint arXiv:2601.15282, 2026

Yufan Deng, Zilin Pan, Hongyu Zhang, Xiaojie Li, Ruoqing Hu, Yufei Ding, Yiming Zou, Yan Zeng, and Daquan Zhou. Rethinking video generation model for the embodied world.arXiv preprint arXiv:2601.15282, 2026

arXiv 2026
[10]

Worldscore: A unified evaluation benchmark for world generation

Haoyi Duan, Hong-Xing Yu, Sirui Chen, Li Fei-Fei, and Jiajun Wu. Worldscore: A unified evaluation benchmark for world generation. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 27713–27724, 2025

2025
[11]

Matrix-game 2.0: An open-source real-time and streaming interactive world model.arXiv preprint arXiv:2508.13009, 2025

Xianglong He, Chunli Peng, Zexiang Liu, Boyang Wang, Yifan Zhang, Qi Cui, Fei Kang, Biao Jiang, Mengyin An, Yangyang Ren, et al. Matrix-game 2.0: An open-source real-time and streaming interactive world model.arXiv preprint arXiv:2508.13009, 2025

Pith/arXiv arXiv 2025
[12]

Ewmbench: Evaluating scene, motion, and semantic quality in embodied world models

Yue Hu, Siyuan Huang, Yue Liao, Shengcong Chen, Pengfei Zhou, Liliang Chen, Maoqing Yao, and Guanghui Ren. Ewmbench: Evaluating scene, motion, and semantic quality in embodied world models. arXiv preprint arXiv:2505.09694, 2025

arXiv 2025
[13]

Memory forcing: Spatio-temporal memory for consistent scene generation on minecraft.arXiv preprint arXiv:2510.03198, 2025

Junchao Huang, Xinting Hu, Boyao Han, Shaoshuai Shi, Zhuotao Tian, Tianyu He, and Li Jiang. Memory forcing: Spatio-temporal memory for consistent scene generation on minecraft.arXiv preprint arXiv:2510.03198, 2025

arXiv 2025
[14]

Self forcing: Bridging the train-test gap in autoregressive video diffusion.arXiv preprint arXiv:2506.08009, 2025

Xun Huang, Zhengqi Li, Guande He, Mingyuan Zhou, and Eli Shechtman. Self forcing: Bridging the train-test gap in autoregressive video diffusion.arXiv preprint arXiv:2506.08009, 2025

Pith/arXiv arXiv 2025
[15]

Vbench: Comprehensive benchmark suite for video generative models

Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, et al. Vbench: Comprehensive benchmark suite for video generative models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21807–21818, 2024. 13

2024
[16]

Vbench++: Comprehensive and versatile benchmark suite for video generative models.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

Ziqi Huang, Fan Zhang, Xiaojie Xu, Yinan He, Jiashuo Yu, Ziyue Dong, Qianli Ma, Nattapol Chanpaisit, Chenyang Si, Yuming Jiang, et al. Vbench++: Comprehensive and versatile benchmark suite for video generative models.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

2025
[17]

Hunyuanvideo: A systematic framework for large video generative models.arXiv preprint arXiv:2412.03603, 2024

Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. Hunyuanvideo: A systematic framework for large video generative models.arXiv preprint arXiv:2412.03603, 2024

Pith/arXiv arXiv 2024
[18]

Worldmodelbench: Judging video generation models as world models.arXiv preprint arXiv:2502.20694, 2025

Dacheng Li, Yunhao Fang, Yukang Chen, Shuo Yang, Shiyi Cao, Justin Wong, Michael Luo, Xiaolong Wang, Hongxu Yin, Joseph E Gonzalez, et al. Worldmodelbench: Judging video generation models as world models.arXiv preprint arXiv:2502.20694, 2025

arXiv 2025
[19]

Vmem: Consistent interactive video scene generation with surfel-indexed view memory

Runjia Li, Philip Torr, Andrea Vedaldi, and Tomas Jakab. Vmem: Consistent interactive video scene generation with surfel-indexed view memory. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 25690–25699, 2025

2025
[20]

Worldeval: World model as real-world robot policies evaluator, 2025

Yaxuan Li, Yichen Zhu, Junjie Wen, Chaomin Shen, and Yi Xu. Worldeval: World model as real-world robot policies evaluator, 2025

2025
[21]

Depth anything 3: Recovering the visual space from any views.arXiv preprint arXiv:2511.10647, 2025

Haotong Lin, Sili Chen, Junhao Liew, Donny Y Chen, Zhenyu Li, Guang Shi, Jiashi Feng, and Bingyi Kang. Depth anything 3: Recovering the visual space from any views.arXiv preprint arXiv:2511.10647, 2025

Pith/arXiv arXiv 2025
[22]

Rolling forcing: Autoregressive long video diffusion in real time.arXiv preprint arXiv:2509.25161, 2025

Kunhao Liu, Wenbo Hu, Jiale Xu, Ying Shan, and Shijian Lu. Rolling forcing: Autoregressive long video diffusion in real time.arXiv preprint arXiv:2509.25161, 2025

Pith/arXiv arXiv 2025
[23]

Rise-video: Can video generators decode implicit world rules? arXiv preprint arXiv:2602.05986, 2026

Mingxin Liu, Shuran Ma, Shibei Meng, Xiangyu Zhao, Zicheng Zhang, Shaofeng Zhang, Zhihang Zhong, Peixian Chen, Haoyu Cao, Xing Sun, et al. Rise-video: Can video generators decode implicit world rules? arXiv preprint arXiv:2602.05986, 2026

arXiv 2026
[24]

Realwonder: Real-time physical action-conditioned video generation.arXiv preprint arXiv:2603.05449, 2026

Wei Liu, Ziyu Chen, Zizhang Li, Yue Wang, Hong-Xing Yu, and Jiajun Wu. Realwonder: Real-time physical action-conditioned video generation.arXiv preprint arXiv:2603.05449, 2026

arXiv 2026
[25]

Yume-1.5: A text-controlled interactive world generation model.arXiv preprint arXiv:2512.22096, 2025

Xiaofeng Mao, Zhen Li, Chuanhao Li, Xiaojie Xu, Kaining Ying, Tong He, Jiangmiao Pang, Yu Qiao, and Kaipeng Zhang. Yume-1.5: A text-controlled interactive world generation model.arXiv preprint arXiv:2512.22096, 2025

arXiv 2025
[26]

Yume: An interactive world generation model.arXiv preprint arXiv:2507.17744, 2025

Xiaofeng Mao, Shaoheng Lin, Zhen Li, Chuanhao Li, Wenshuo Peng, Tong He, Jiangmiao Pang, Mingmin Chi, Yu Qiao, and Kaipeng Zhang. Yume: An interactive world generation model.arXiv preprint arXiv:2507.17744, 2025

arXiv 2025
[27]

Worldsimbench: Towards video generation models as world simulators.arXiv preprint arXiv:2410.18072, 2024

Yiran Qin, Zhelun Shi, Jiwen Yu, Xijun Wang, Enshen Zhou, Lijun Li, Zhenfei Yin, Xihui Liu, Lu Sheng, Jing Shao, et al. Worldsimbench: Towards video generation models as world simulators.arXiv preprint arXiv:2410.18072, 2024

arXiv 2024
[28]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. InInternational conference on machine learning, pages 8748–8763. PmLR, 2021

2021
[29]

Worldarena: A unified benchmark for evaluating perception and functional utility of embodied world models.arXiv preprint arXiv:2602.08971, 2026

Yu Shang, Zhuohang Li, Yiding Ma, Weikang Su, Xin Jin, Ziyou Wang, Lei Jin, Xin Zhang, Yinzhou Tang, Haisheng Su, et al. Worldarena: A unified benchmark for evaluating perception and functional utility of embodied world models.arXiv preprint arXiv:2602.08971, 2026

arXiv 2026
[30]

Motionstream: Real-time video generation with interactive motion controls.arXiv preprint arXiv:2511.01266, 2025

Joonghyuk Shin, Zhengqi Li, Richard Zhang, Jun-Yan Zhu, Jaesik Park, Eli Shechtman, and Xun Huang. Motionstream: Real-time video generation with interactive motion controls.arXiv preprint arXiv:2511.01266, 2025

arXiv 2025
[31]

Worldplay: Towards long-term geometric consistency for real-time interactive world modeling.arXiv preprint arXiv:2512.14614, 2025

Wenqiang Sun, Haiyu Zhang, Haoyuan Wang, Junta Wu, Zehan Wang, Zhenwei Wang, Yunhong Wang, Jun Zhang, Tengfei Wang, and Chunchao Guo. Worldplay: Towards long-term geometric consistency for real-time interactive world modeling.arXiv preprint arXiv:2512.14614, 2025. 14

Pith/arXiv arXiv 2025
[32]

Openworldlib: A unified codebase and definition of advanced world models.arXiv preprint arXiv:2604.04707, 2026

DataFlow Team, Bohan Zeng, Daili Hua, Kaixin Zhu, Yifan Dai, Bozhou Li, Yuran Wang, Chengzhuo Tong, Yifan Yang, Mingkun Chang, et al. Openworldlib: A unified codebase and definition of advanced world models.arXiv preprint arXiv:2604.04707, 2026

Pith/arXiv arXiv 2026
[33]

Inspatio-world: A real-time 4d world simulator via spatiotemporal autoregressive modeling.arXiv preprint arXiv:2604.07209, 2026

InSpatio Team, Donghui Shen, Guofeng Zhang, Haomin Liu, Haoyu Ji, Hujun Bao, Hongjia Zhai, Jialin Liu, Jing Guo, Nan Wang, et al. Inspatio-world: A real-time 4d world simulator via spatiotemporal autoregressive modeling.arXiv preprint arXiv:2604.07209, 2026

Pith/arXiv arXiv 2026
[34]

Longcat-video technical report.arXiv preprint arXiv:2510.22200, 2025

Meituan LongCat Team, Xunliang Cai, Qilong Huang, Zhuoliang Kang, Hongyu Li, Shijun Liang, Liya Ma, Siyu Ren, Xiaoming Wei, Rixu Xie, et al. Longcat-video technical report.arXiv preprint arXiv:2510.22200, 2025

arXiv 2025
[35]

Advancing open-source world models.arXiv preprint arXiv:2601.20540, 2026

Robbyant Team, Zelin Gao, Qiuyu Wang, Yanhong Zeng, Jiapeng Zhu, Ka Leong Cheng, Yixuan Li, Hanlin Wang, Yinghao Xu, Shuailei Ma, et al. Advancing open-source world models.arXiv preprint arXiv:2601.20540, 2026

Pith/arXiv arXiv 2026
[36]

Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 2025

Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 2025

Pith/arXiv arXiv 2025
[37]

World-r1: Reinforcing 3d constraints for text-to-video generation

Weijie Wang, Xiaoxuan He, Youping Gu, Yifan Yang, Zeyu Zhang, Yefei He, Yanbo Ding, Xirui Hu, Donny Y Chen, Zhiyuan He, et al. World-r1: Reinforcing 3d constraints for text-to-video generation. arXiv preprint arXiv:2604.24764, 2026

Pith/arXiv arXiv 2026
[38]

Chen, Yuqing Yang, and Bohan Zhuang

Weijie Wang, Haoyu Zhao, Yifan Yang, Feng Chen, Zeyu Zhang, Yefei He, Zicheng Duan, Donny Y. Chen, Yuqing Yang, and Bohan Zhuang. Latent spatial memory for video world models.arXiv preprint arXiv:2606.09828, 2026

Pith/arXiv arXiv 2026
[39]

Video models are zero-shot learners and reasoners.arXiv preprint arXiv:2509.20328, 2025

Thaddäus Wiedemer, Yuxuan Li, Paul Vicol, Shixiang Shane Gu, Nick Matarese, Kevin Swersky, Been Kim, Priyank Jaini, and Robert Geirhos. Video models are zero-shot learners and reasoners.arXiv preprint arXiv:2509.20328, 2025

Pith/arXiv arXiv 2025
[40]

Infinite-world: Scaling interactive world models to 1000-frame horizons via pose-free hierarchical memory.arXiv preprint arXiv:2602.02393, 2026

Ruiqi Wu, Xuanhua He, Meng Cheng, Tianyu Yang, Yong Zhang, Zhuoliang Kang, Xunliang Cai, Xiaoming Wei, Chunle Guo, Chongyi Li, et al. Infinite-world: Scaling interactive world models to 1000-frame horizons via pose-free hierarchical memory.arXiv preprint arXiv:2602.02393, 2026

arXiv 2026
[41]

Robocoin: An open-sourced bimanual robotic data collection for integrated manipulation.arXiv preprint arXiv:2511.17441, 2025

Shihan Wu, Xuecheng Liu, Shaoxuan Xie, Pengwei Wang, Xinghang Li, Bowen Yang, Zhe Li, Kai Zhu, Hongyu Wu, Yiheng Liu, et al. Robocoin: An open-sourced bimanual robotic data collection for integrated manipulation.arXiv preprint arXiv:2511.17441, 2025

Pith/arXiv arXiv 2025
[42]

Video world models with long-term spatial memory.arXiv preprint arXiv:2506.05284, 2025

Tong Wu, Shuai Yang, Ryan Po, Yinghao Xu, Ziwei Liu, Dahua Lin, and Gordon Wetzstein. Video world models with long-term spatial memory.arXiv preprint arXiv:2506.05284, 2025

arXiv 2025
[43]

Worldmem: Long-term consistent world simulation with memory.arXiv preprint arXiv:2504.12369, 2025

Zeqi Xiao, Yushi Lan, Yifan Zhou, Wenqi Ouyang, Shuai Yang, Yanhong Zeng, and Xingang Pan. Worldmem: Long-term consistent world simulation with memory.arXiv preprint arXiv:2504.12369, 2025

arXiv 2025
[44]

Lvd-2m: A long-take video dataset with temporally dense captions.Advances in Neural Information Processing Systems, 37:16623–16644, 2024

Tianwei Xiong, Yuqing Wang, Daquan Zhou, Zhijie Lin, Jiashi Feng, and Xihui Liu. Lvd-2m: A long-take video dataset with temporally dense captions.Advances in Neural Information Processing Systems, 37:16623–16644, 2024

2024
[45]

Longlive: Real-time interactive long video generation.arXiv preprint arXiv:2509.22622, 2025

Shuai Yang, Wei Huang, Ruihang Chu, Yicheng Xiao, Yuyang Zhao, Xianbang Wang, Muyang Li, Enze Xie, Yingcong Chen, Yao Lu, et al. Longlive: Real-time interactive long video generation.arXiv preprint arXiv:2509.22622, 2025

Pith/arXiv arXiv 2025
[46]

Cogvideox: Text-to-video diffusion models with an expert transformer.arXiv preprint arXiv:2408.06072, 2024

Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer.arXiv preprint arXiv:2408.06072, 2024

Pith/arXiv arXiv 2024
[47]

Mind: Benchmarking memory consistency and action control in world models.arXiv preprint arXiv:2602.08025, 2026

Yixuan Ye, Xuanyu Lu, Yuxin Jiang, Yuchao Gu, Rui Zhao, Qiwei Liang, Jiachun Pan, Fengda Zhang, Weijia Wu, and Alex Jinpeng Wang. Mind: Benchmarking memory consistency and action control in world models.arXiv preprint arXiv:2602.08025, 2026. 15

arXiv 2026
[48]

From slow bidirectional to fast autoregressive video diffusion models

Tianwei Yin, Qiang Zhang, Richard Zhang, William T Freeman, Fredo Durand, Eli Shechtman, and Xun Huang. From slow bidirectional to fast autoregressive video diffusion models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22963–22974, 2025

2025
[49]

Wbench: A comprehensive multi-turn benchmark for interactive video world model evaluation.arXiv preprint arXiv:2605.25874, 2026

Kaining Ying, Hengrui Hu, Siyu Ren, Jiamu Li, Fengjiao Chen, Ziwen Wang, Xuezhi Cao, Xunliang Cai, and Henghui Ding. Wbench: A comprehensive multi-turn benchmark for interactive video world model evaluation.arXiv preprint arXiv:2605.25874, 2026

Pith/arXiv arXiv 2026
[50]

Mosaicmem: Hybrid spatial memory for controllable video world models.arXiv preprint arXiv:2603.17117, 2026

Wei Yu, Runjia Qian, Yumeng Li, Liquan Wang, Songheng Yin, Dennis Anthony, Yang Ye, Yidi Li, Weiwei Wan, Animesh Garg, et al. Mosaicmem: Hybrid spatial memory for controllable video world models.arXiv preprint arXiv:2603.17117, 2026

arXiv 2026
[51] Yifan Zhang, Chunli Peng, Boyang Wang, Puyi Wang, Qingcheng Zhu, Fei Kang, Biao Jiang, Zedong Gao, Eric Li, Yang Liu, et al. Matrix-game: Interactive world foundation model. arXiv preprint arXiv:2506.18701, 2025.

[52] Zeyu Zhang, Shuning Chang, Yuanyu He, Yizeng Han, Jiasheng Tang, Fan Wang, and Bohan Zhuang. Blockvid: Block diffusion for high-quality and consistent minute-long video generation. arXiv preprint arXiv:2511.22973, 2025.

[53] Jinjing Zhao, Fangyun Wei, Zhening Liu, Hongyang Zhang, Chang Xu, and Yan Lu. Spatia: Video generation with updatable spatial memory. arXiv preprint arXiv:2512.15716, 2025.

[54] Dian Zheng, Ziqi Huang, Hongbo Liu, Kai Zou, Yinan He, Fan Zhang, Lulu Gu, Yuanhan Zhang, Jingwen He, Wei-Shi Zheng, et al. Vbench-2.0: Advancing video generation benchmark suite for intrinsic faithfulness. arXiv preprint arXiv:2503.21755, 2025.

[55] Hongzhou Zhu, Min Zhao, Guande He, Hang Su, Chongxuan Li, and Jun Zhu. Causal forcing: Autoregressive diffusion distillation done right for high-quality real-time interactive video generation. arXiv preprint arXiv:2602.02214, 2026.

A WorldOlympiad Judge Prompt Templates

The prompt templates below cover dynamic-object extraction, physical consistency, ...

