arxiv: 2606.11129 · v2 · pith:YZIX6QSEnew · submitted 2026-06-09 · 💻 cs.CV

WorldOlympiad: Can Your World Model Survive a Triathlon?

Pith reviewed 2026-06-27 13:03 UTC · model grok-4.3

classification 💻 cs.CV
keywords world models · video generation · benchmark · physical reasoning · geometric consistency · interaction fidelity · generative evaluation

The pith

WorldOlympiad is a benchmark that splits world-model evaluation into physical rules, 3D geometry, and long-horizon interaction tracks to expose failures missed by visual-quality tests.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents WorldOlympiad as a three-track benchmark that measures whether generated videos obey mechanics and material rules, maintain consistent 3D structure across views, and follow complex action sequences over extended rollouts. Existing tests focus on appearance or short clips and therefore miss whether models have built usable internal simulations of the world. By applying the benchmark to current models across gaming, robotics, and open-world video, the work shows clear shortfalls in all three areas and argues that future progress requires evaluation protocols built around these specific properties rather than generic video metrics.

Core claim

WorldOlympiad decomposes evaluation into a physical track that uses segmentation and MLLM judgments on mechanics, heat, and materials; a geometry track that reconstructs scenes with Gaussian splatting to check cross-view and camera consistency; and an interaction track that scores adherence to action prompts across consecutive video segments. The benchmark spans gaming, robotics, and real-world scenarios. Experiments on state-of-the-art models find substantial gaps in physical faithfulness, geometric coherence, and sustained interaction control.
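The three-track decomposition can be pictured as a small scorecard structure. The sketch below is illustrative only: the track, sub-dimension, and scenario names are taken from the summary above, but the `Scorecard` class, the 0 to 1 score scale, and the unweighted averaging are assumptions, not the paper's reported scoring scheme.

```python
from dataclasses import dataclass, field
from statistics import mean

# Names follow the summary above; the [0, 1] scale and unweighted
# averaging are assumptions made for this sketch.
TRACKS = {
    "physical": ("mechanics", "thermal", "material"),
    "geometry": ("structural", "cross_view", "camera_trajectory"),
    "interaction": ("action_following", "chunk_transition"),
}
SCENARIOS = ("gaming", "robotics", "real_world")


@dataclass
class Scorecard:
    """Hypothetical per-model scores keyed by (scenario, track, dimension)."""
    model: str
    scores: dict = field(default_factory=dict)

    def add(self, scenario: str, track: str, dim: str, value: float) -> None:
        if scenario not in SCENARIOS:
            raise ValueError(f"unknown scenario: {scenario}")
        if dim not in TRACKS.get(track, ()):
            raise ValueError(f"unknown track/dimension: {track}/{dim}")
        if not 0.0 <= value <= 1.0:
            raise ValueError("score must lie in [0, 1]")
        self.scores.setdefault((scenario, track, dim), []).append(value)

    def track_mean(self, track: str) -> float:
        """Unweighted mean over every score recorded for one track."""
        values = [v for key, vs in self.scores.items() if key[1] == track for v in vs]
        if not values:
            raise ValueError(f"no scores recorded for track: {track}")
        return mean(values)

    def report(self) -> dict:
        """Report tracks separately so a strong track cannot mask a weak one."""
        seen = {key[1] for key in self.scores}
        return {t: self.track_mean(t) for t in TRACKS if t in seen}


card = Scorecard("model-a")  # placeholder model name
card.add("gaming", "physical", "mechanics", 0.4)
card.add("robotics", "physical", "thermal", 0.6)
card.add("gaming", "geometry", "cross_view", 0.8)
print(card.report())
```

Keeping the per-track means separate, rather than collapsing them into one number, mirrors the benchmark's stated aim of exposing failures that a single aggregate quality score would hide.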

What carries the argument

WorldOlympiad benchmark with its physical track (object segmentation plus MLLM judge), geometry track (Gaussian splatting reconstruction), and interaction track (action-prompt following across chunks).

If this is right

  • World-model development should optimize directly for the three measured properties rather than proxy metrics such as visual fidelity alone.
  • Evaluation protocols for generative models will need to include explicit physical, geometric, and long-horizon interaction tests to be considered complete.
  • Downstream applications in robotics and gaming will continue to show control failures until models close the gaps identified on the interaction track.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • A model that passes the geometry track may still fail downstream 3D tasks if the Gaussian reconstruction only captures surface appearance rather than underlying structure.
  • The benchmark could be extended by adding quantitative physics simulators as ground truth to reduce reliance on MLLM judgment.
  • If the three tracks prove additive, training objectives that jointly optimize all three could produce more general world models than single-objective approaches.

Load-bearing premise

The measurements produced by MLLM judges and Gaussian splatting reconstructions accurately reflect the intended physical and geometric properties without introducing their own systematic biases.
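One common way to probe this premise for the MLLM judge is to compare its verdicts with human labels on the same clips and compute a chance-corrected agreement statistic such as Cohen's kappa. The sketch below assumes hypothetical pass/fail labels; the paper's abstract reports no such calibration data, and the function names are illustrative.

```python
from collections import Counter


def cohens_kappa(judge, human):
    """Chance-corrected agreement between two equal-length label sequences."""
    if len(judge) != len(human) or not judge:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(judge)
    # Fraction of clips on which the two raters agree.
    observed = sum(j == h for j, h in zip(judge, human)) / n
    # Agreement expected by chance from each rater's label frequencies.
    judge_freq, human_freq = Counter(judge), Counter(human)
    expected = sum(judge_freq[c] * human_freq[c] for c in judge_freq) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


def pass_rate_gap(judge, human, positive="pass"):
    """Judge pass rate minus human pass rate; a nonzero gap hints at systematic bias."""
    n = len(judge)
    return judge.count(positive) / n - human.count(positive) / n


# Hypothetical labels for four clips (not data from the paper).
judge_labels = ["pass", "pass", "fail", "fail"]
human_labels = ["pass", "fail", "fail", "fail"]
print(cohens_kappa(judge_labels, human_labels))   # 0.5
print(pass_rate_gap(judge_labels, human_labels))  # 0.25
```

Kappa measures how often the judge and humans agree beyond chance, while the pass-rate gap indicates whether the judge is systematically more lenient or stricter than human raters.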

What would settle it

A follow-up study that applies the same three tracks to the same models and obtains high scores across all dimensions with no additional training would falsify the reported gaps.

original abstract

We introduce WorldOlympiad, a benchmark for diagnosing video-based world models across physical faithfulness, geometric consistency, and interaction fidelity. While existing benchmarks often focus on visual quality, semantic alignment, or short-term temporal coherence, they provide limited insight into whether generated videos obey physical rules, preserve coherent 3D structure, and sustain controllable interactions over long horizons. To address this gap, WorldOlympiad decomposes world-model evaluation into three complementary dimensions. The physical track uses object segmentation and MLLM-as-judge to assess whether generated videos follow interpretable rules in mechanics, thermal phenomena, and material properties. The geometry track reconstructs generated videos with Gaussian splatting and evaluates structural consistency, cross-view coherence, and camera-trajectory alignment. The interaction track assesses whether generated rollouts follow complex action prompts and maintain smooth, coherent transitions across consecutive video chunks. WorldOlympiad further covers three major downstream scenarios, including gaming, robotics, and general real-world videos, capturing diverse challenges from interactive control and embodied manipulation to open-domain motion and camera dynamics. Together, these tracks and scenarios form a scalable and interpretable evaluation suite that exposes failure modes beyond generic video quality. Experiments on state-of-the-art models reveal substantial gaps in physical reasoning, 3D consistency, and long-horizon interaction, underscoring the need for more structured evaluation protocols for generative world models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces WorldOlympiad, a benchmark decomposing world-model evaluation into physical faithfulness (object segmentation + MLLM-as-judge on mechanics/thermal/material rules), geometric consistency (Gaussian splatting reconstruction for structural/cross-view/camera alignment), and interaction fidelity (action-prompt following and chunk transitions). It targets three scenarios (gaming, robotics, real-world videos) and claims that experiments on SOTA models expose substantial gaps in physical reasoning, 3D consistency, and long-horizon interaction, motivating more structured protocols beyond visual or semantic quality metrics.

Significance. If the proposed tracks prove reliable after validation, the benchmark could usefully expose failure modes in generative world models that current short-term coherence or visual-quality tests miss. The decomposition into interpretable dimensions and the coverage of downstream scenarios are constructive steps, but the absence of any reported metrics, calibration data, or error analysis in the manuscript limits its current contribution.

major comments (3)
  1. [Physical track] Physical track description (abstract and methods): the MLLM-as-judge proxy for mechanics, thermal, and material rules is presented without any calibration against human labels or synthetic ground-truth violations; this is load-bearing because the central claim of 'substantial gaps in physical reasoning' rests directly on these scores.
  2. [Geometry track] Geometry track description (abstract and methods): Gaussian splatting reconstruction is used to measure structural/cross-view/camera consistency with no reported analysis of its known failure modes on dynamic scenes or low-coherence video; this directly affects the reliability of the '3D consistency' gap claim.
  3. [Experiments] Experiments section (abstract): the manuscript states that 'experiments on state-of-the-art models reveal substantial gaps' yet supplies no quantitative metrics, model identifiers, error bars, or comparison tables, preventing assessment of whether the observed differences exceed metric noise.
minor comments (2)
  1. [Abstract] The abstract and introduction would benefit from explicit definitions of the three downstream scenarios and how the tracks map onto them.
  2. [Introduction] Notation for the three tracks is introduced without a summary table; adding one would improve readability.
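Major comment 3 asks whether observed differences exceed metric noise. One hedged way to check this is a percentile bootstrap confidence interval on the score gap between two models. The sketch below uses made-up per-seed scores; the function name, the resample count, and the 95% level are assumptions, not details from the manuscript.

```python
import random
from statistics import mean


def bootstrap_gap_ci(scores_a, scores_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(scores_a) - mean(scores_b)."""
    if not scores_a or not scores_b:
        raise ValueError("both score lists must be non-empty")
    rng = random.Random(seed)  # fixed seed keeps the interval reproducible
    gaps = []
    for _ in range(n_resamples):
        # Resample each model's scores with replacement, same sample size.
        resample_a = rng.choices(scores_a, k=len(scores_a))
        resample_b = rng.choices(scores_b, k=len(scores_b))
        gaps.append(mean(resample_a) - mean(resample_b))
    gaps.sort()
    lo = gaps[int((alpha / 2) * n_resamples)]
    hi = gaps[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# Hypothetical per-seed track scores for two models (not from the paper).
model_a = [0.70, 0.72, 0.68, 0.71, 0.69]
model_b = [0.50, 0.52, 0.48, 0.51, 0.49]
lo, hi = bootstrap_gap_ci(model_a, model_b)
print(f"95% CI for the gap: [{lo:.3f}, {hi:.3f}]")
```

If the interval excludes zero, the gap is unlikely to be explained by seed-to-seed variation alone. With only a handful of seeds the interval is coarse, so this is a sanity check rather than a substitute for the requested significance tests.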

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback, which identifies key areas for strengthening the reliability of the proposed benchmark. We agree that additional validation and reporting are needed to support the claims regarding gaps in world models. The revised manuscript will incorporate calibration studies, failure-mode analyses, and expanded experimental details as outlined below.

point-by-point responses
  1. Referee: [Physical track] Physical track description (abstract and methods): the MLLM-as-judge proxy for mechanics, thermal, and material rules is presented without any calibration against human labels or synthetic ground-truth violations; this is load-bearing because the central claim of 'substantial gaps in physical reasoning' rests directly on these scores.

    Authors: We agree that calibration is necessary for the MLLM-as-judge to robustly support claims of physical reasoning gaps. The manuscript presents the MLLM judge as a scalable proxy drawing on prior VLM evaluation practices, but we will revise to add a validation subsection. This will report agreement with human labels on a held-out set of 200 clips (with inter-annotator agreement) and performance on synthetic videos containing controlled rule violations (e.g., inverted gravity or melting objects). Results will appear in the methods and experiments sections of the revision.
    revision: yes

  2. Referee: [Geometry track] Geometry track description (abstract and methods): Gaussian splatting reconstruction is used to measure structural/cross-view/camera consistency with no reported analysis of its known failure modes on dynamic scenes or low-coherence video; this directly affects the reliability of the '3D consistency' gap claim.

    Authors: We acknowledge that Gaussian splatting can exhibit reconstruction artifacts on dynamic or low-coherence videos, potentially affecting geometry track reliability. The revision will add an explicit analysis of these failure modes, including per-video reconstruction quality metrics (e.g., novel-view PSNR) stratified by scene dynamics, discussion of impact on consistency scores, and mitigation approaches such as quality thresholding or supplementary 2D metrics. These will be reported in the geometry track description and experiments.
    revision: yes

  3. Referee: [Experiments] Experiments section (abstract): the manuscript states that 'experiments on state-of-the-art models reveal substantial gaps' yet supplies no quantitative metrics, model identifiers, error bars, or comparison tables, preventing assessment of whether the observed differences exceed metric noise.

    Authors: The full manuscript contains experimental results across models and scenarios, but we agree the presentation requires greater transparency to evaluate the gap claims. The revision will expand the experiments section with explicit model identifiers and versions, complete comparison tables, error bars (standard deviation over multiple seeds or generations), and statistical significance tests. This will enable readers to assess whether differences exceed metric variability.
    revision: yes
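The geometry-track mitigation described in response 2 (per-video PSNR, stratified by scene dynamics, with quality thresholding) can be sketched as follows. This is a minimal illustration under stated assumptions: pixels are flat lists of floats in [0, 1], the 20 dB cutoff is a placeholder rather than a value from the paper, and the video records and the `stratify_and_filter` helper are hypothetical.

```python
import math
from collections import defaultdict


def psnr(reference, rendered, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists."""
    if len(reference) != len(rendered) or not reference:
        raise ValueError("need two non-empty pixel lists of equal length")
    mse = sum((r - p) ** 2 for r, p in zip(reference, rendered)) / len(reference)
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def stratify_and_filter(videos, min_psnr=20.0):
    """Group video ids by dynamics label, split by reconstruction quality.

    min_psnr is an assumed placeholder threshold, not a reported setting.
    """
    kept, dropped = defaultdict(list), defaultdict(list)
    for video in videos:
        bucket = kept if video["psnr"] >= min_psnr else dropped
        bucket[video["dynamics"]].append(video["id"])
    return dict(kept), dict(dropped)


# Hypothetical records: held-out frame vs. novel-view render per video.
videos = [
    {"id": "v1", "dynamics": "static", "psnr": psnr([0.5] * 4, [0.52] * 4)},
    {"id": "v2", "dynamics": "dynamic", "psnr": psnr([0.5] * 4, [0.9] * 4)},
]
kept, dropped = stratify_and_filter(videos)
print(kept)     # {'static': ['v1']}
print(dropped)  # {'dynamic': ['v2']}
```

Reporting how many videos fall below the threshold in each dynamics stratum would show whether low consistency scores stem from the generated video or from the reconstruction step itself.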

Circularity Check

0 steps flagged

No circularity: benchmark is an external evaluation protocol

full rationale

The paper defines three evaluation tracks (physical via segmentation+MLLM judge, geometry via Gaussian splatting, interaction via action prompts) and applies them empirically to SOTA models. No equations, fitted parameters, self-citations, or derivations are used to generate the reported gaps; the results are direct measurements on held-out generated videos. This is a standard benchmark paper with no load-bearing self-referential steps.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The abstract states no explicit free parameters or invented entities. The two axioms listed below are implicit: evaluation relies on external components (MLLM judges, Gaussian splatting) whose internal assumptions are unstated.

axioms (2)
  • domain assumption: MLLM-as-judge accurately assesses physical faithfulness without bias
    Invoked for the physical track evaluation.
  • domain assumption: Gaussian splatting reconstruction yields reliable measures of structural consistency
    Invoked for the geometry track.

pith-pipeline@v0.9.1-grok · 5805 in / 1344 out tokens · 31630 ms · 2026-06-27T13:03:50.157202+00:00 · methodology

