WorldOlympiad: Can Your World Model Survive a Triathlon?
Pith reviewed 2026-06-27 13:03 UTC · model grok-4.3
The pith
WorldOlympiad is a benchmark that splits world-model evaluation into physical rules, 3D geometry, and long-horizon interaction tracks to expose failures missed by visual-quality tests.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
WorldOlympiad decomposes evaluation into a physical track that uses segmentation and MLLM judgments on mechanics, heat, and materials; a geometry track that reconstructs scenes with Gaussian splatting to check cross-view and camera consistency; and an interaction track that scores adherence to action prompts across consecutive video segments. The benchmark spans gaming, robotics, and real-world scenarios. Experiments on state-of-the-art models find substantial gaps in physical faithfulness, geometric coherence, and sustained interaction control.
What carries the argument
WorldOlympiad benchmark with its physical track (object segmentation plus MLLM judge), geometry track (Gaussian splatting reconstruction), and interaction track (action-prompt following across chunks).
If this is right
- World-model development should optimize directly for the three measured properties rather than proxy metrics such as visual fidelity alone.
- Evaluation protocols for generative models will need to include explicit physical, geometric, and long-horizon interaction tests to be considered complete.
- Downstream applications in robotics and gaming will continue to show control failures until models close the gaps identified on the interaction track.
Where Pith is reading between the lines
- A model that passes the geometry track may still fail downstream 3D tasks if the Gaussian reconstruction only captures surface appearance rather than underlying structure.
- The benchmark could be extended by adding quantitative physics simulators as ground truth to reduce reliance on MLLM judgment.
- If the three tracks prove additive, training objectives that jointly optimize all three could produce more general world models than single-objective approaches.
Load-bearing premise
The measurements produced by MLLM judges and Gaussian splatting reconstructions accurately reflect the intended physical and geometric properties without introducing their own systematic biases.
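The judge half of this premise can be probed with a chance-corrected agreement score between judge verdicts and human labels, in line with the calibration the simulated rebuttal proposes. The sketch below computes Cohen's kappa; the function name `cohens_kappa`, the pass/fail label set, and the eight example verdicts are illustrative assumptions and are not taken from the paper.

```python
from collections import Counter


def cohens_kappa(judge_labels, human_labels):
    """Cohen's kappa between two equal-length sequences of categorical labels."""
    if len(judge_labels) != len(human_labels) or not judge_labels:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(judge_labels)
    # Fraction of clips on which the two raters agree.
    observed = sum(j == h for j, h in zip(judge_labels, human_labels)) / n
    # Agreement expected by chance from each rater's marginal label frequencies.
    judge_freq = Counter(judge_labels)
    human_freq = Counter(human_labels)
    expected = sum(judge_freq[c] * human_freq[c] for c in judge_freq) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used the same single label on every clip
    return (observed - expected) / (1.0 - expected)


# Hypothetical verdicts on eight clips (not data from the paper).
judge = ["pass", "fail", "fail", "pass", "fail", "pass", "fail", "fail"]
human = ["pass", "fail", "pass", "pass", "fail", "pass", "fail", "pass"]
kappa = cohens_kappa(judge, human)
print(f"kappa = {kappa:.3f}")
```

A kappa near 1 would suggest the judge tracks human labels well beyond chance, while a value near 0 would suggest its verdicts carry little information about the labelled property. The same function could be applied to pairs of human annotators to report inter-annotator agreement.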
What would settle it
A follow-up study that applies the same three tracks to the same models and obtains high scores across all dimensions with no additional training would falsify the reported gaps.
Original abstract
We introduce WorldOlympiad, a benchmark for diagnosing video-based world models across physical faithfulness, geometric consistency, and interaction fidelity. While existing benchmarks often focus on visual quality, semantic alignment, or short-term temporal coherence, they provide limited insight into whether generated videos obey physical rules, preserve coherent 3D structure, and sustain controllable interactions over long horizons. To address this gap, WorldOlympiad decomposes world-model evaluation into three complementary dimensions. The physical track uses object segmentation and MLLM-as-judge to assess whether generated videos follow interpretable rules in mechanics, thermal phenomena, and material properties. The geometry track reconstructs generated videos with Gaussian splatting and evaluates structural consistency, cross-view coherence, and camera-trajectory alignment. The interaction track assesses whether generated rollouts follow complex action prompts and maintain smooth, coherent transitions across consecutive video chunks. WorldOlympiad further covers three major downstream scenarios, including gaming, robotics, and general real-world videos, capturing diverse challenges from interactive control and embodied manipulation to open-domain motion and camera dynamics. Together, these tracks and scenarios form a scalable and interpretable evaluation suite that exposes failure modes beyond generic video quality. Experiments on state-of-the-art models reveal substantial gaps in physical reasoning, 3D consistency, and long-horizon interaction, underscoring the need for more structured evaluation protocols for generative world models.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces WorldOlympiad, a benchmark decomposing world-model evaluation into physical faithfulness (object segmentation + MLLM-as-judge on mechanics/thermal/material rules), geometric consistency (Gaussian splatting reconstruction for structural/cross-view/camera alignment), and interaction fidelity (action-prompt following and chunk transitions). It targets three scenarios (gaming, robotics, real-world videos) and claims that experiments on SOTA models expose substantial gaps in physical reasoning, 3D consistency, and long-horizon interaction, motivating more structured protocols beyond visual or semantic quality metrics.
Significance. If the proposed tracks prove reliable after validation, the benchmark could usefully expose failure modes in generative world models that current short-term coherence or visual-quality tests miss. The decomposition into interpretable dimensions and the coverage of downstream scenarios are constructive steps, but the absence of any reported metrics, calibration data, or error analysis in the manuscript limits its current contribution.
major comments (3)
- [Physical track] Physical track description (abstract and methods): the MLLM-as-judge proxy for mechanics, thermal, and material rules is presented without any calibration against human labels or synthetic ground-truth violations; this is load-bearing because the central claim of 'substantial gaps in physical reasoning' rests directly on these scores.
- [Geometry track] Geometry track description (abstract and methods): Gaussian splatting reconstruction is used to measure structural/cross-view/camera consistency with no reported analysis of its known failure modes on dynamic scenes or low-coherence video; this directly affects the reliability of the '3D consistency' gap claim.
- [Experiments] Experiments section (abstract): the manuscript states that 'experiments on state-of-the-art models reveal substantial gaps' yet supplies no quantitative metrics, model identifiers, error bars, or comparison tables, preventing assessment of whether the observed differences exceed metric noise.
minor comments (2)
- [Abstract] The abstract and introduction would benefit from explicit definitions of the three downstream scenarios and how the tracks map onto them.
- [Introduction] Notation for the three tracks is introduced without a summary table; adding one would improve readability.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback, which identifies key areas for strengthening the reliability of the proposed benchmark. We agree that additional validation and reporting are needed to support the claims regarding gaps in world models. The revised manuscript will incorporate calibration studies, failure-mode analyses, and expanded experimental details as outlined below.
Point-by-point responses
- Referee: [Physical track] Physical track description (abstract and methods): the MLLM-as-judge proxy for mechanics, thermal, and material rules is presented without any calibration against human labels or synthetic ground-truth violations; this is load-bearing because the central claim of 'substantial gaps in physical reasoning' rests directly on these scores.
  Authors: We agree that calibration is necessary for the MLLM-as-judge to robustly support claims of physical reasoning gaps. The manuscript presents the MLLM judge as a scalable proxy drawing on prior VLM evaluation practices, but we will revise to add a validation subsection. This will report agreement with human labels on a held-out set of 200 clips (with inter-annotator agreement) and performance on synthetic videos containing controlled rule violations (e.g., inverted gravity or melting objects). Results will appear in the methods and experiments sections of the revision.
  Revision: yes
- Referee: [Geometry track] Geometry track description (abstract and methods): Gaussian splatting reconstruction is used to measure structural/cross-view/camera consistency with no reported analysis of its known failure modes on dynamic scenes or low-coherence video; this directly affects the reliability of the '3D consistency' gap claim.
  Authors: We acknowledge that Gaussian splatting can exhibit reconstruction artifacts on dynamic or low-coherence videos, potentially affecting geometry track reliability. The revision will add an explicit analysis of these failure modes, including per-video reconstruction quality metrics (e.g., novel-view PSNR) stratified by scene dynamics, discussion of impact on consistency scores, and mitigation approaches such as quality thresholding or supplementary 2D metrics. These will be reported in the geometry track description and experiments.
  Revision: yes
- Referee: [Experiments] Experiments section (abstract): the manuscript states that 'experiments on state-of-the-art models reveal substantial gaps' yet supplies no quantitative metrics, model identifiers, error bars, or comparison tables, preventing assessment of whether the observed differences exceed metric noise.
  Authors: The full manuscript contains experimental results across models and scenarios, but we agree the presentation requires greater transparency to evaluate the gap claims. The revision will expand the experiments section with explicit model identifiers and versions, complete comparison tables, error bars (standard deviation over multiple seeds or generations), and statistical significance tests. This will enable readers to assess whether differences exceed metric variability.
  Revision: yes
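The error bars and significance tests promised in the last response could take roughly the following form. This is a minimal sketch, assuming each model has one benchmark score per seed; the choice of an exact permutation test on the difference in means, the helper names, and the scores are all illustrative assumptions, since the rebuttal does not say which test the authors will use.

```python
from itertools import combinations
from statistics import mean, stdev


def summarize(scores):
    """Return (mean, sample standard deviation) over per-seed scores."""
    return mean(scores), stdev(scores)


def exact_permutation_pvalue(a, b):
    """Two-sided exact permutation test on the difference in group means."""
    pooled = list(a) + list(b)
    n_a, n_b = len(a), len(b)
    total = sum(pooled)
    observed = abs(mean(a) - mean(b))
    extreme = 0
    splits = 0
    # Enumerate every way of relabelling n_a of the pooled scores as group A.
    for idx in combinations(range(len(pooled)), n_a):
        sum_a = sum(pooled[i] for i in idx)
        diff = abs(sum_a / n_a - (total - sum_a) / n_b)
        splits += 1
        # Small tolerance so floating-point ties count as "at least as extreme".
        if diff >= observed - 1e-12:
            extreme += 1
    return extreme / splits


# Hypothetical per-seed scores for two models (not results from the paper).
model_a = [0.62, 0.65, 0.60, 0.63, 0.64]
model_b = [0.48, 0.52, 0.50, 0.47, 0.51]

mean_a, std_a = summarize(model_a)
mean_b, std_b = summarize(model_b)
p_value = exact_permutation_pvalue(model_a, model_b)
print(f"A: {mean_a:.3f} +/- {std_a:.3f}  B: {mean_b:.3f} +/- {std_b:.3f}  p = {p_value:.4f}")
```

Exhaustive enumeration is practical only for a handful of seeds per model; with larger samples a random subset of relabellings would be the usual substitute.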
Circularity Check
No circularity: benchmark is an external evaluation protocol
Full rationale
The paper defines three evaluation tracks (physical via segmentation+MLLM judge, geometry via Gaussian splatting, interaction via action prompts) and applies them empirically to SOTA models. No equations, fitted parameters, self-citations, or derivations are used to generate the reported gaps; the results are direct measurements on held-out generated videos. This is a standard benchmark paper with no load-bearing self-referential steps.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: MLLM-as-judge accurately assesses physical faithfulness without bias
- domain assumption: Gaussian splatting reconstruction yields reliable measures of structural consistency
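The second assumption could be checked with a reconstruction-quality gate of the kind the rebuttal mentions (novel-view PSNR, stratified by scene dynamics, with quality thresholding). The sketch below is a hedged illustration: the record fields, the `gate_and_stratify` helper, the 20 dB threshold, and the example values are hypothetical and do not come from the paper.

```python
import math
from statistics import mean


def psnr(reference, rendered, max_value=1.0):
    """PSNR in dB between two equal-length flat sequences of pixel intensities."""
    if len(reference) != len(rendered) or not reference:
        raise ValueError("images must be non-empty and of equal size")
    mse = sum((a - b) ** 2 for a, b in zip(reference, rendered)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def gate_and_stratify(records, min_psnr_db):
    """Average consistency per dynamics label, skipping poorly reconstructed videos."""
    kept = {}
    dropped = 0
    for rec in records:
        if rec["psnr_db"] < min_psnr_db:
            dropped += 1  # reconstruction too poor for its score to be trusted
            continue
        kept.setdefault(rec["dynamics"], []).append(rec["consistency"])
    return {label: mean(scores) for label, scores in kept.items()}, dropped


# Hypothetical per-video records (not data from the paper).
records = [
    {"dynamics": "static", "psnr_db": 27.5, "consistency": 0.8},
    {"dynamics": "static", "psnr_db": 24.0, "consistency": 0.6},
    {"dynamics": "dynamic", "psnr_db": 21.0, "consistency": 0.5},
    {"dynamics": "dynamic", "psnr_db": 14.0, "consistency": 0.1},
]
MIN_PSNR_DB = 20.0  # hypothetical threshold, chosen only for illustration

stratified, dropped = gate_and_stratify(records, MIN_PSNR_DB)
example_psnr = psnr([0.0, 0.5, 1.0, 0.5], [0.1, 0.5, 0.9, 0.5])
print(stratified, dropped, round(example_psnr, 2))
```

Reporting the number of dropped videos alongside the stratified means matters here, because a gate that discards most dynamic scenes would itself bias the geometry score toward static content.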