WorldMark: A Unified Benchmark Suite for Interactive Video World Models
Pith reviewed 2026-05-09 21:52 UTC · model grok-4.3
The pith
WorldMark supplies the first common testbed of identical scenes, actions, and metrics for comparing interactive video world models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
WorldMark is introduced as the first benchmark to give interactive Image-to-Video world models a common playing field. It contributes a unified action-mapping layer that translates shared controls into each model's native format, a hierarchical test suite of 500 cases spanning first- and third-person viewpoints, photorealistic and stylized scenes, and three difficulty tiers, and a modular evaluation toolkit for visual quality, control alignment, and world consistency.
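To make the test-suite structure concrete, here is a minimal sketch of what one evaluation case might look like, using only the dimensions named in the abstract (viewpoint, style, difficulty tier, 20-60 s duration). All field names are assumptions, not WorldMark's released schema:

```python
from dataclasses import dataclass
from typing import Literal

Tier = Literal["easy", "medium", "hard"]

@dataclass(frozen=True)
class EvalCase:
    """Hypothetical record for one of the 500 shared evaluation cases."""
    case_id: str
    viewpoint: Literal["first_person", "third_person"]
    style: Literal["photorealistic", "stylized"]
    tier: Tier
    duration_s: float            # the abstract states cases span 20-60 s
    scene_image: str             # path to the shared conditioning frame
    actions: tuple[str, ...]     # shared WASD-style action sequence

    def __post_init__(self):
        # Enforce the 20-60 s range quoted in the abstract.
        if not 20.0 <= self.duration_s <= 60.0:
            raise ValueError("case durations span 20-60 s in the benchmark")

case = EvalCase("easy_001", "first_person", "photorealistic",
                "easy", 20.0, "scenes/easy_001.png", ("W", "W", "D"))
```

Keeping the case immutable (`frozen=True`) matches the benchmark's stated goal: every model sees byte-identical scenes and action sequences.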
What carries the argument
The unified action-mapping layer that converts a shared WASD-style vocabulary into each model's specific control interface while preserving identical scenes and action sequences for cross-model testing.
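A minimal sketch of this mapping idea, assuming two hypothetical model interfaces (one discrete-keyboard, one continuous-velocity); none of the names here are WorldMark's actual API:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SharedAction:
    key: str           # one of "W", "A", "S", "D"
    duration_s: float  # how long the control is held

def to_native(action: SharedAction, model: str) -> dict:
    """Translate a shared WASD-style action into a model-specific payload."""
    if model == "keyboard_model":
        # Models with a discrete keyboard interface take the key directly;
        # assume a 24 fps native clock for illustration.
        return {"key": action.key.lower(), "frames": int(action.duration_s * 24)}
    if model == "velocity_model":
        # Models with continuous control need a velocity vector instead.
        direction = {"W": (0, 1), "S": (0, -1), "A": (-1, 0), "D": (1, 0)}[action.key]
        return {"vx": direction[0], "vy": direction[1], "t": action.duration_s}
    raise ValueError(f"no mapping registered for {model!r}")

# The same shared trajectory can then be replayed against every model:
trajectory = [SharedAction("W", 2.0), SharedAction("D", 0.5)]
payloads = {m: [to_native(a, m) for a in trajectory]
            for m in ("keyboard_model", "velocity_model")}
```

The sketch also shows where bias can creep in: the continuous model's fine-grained velocities are collapsed to four unit directions, which is exactly the discretization concern the referee raises below.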
If this is right
- Models can be evaluated side-by-side on the same 500 cases without advantages from custom test data.
- The modular toolkit supports evolving metrics while keeping test conditions fixed.
- Public release of data, code, and model outputs enables reproduction and extension by others.
- The World Model Arena allows ongoing public competitions with live leaderboards.
Where Pith is reading between the lines
- If the benchmark is widely used, it could reveal which model architectures maintain better world consistency over longer sequences.
- Future work might extend the test suite to include more complex interactions or real-time feedback loops.
- Adoption could push developers to improve native action interfaces to match the standardized controls.
- The approach provides a template for standardizing evaluations in other generative video domains.
Load-bearing premise
The action-mapping layer converts controls without introducing bias that favors particular model architectures over others.
What would settle it
If model rankings shifted when the same scenes were driven through each model's unmapped native controls instead of the mapped ones, that would show the mapping layer biases results toward particular architectures.
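That check can be phrased as a rank-agreement test: evaluate the same scenes under mapped and native controls and compare the resulting model orderings. A stdlib-only sketch, with illustrative model names and made-up rankings rather than WorldMark data:

```python
from itertools import combinations

def kendall_tau(rank_a: list[str], rank_b: list[str]) -> float:
    """Kendall rank correlation between two orderings of the same items
    (assumes both are permutations of each other, no ties)."""
    pos_b = {m: i for i, m in enumerate(rank_b)}
    concordant = discordant = 0
    for (i, x), (j, y) in combinations(enumerate(rank_a), 2):
        # Pair (x, y) appears in order x-before-y in rank_a; check rank_b.
        if pos_b[x] < pos_b[y]:
            concordant += 1
        else:
            discordant += 1
    n = len(rank_a)
    return (concordant - discordant) / (n * (n - 1) / 2)

# Hypothetical leaderboards under mapped vs. native controls:
mapped_ranking = ["model_a", "model_b", "model_c", "model_d"]
native_ranking = ["model_a", "model_c", "model_b", "model_d"]
tau = kendall_tau(mapped_ranking, native_ranking)
# tau near 1.0 means the mapping preserves the ordering; a low or
# negative tau would indicate mapping-induced bias.
```

One swapped adjacent pair out of four models gives tau = 4/6 here; a benchmark audit would want tau close to 1 across all scenes before trusting mapped-control leaderboards.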
read the original abstract
Interactive video generation models such as Genie, YUME, HY-World, and Matrix-Game are advancing rapidly, yet every model is evaluated on its own benchmark with private scenes and trajectories, making fair cross-model comparison impossible. Existing public benchmarks offer useful metrics such as trajectory error, aesthetic scores, and VLM-based judgments, but none supplies the standardized test conditions -- identical scenes, identical action sequences, and a unified control interface -- needed to make those metrics comparable across models with heterogeneous inputs. We introduce WorldMark, the first benchmark that provides such a common playing field for interactive Image-to-Video world models. WorldMark contributes: (1) a unified action-mapping layer that translates a shared WASD-style action vocabulary into each model's native control format, enabling apples-to-apples comparison across six major models on identical scenes and trajectories; (2) a hierarchical test suite of 500 evaluation cases covering first- and third-person viewpoints, photorealistic and stylized scenes, and three difficulty tiers from Easy to Hard spanning 20-60s; and (3) a modular evaluation toolkit for Visual Quality, Control Alignment, and World Consistency, designed so that researchers can reuse our standardized inputs while plugging in their own metrics as the field evolves. We will release all data, evaluation code, and model outputs to facilitate future research. Beyond offline metrics, we launch World Model Arena (warena.ai), an online platform where anyone can pit leading world models against each other in side-by-side battles and watch the live leaderboard.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces WorldMark as the first unified benchmark for interactive Image-to-Video world models. It contributes (1) a unified action-mapping layer that translates a shared WASD-style vocabulary into each model's native controls for cross-model evaluation on identical scenes and trajectories, (2) a hierarchical test suite of 500 cases spanning first- and third-person views, photorealistic and stylized scenes, and three difficulty tiers (Easy to Hard, 20-60s), and (3) a modular toolkit for metrics in Visual Quality, Control Alignment, and World Consistency. The authors commit to releasing all data, evaluation code, and model outputs, and introduce an online World Model Arena platform for live comparisons.
Significance. If the action-mapping layer proves faithful without introducing bias and the test cases/metrics reliably discriminate model quality, this benchmark could fill a critical gap by enabling standardized, reproducible comparisons across models that currently rely on private scenes and trajectories. The open release and modular design would support community reuse and evolution of metrics, potentially accelerating progress in the field.
major comments (2)
- [Abstract] Unified action-mapping layer: The claim that the layer 'enables apples-to-apples comparison across six major models on identical scenes and trajectories' lacks any reported validation, such as per-model fidelity metrics, expert review of mapped trajectories, or ablations of mapped vs. native performance. This is load-bearing for the central claim, as discretization or omission of model-specific controls could systematically favor architectures whose native interfaces align more closely with the chosen mapping.
- [Test suite and evaluation toolkit] The manuscript states that the 500 cases and metrics (Visual Quality, Control Alignment, World Consistency) are designed to discriminate model quality across difficulty tiers, but supplies no evidence, ablations, or pilot results demonstrating that the chosen scenes, trajectories, and metrics actually do so (e.g., no baseline model rankings or sensitivity analysis).
minor comments (2)
- [Abstract] Clarify the exact criteria used to assign cases to Easy/Medium/Hard tiers and how viewpoint (first- vs. third-person) interacts with difficulty.
- The release commitment is positive; consider adding a specific timeline or repository link in the camera-ready version.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major comment point by point below.
read point-by-point responses
- Referee: [Abstract] Unified action-mapping layer: The claim that the layer 'enables apples-to-apples comparison across six major models on identical scenes and trajectories' lacks any reported validation, such as per-model fidelity metrics, expert review of mapped trajectories, or ablations of mapped vs. native performance. This is load-bearing for the central claim, as discretization or omission of model-specific controls could systematically favor architectures whose native interfaces align more closely with the chosen mapping.
Authors: We acknowledge that the submitted manuscript does not include quantitative validation metrics, expert reviews, or ablations for the action-mapping layer. The layer translates a shared WASD-style vocabulary to each model's native controls using their public documentation and interface specifications to preserve core action semantics. We will revise the paper to add a dedicated subsection with mapping examples, a discussion of design choices, and explicit limitations regarding potential loss of fine-grained controls. We agree these additions strengthen the central claim and will incorporate them. revision: partial
- Referee: [Test suite and evaluation toolkit] The manuscript states that the 500 cases and metrics (Visual Quality, Control Alignment, World Consistency) are designed to discriminate model quality across difficulty tiers, but supplies no evidence, ablations, or pilot results demonstrating that the chosen scenes, trajectories, and metrics actually do so (e.g., no baseline model rankings or sensitivity analysis).
Authors: We agree the manuscript lacks pilot results or sensitivity analysis to demonstrate discrimination. The test suite was constructed with hierarchical tiers based on trajectory length, viewpoint, and scene complexity using domain expertise. In revision we will add a new subsection with preliminary evaluations of two models on a subset of Easy and Medium cases, showing metric score distributions to illustrate differentiation across tiers. The modular toolkit and full data release will support further community analysis. revision: yes
Circularity Check
No circularity: benchmark contribution is self-contained with no derivations or self-referential claims
full rationale
The paper introduces WorldMark as a new benchmark with a unified action-mapping layer, 500 test cases, and modular evaluation toolkit for interactive video world models. No equations, fitted parameters, predictions, or derivation chains appear in the abstract or described structure. The central claim is the provision of standardized inputs and interfaces for cross-model comparison, which does not reduce to its own inputs by construction, self-citation load-bearing, or renaming of prior results. This matches the default expectation for non-circular infrastructural papers.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: A common WASD-style action vocabulary can be translated into each model's native control format without systematic bias.
- domain assumption: The 500 evaluation cases spanning viewpoints, styles, and difficulty tiers are sufficient to expose differences in visual quality, control alignment, and world consistency.