WorldMark: A Unified Benchmark Suite for Interactive Video World Models
Pith reviewed 2026-05-09 21:52 UTC · model grok-4.3
The pith
WorldMark supplies the first common testbed of identical scenes, actions, and metrics for comparing interactive video world models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
WorldMark is introduced as the first benchmark to give interactive Image-to-Video world models a common playing field. It contributes a unified action-mapping layer that translates shared controls into each model's native format, a hierarchical test suite of 500 cases spanning first- and third-person viewpoints, photorealistic and stylized scenes, and three difficulty tiers, and a modular evaluation toolkit for visual quality, control alignment, and world consistency.
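To make the test-suite structure concrete, here is a minimal sketch of what one evaluation case might look like, using only the dimensions named in the abstract (viewpoint, style, difficulty tier, 20-60 s duration). All field names are assumptions, not WorldMark's released schema:

```python
from dataclasses import dataclass
from typing import Literal

Tier = Literal["easy", "medium", "hard"]

@dataclass(frozen=True)
class EvalCase:
    """Hypothetical record for one of the 500 shared evaluation cases."""
    case_id: str
    viewpoint: Literal["first_person", "third_person"]
    style: Literal["photorealistic", "stylized"]
    tier: Tier
    duration_s: float            # the abstract states cases span 20-60 s
    scene_image: str             # path to the shared conditioning frame
    actions: tuple[str, ...]     # shared WASD-style action sequence

    def __post_init__(self):
        # Enforce the 20-60 s range quoted in the abstract.
        if not 20.0 <= self.duration_s <= 60.0:
            raise ValueError("case durations span 20-60 s in the benchmark")

case = EvalCase("easy_001", "first_person", "photorealistic",
                "easy", 20.0, "scenes/easy_001.png", ("W", "W", "D"))
```

Keeping the case immutable (`frozen=True`) matches the benchmark's stated goal: every model sees byte-identical scenes and action sequences.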
What carries the argument
The unified action-mapping layer that converts a shared WASD-style vocabulary into each model's specific control interface while preserving identical scenes and action sequences for cross-model testing.
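A minimal sketch of this mapping idea, assuming two hypothetical model interfaces (one discrete-keyboard, one continuous-velocity); none of the names here are WorldMark's actual API:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SharedAction:
    key: str           # one of "W", "A", "S", "D"
    duration_s: float  # how long the control is held

def to_native(action: SharedAction, model: str) -> dict:
    """Translate a shared WASD-style action into a model-specific payload."""
    if model == "keyboard_model":
        # Models with a discrete keyboard interface take the key directly;
        # assume a 24 fps native clock for illustration.
        return {"key": action.key.lower(), "frames": int(action.duration_s * 24)}
    if model == "velocity_model":
        # Models with continuous control need a velocity vector instead.
        direction = {"W": (0, 1), "S": (0, -1), "A": (-1, 0), "D": (1, 0)}[action.key]
        return {"vx": direction[0], "vy": direction[1], "t": action.duration_s}
    raise ValueError(f"no mapping registered for {model!r}")

# The same shared trajectory can then be replayed against every model:
trajectory = [SharedAction("W", 2.0), SharedAction("D", 0.5)]
payloads = {m: [to_native(a, m) for a in trajectory]
            for m in ("keyboard_model", "velocity_model")}
```

The sketch also shows where bias can creep in: the continuous model's fine-grained velocities are collapsed to four unit directions, which is exactly the discretization concern the referee raises below.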
If this is right
- Models can be evaluated side-by-side on the same 500 cases without advantages from custom test data.
- The modular toolkit supports evolving metrics while keeping test conditions fixed.
- Public release of data, code, and model outputs enables reproduction and extension by others.
- The World Model Arena allows ongoing public competitions with live leaderboards.
Where Pith is reading between the lines
- If the benchmark is widely used, it could reveal which model architectures maintain better world consistency over longer sequences.
- Future work might extend the test suite to include more complex interactions or real-time feedback loops.
- Adoption could push developers to improve native action interfaces to match the standardized controls.
- The approach provides a template for standardizing evaluations in other generative video domains.
Load-bearing premise
The action-mapping layer converts controls without introducing bias that favors particular model architectures over others.
What would settle it
If model rankings shifted when the same scenes were driven through each model's unmapped native controls instead of the mapped ones, that would show the mapping layer biases results toward particular architectures.
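That check can be phrased as a rank-agreement test: evaluate the same scenes under mapped and native controls and compare the resulting model orderings. A stdlib-only sketch, with illustrative model names and made-up rankings rather than WorldMark data:

```python
from itertools import combinations

def kendall_tau(rank_a: list[str], rank_b: list[str]) -> float:
    """Kendall rank correlation between two orderings of the same items
    (assumes both are permutations of each other, no ties)."""
    pos_b = {m: i for i, m in enumerate(rank_b)}
    concordant = discordant = 0
    for (i, x), (j, y) in combinations(enumerate(rank_a), 2):
        # Pair (x, y) appears in order x-before-y in rank_a; check rank_b.
        if pos_b[x] < pos_b[y]:
            concordant += 1
        else:
            discordant += 1
    n = len(rank_a)
    return (concordant - discordant) / (n * (n - 1) / 2)

# Hypothetical leaderboards under mapped vs. native controls:
mapped_ranking = ["model_a", "model_b", "model_c", "model_d"]
native_ranking = ["model_a", "model_c", "model_b", "model_d"]
tau = kendall_tau(mapped_ranking, native_ranking)
# tau near 1.0 means the mapping preserves the ordering; a low or
# negative tau would indicate mapping-induced bias.
```

One swapped adjacent pair out of four models gives tau = 4/6 here; a benchmark audit would want tau close to 1 across all scenes before trusting mapped-control leaderboards.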
read the original abstract
Interactive video generation models such as Genie, YUME, HY-World, and Matrix-Game are advancing rapidly, yet every model is evaluated on its own benchmark with private scenes and trajectories, making fair cross-model comparison impossible. Existing public benchmarks offer useful metrics such as trajectory error, aesthetic scores, and VLM-based judgments, but none supplies the standardized test conditions -- identical scenes, identical action sequences, and a unified control interface -- needed to make those metrics comparable across models with heterogeneous inputs. We introduce WorldMark, the first benchmark that provides such a common playing field for interactive Image-to-Video world models. WorldMark contributes: (1) a unified action-mapping layer that translates a shared WASD-style action vocabulary into each model's native control format, enabling apples-to-apples comparison across six major models on identical scenes and trajectories; (2) a hierarchical test suite of 500 evaluation cases covering first- and third-person viewpoints, photorealistic and stylized scenes, and three difficulty tiers from Easy to Hard spanning 20-60s; and (3) a modular evaluation toolkit for Visual Quality, Control Alignment, and World Consistency, designed so that researchers can reuse our standardized inputs while plugging in their own metrics as the field evolves. We will release all data, evaluation code, and model outputs to facilitate future research. Beyond offline metrics, we launch World Model Arena (warena.ai), an online platform where anyone can pit leading world models against each other in side-by-side battles and watch the live leaderboard.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces WorldMark as the first unified benchmark for interactive Image-to-Video world models. It contributes (1) a unified action-mapping layer that translates a shared WASD-style vocabulary into each model's native controls for cross-model evaluation on identical scenes and trajectories, (2) a hierarchical test suite of 500 cases spanning first- and third-person views, photorealistic and stylized scenes, and three difficulty tiers (Easy to Hard, 20-60s), and (3) a modular toolkit for metrics in Visual Quality, Control Alignment, and World Consistency. The authors commit to releasing all data, evaluation code, and model outputs, and introduce an online World Model Arena platform for live comparisons.
Significance. If the action-mapping layer proves faithful without introducing bias and the test cases/metrics reliably discriminate model quality, this benchmark could fill a critical gap by enabling standardized, reproducible comparisons across models that currently rely on private scenes and trajectories. The open release and modular design would support community reuse and evolution of metrics, potentially accelerating progress in the field.
major comments (2)
- [Abstract] Unified action-mapping layer: The claim that the layer 'enables apples-to-apples comparison across six major models on identical scenes and trajectories' lacks any reported validation, such as per-model fidelity metrics, expert review of mapped trajectories, or ablations of mapped vs. native performance. This is load-bearing for the central claim, as discretization or omission of model-specific controls could systematically favor architectures whose native interfaces align more closely with the chosen mapping.
- [Test suite and evaluation toolkit] The manuscript states that the 500 cases and metrics (Visual Quality, Control Alignment, World Consistency) are designed to discriminate model quality across difficulty tiers, but supplies no evidence, ablations, or pilot results demonstrating that the chosen scenes, trajectories, and metrics actually do so (e.g., no baseline model rankings or sensitivity analysis).
minor comments (2)
- [Abstract] Clarify the exact criteria used to assign cases to Easy/Medium/Hard tiers and how viewpoint (first- vs. third-person) interacts with difficulty.
- The release commitment is positive; consider adding a specific timeline or repository link in the camera-ready version.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major comment point by point below.
read point-by-point responses
- Referee: [Abstract] Unified action-mapping layer: The claim that the layer 'enables apples-to-apples comparison across six major models on identical scenes and trajectories' lacks any reported validation, such as per-model fidelity metrics, expert review of mapped trajectories, or ablations of mapped vs. native performance. This is load-bearing for the central claim, as discretization or omission of model-specific controls could systematically favor architectures whose native interfaces align more closely with the chosen mapping.
Authors: We acknowledge that the submitted manuscript does not include quantitative validation metrics, expert reviews, or ablations for the action-mapping layer. The layer translates a shared WASD-style vocabulary to each model's native controls using their public documentation and interface specifications to preserve core action semantics. We will revise the paper to add a dedicated subsection with mapping examples, a discussion of design choices, and explicit limitations regarding potential loss of fine-grained controls. We agree these additions strengthen the central claim and will incorporate them. revision: partial
- Referee: [Test suite and evaluation toolkit] The manuscript states that the 500 cases and metrics (Visual Quality, Control Alignment, World Consistency) are designed to discriminate model quality across difficulty tiers, but supplies no evidence, ablations, or pilot results demonstrating that the chosen scenes, trajectories, and metrics actually do so (e.g., no baseline model rankings or sensitivity analysis).
Authors: We agree the manuscript lacks pilot results or sensitivity analysis to demonstrate discrimination. The test suite was constructed with hierarchical tiers based on trajectory length, viewpoint, and scene complexity using domain expertise. In revision we will add a new subsection with preliminary evaluations of two models on a subset of Easy and Medium cases, showing metric score distributions to illustrate differentiation across tiers. The modular toolkit and full data release will support further community analysis. revision: yes
Circularity Check
No circularity: benchmark contribution is self-contained with no derivations or self-referential claims
full rationale
The paper introduces WorldMark as a new benchmark with a unified action-mapping layer, 500 test cases, and modular evaluation toolkit for interactive video world models. No equations, fitted parameters, predictions, or derivation chains appear in the abstract or described structure. The central claim is the provision of standardized inputs and interfaces for cross-model comparison, which does not reduce to its own inputs by construction, self-citation load-bearing, or renaming of prior results. This matches the default expectation for non-circular infrastructural papers.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: A common WASD-style action vocabulary can be translated into each model's native control format without systematic bias.
- domain assumption: The 500 evaluation cases spanning viewpoints, styles, and difficulty tiers are sufficient to expose differences in visual quality, control alignment, and world consistency.