pith. machine review for the scientific record.

arxiv: 2604.21686 · v1 · submitted 2026-04-23 · 💻 cs.CV

Recognition: unknown

WorldMark: A Unified Benchmark Suite for Interactive Video World Models

Kaipeng Zhang, Kang He, Xiaofeng Mao, Xiaojie Xu, Yongtao Ge, Yuanyang Yin, Yukang Feng, Zhengyuan Lin

Authors on Pith: no claims yet

Pith reviewed 2026-05-09 21:52 UTC · model grok-4.3

classification 💻 cs.CV
keywords benchmark · interactive video · world models · image-to-video · action mapping · evaluation · visual consistency · control alignment

The pith

WorldMark supplies the first common testbed of identical scenes, actions, and metrics for comparing interactive video world models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Interactive video generation models have advanced rapidly but have lacked fair comparison because each relies on private scenes and trajectories. WorldMark introduces a unified benchmark featuring an action-mapping layer, 500 standardized test cases across viewpoints and styles, and modular tools for assessing quality, alignment, and consistency. This setup allows direct evaluation of multiple models on the same inputs, which matters for identifying genuine progress in generating controllable and consistent interactive worlds. The accompanying online arena and public data release aim to support community-wide comparisons.

Core claim

WorldMark is introduced as the first benchmark providing a common playing field for interactive Image-to-Video world models. It contributes three pieces: a unified action-mapping layer that translates shared controls into each model's native format, a hierarchical test suite of 500 cases spanning viewpoints, styles, and difficulty tiers, and a modular evaluation toolkit for visual quality, control alignment, and world consistency.
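To make these contributions concrete, here is a minimal sketch of how a single WorldMark-style test case might be represented. The field names, enum values, and example paths are our assumptions for illustration; the abstract does not publish the released schema.

```python
# Hypothetical sketch of a WorldMark-style test case record.
# Field names and values are illustrative assumptions, not the released schema.
from dataclasses import dataclass
from typing import List, Literal

@dataclass
class TestCase:
    case_id: str
    scene_image: str                                # shared start frame for all models
    viewpoint: Literal["first_person", "third_person"]
    style: Literal["photorealistic", "stylized"]
    tier: Literal["easy", "medium", "hard"]         # difficulty tier
    duration_s: int                                 # 20-60 s per the abstract
    actions: List[str]                              # shared WASD-style action tokens

# Example: one Easy, first-person, photorealistic case.
case = TestCase(
    case_id="easy_fp_001",
    scene_image="scenes/forest_clearing.png",
    viewpoint="first_person",
    style="photorealistic",
    tier="easy",
    duration_s=20,
    actions=["W", "W", "A", "W", "D"],
)
```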

What carries the argument

The unified action-mapping layer that converts a shared WASD-style vocabulary into each model's specific control interface while preserving identical scenes and action sequences for cross-model testing.
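A minimal sketch of what such a mapping layer could look like, assuming two hypothetical native control formats (discrete key presses and per-step camera translations). The adapter classes, step size, and formats are our assumptions, not the released implementation.

```python
# Sketch of an action-mapping layer in the spirit of WorldMark: a shared
# WASD-style token is translated into each model's native control format.
# The adapters and native formats below are hypothetical.
from abc import ABC, abstractmethod
from typing import Any, Dict, List

class ActionAdapter(ABC):
    """Translate shared action tokens into one model's native controls."""
    @abstractmethod
    def map_action(self, token: str) -> Any: ...

class KeyboardModelAdapter(ActionAdapter):
    # Hypothetical model that consumes discrete key presses directly.
    def map_action(self, token: str) -> Dict[str, bool]:
        keys = {"W": "forward", "A": "left", "S": "backward", "D": "right"}
        return {keys[token]: True}

class CameraPoseModelAdapter(ActionAdapter):
    # Hypothetical model that consumes per-step camera translations.
    STEP = 0.5  # metres per shared action token (assumed)
    def map_action(self, token: str) -> List[float]:
        deltas = {"W": [0.0, 0.0, self.STEP], "S": [0.0, 0.0, -self.STEP],
                  "A": [-self.STEP, 0.0, 0.0], "D": [self.STEP, 0.0, 0.0]}
        return deltas[token]

def map_trajectory(adapter: ActionAdapter, actions: List[str]) -> List[Any]:
    # Identical shared trajectory -> model-specific control sequence.
    return [adapter.map_action(a) for a in actions]
```

The design question the referee raises lives exactly here: whichever native format loses the least information under this translation is the one the mapping implicitly favors.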

If this is right

  • Models can be evaluated side-by-side on the same 500 cases without advantages from custom test data.
  • The modular toolkit supports evolving metrics while keeping test conditions fixed (see the plug-in metric sketch after this list).
  • Public release of data, code, and model outputs enables reproduction and extension by others.
  • The World Model Arena allows ongoing public competitions with live leaderboards.
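As referenced in the list above, a plug-in metric interface is one plausible shape for the modular toolkit. The class names, the toy sharpness metric, and the input signature (video array plus shared action tokens) are assumptions for illustration, not the WorldMark API.

```python
# Sketch of a plug-in metric interface over standardized inputs:
# the test conditions stay fixed while metrics are swappable.
from typing import Dict, List, Protocol
import numpy as np

class Metric(Protocol):
    name: str
    def score(self, video: np.ndarray, actions: List[str]) -> float: ...

class MeanFrameSharpness:
    """Toy Visual Quality stand-in: mean per-frame gradient magnitude."""
    name = "mean_frame_sharpness"
    def score(self, video: np.ndarray, actions: List[str]) -> float:
        # video: (T, H, W, C) array in [0, 1]
        gray = video.mean(axis=-1)
        dy = np.abs(np.diff(gray, axis=1)).mean()
        dx = np.abs(np.diff(gray, axis=2)).mean()
        return float(dy + dx)

def evaluate(video: np.ndarray, actions: List[str],
             metrics: List[Metric]) -> Dict[str, float]:
    # Fixed inputs, swappable metrics: the modularity the paper describes.
    return {m.name: m.score(video, actions) for m in metrics}
```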

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the benchmark is widely used, it could reveal which model architectures maintain better world consistency over longer sequences.
  • Future work might extend the test suite to include more complex interactions or real-time feedback loops.
  • Adoption could push developers to improve native action interfaces to match the standardized controls.
  • The approach provides a template for standardizing evaluations in other generative video domains.

Load-bearing premise

The action-mapping layer converts controls without introducing bias that favors particular model architectures over others.

What would settle it

Finding that models perform differently or rankings change when using unmapped native controls on the same scenes would indicate the mapping affects results unfairly.
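One way this falsification test could be operationalized is to compare model rankings under mapped and native controls on the same scenes. The Kendall's-tau check and the placeholder scores below are our illustration, not a protocol from the paper.

```python
# Illustrative check: do rankings survive the switch from mapped to native
# controls? A low rank correlation would suggest the mapping layer itself
# shifts the leaderboard. Scores are placeholder numbers, not results.
from scipy.stats import kendalltau

models = ["model_a", "model_b", "model_c", "model_d"]
score_mapped = {"model_a": 0.81, "model_b": 0.74, "model_c": 0.69, "model_d": 0.62}
score_native = {"model_a": 0.78, "model_b": 0.79, "model_c": 0.70, "model_d": 0.61}

tau, p_value = kendalltau(
    [score_mapped[m] for m in models],
    [score_native[m] for m in models],
)
print(f"Kendall tau between mapped and native rankings: {tau:.2f} (p={p_value:.2f})")
# A tau near 1 supports the load-bearing premise; a low or negative tau
# would indicate the action mapping changes rankings unfairly.
```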

Original abstract

Interactive video generation models such as Genie, YUME, HY-World, and Matrix-Game are advancing rapidly, yet every model is evaluated on its own benchmark with private scenes and trajectories, making fair cross-model comparison impossible. Existing public benchmarks offer useful metrics such as trajectory error, aesthetic scores, and VLM-based judgments, but none supplies the standardized test conditions -- identical scenes, identical action sequences, and a unified control interface -- needed to make those metrics comparable across models with heterogeneous inputs. We introduce WorldMark, the first benchmark that provides such a common playing field for interactive Image-to-Video world models. WorldMark contributes: (1) a unified action-mapping layer that translates a shared WASD-style action vocabulary into each model's native control format, enabling apples-to-apples comparison across six major models on identical scenes and trajectories; (2) a hierarchical test suite of 500 evaluation cases covering first- and third-person viewpoints, photorealistic and stylized scenes, and three difficulty tiers from Easy to Hard spanning 20-60s; and (3) a modular evaluation toolkit for Visual Quality, Control Alignment, and World Consistency, designed so that researchers can reuse our standardized inputs while plugging in their own metrics as the field evolves. We will release all data, evaluation code, and model outputs to facilitate future research. Beyond offline metrics, we launch World Model Arena (warena.ai), an online platform where anyone can pit leading world models against each other in side-by-side battles and watch the live leaderboard.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces WorldMark as the first unified benchmark for interactive Image-to-Video world models. It contributes (1) a unified action-mapping layer that translates a shared WASD-style vocabulary into each model's native controls for cross-model evaluation on identical scenes and trajectories, (2) a hierarchical test suite of 500 cases spanning first- and third-person views, photorealistic and stylized scenes, and three difficulty tiers (Easy to Hard, 20-60s), and (3) a modular toolkit for metrics in Visual Quality, Control Alignment, and World Consistency. The authors commit to releasing all data, evaluation code, and model outputs, and introduce an online World Model Arena platform for live comparisons.

Significance. If the action-mapping layer proves faithful without introducing bias and the test cases/metrics reliably discriminate model quality, this benchmark could fill a critical gap by enabling standardized, reproducible comparisons across models that currently rely on private scenes and trajectories. The open release and modular design would support community reuse and evolution of metrics, potentially accelerating progress in the field.

major comments (2)
  1. [Abstract] Unified action-mapping layer description: The claim that the layer 'enables apples-to-apples comparison across six major models on identical scenes and trajectories' lacks any reported validation, such as per-model fidelity metrics, expert review of mapped trajectories, or ablations of mapped vs. native performance. This is load-bearing for the central claim, as discretization or omission of model-specific controls could systematically favor architectures whose native interfaces align more closely with the chosen mapping.
  2. [Test suite and evaluation toolkit description] The manuscript states that the 500 cases and metrics (Visual Quality, Control Alignment, World Consistency) are designed to discriminate model quality across difficulty tiers, but supplies no evidence, ablations, or pilot results demonstrating that the chosen scenes, trajectories, and metrics actually do so (e.g., no baseline model rankings or sensitivity analysis).
minor comments (2)
  1. [Abstract] Clarify the exact criteria used to assign cases to Easy/Medium/Hard tiers and how viewpoint (first- vs. third-person) interacts with difficulty.
  2. The release commitment is positive; consider adding a specific timeline or repository link in the camera-ready version.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major comment point by point below.

Point-by-point responses
  1. Referee: [Abstract] Unified action-mapping layer description: The claim that the layer 'enables apples-to-apples comparison across six major models on identical scenes and trajectories' lacks any reported validation, such as per-model fidelity metrics, expert review of mapped trajectories, or ablations of mapped vs. native performance. This is load-bearing for the central claim, as discretization or omission of model-specific controls could systematically favor architectures whose native interfaces align more closely with the chosen mapping.

    Authors: We acknowledge that the submitted manuscript does not include quantitative validation metrics, expert reviews, or ablations for the action-mapping layer. The layer translates a shared WASD-style vocabulary into each model's native controls using their public documentation and interface specifications to preserve core action semantics. We will revise the paper to add a dedicated subsection with mapping examples, a discussion of design choices, and explicit limitations regarding potential loss of fine-grained controls. We agree that this validation would strengthen the central claim and will incorporate the addition. revision: partial

  2. Referee: [Test suite and evaluation toolkit description] The manuscript states that the 500 cases and metrics (Visual Quality, Control Alignment, World Consistency) are designed to discriminate model quality across difficulty tiers, but supplies no evidence, ablations, or pilot results demonstrating that the chosen scenes, trajectories, and metrics actually do so (e.g., no baseline model rankings or sensitivity analysis).

    Authors: We agree the manuscript lacks pilot results or sensitivity analysis to demonstrate discrimination. The test suite was constructed with hierarchical tiers based on trajectory length, viewpoint, and scene complexity using domain expertise. In revision we will add a new subsection with preliminary evaluations of two models on a subset of Easy and Medium cases, showing metric score distributions to illustrate differentiation across tiers. The modular toolkit and full data release will support further community analysis. revision: yes
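For readers wondering what such a tier-differentiation check might look like, here is a hedged sketch using a rank-based test on per-case scores. The values are synthetic placeholders and the statistical test is our choice of illustration, not the authors' planned analysis.

```python
# Hypothetical check of whether a metric separates difficulty tiers, in the
# spirit of the proposed pilot evaluation. Scores are synthetic placeholders.
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)
# Per-case Control Alignment scores for one model (placeholder draws).
easy_scores = rng.normal(0.80, 0.05, size=40)
medium_scores = rng.normal(0.70, 0.07, size=40)

stat, p = mannwhitneyu(easy_scores, medium_scores, alternative="greater")
print(f"Easy vs. Medium: U={stat:.1f}, p={p:.3g}")
# A small p-value indicates the metric separates tiers; heavily overlapping
# distributions would suggest the tiers or the metric need refinement.
```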

Circularity Check

0 steps flagged

No circularity: benchmark contribution is self-contained with no derivations or self-referential claims

Full rationale

The paper introduces WorldMark as a new benchmark with a unified action-mapping layer, 500 test cases, and a modular evaluation toolkit for interactive video world models. No equations, fitted parameters, predictions, or derivation chains appear in the abstract or the described structure. The central claim is the provision of standardized inputs and interfaces for cross-model comparison, which does not reduce to its own inputs through circular construction, load-bearing self-citation, or renaming of prior results. This matches the default expectation for a non-circular infrastructural paper.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The benchmark rests on the domain assumption that a single action vocabulary can be losslessly mapped to each model's native interface and that the chosen 500 cases are representative of real interactive use; no free parameters or new invented entities are introduced.

axioms (2)
  • domain assumption A common WASD-style action vocabulary can be translated into each model's native control format without systematic bias.
    Invoked in the description of the unified action-mapping layer.
  • domain assumption The 500 evaluation cases spanning viewpoints, styles, and difficulty tiers are sufficient to expose differences in visual quality, control alignment, and world consistency.
    Stated in the hierarchical test suite contribution.

pith-pipeline@v0.9.0 · 5595 in / 1317 out tokens · 39099 ms · 2026-05-09T21:52:39.149182+00:00 · methodology

