WBench: A Comprehensive Multi-turn Benchmark for Interactive Video World Model Evaluation

Fengjiao Chen; Henghui Ding; Hengrui Hu; Jiamu Li; Kaining Ying; Siyu Ren; Xuezhi Cao; Xunliang Cai; Ziwen Wang

arxiv: 2605.25874 · v1 · pith:MN2WZK4Cnew · submitted 2026-05-25 · 💻 cs.CV

keywords: evaluation · models · wbench · interaction · model · world · interactive · multi-turn

Interactive world models are advancing rapidly, yet existing benchmarks cover only part of the required competencies, leaving no unified standard for systematic evaluation. To fill this gap, we introduce WBench, a comprehensive multi-turn benchmark for interactive world model evaluation along five dimensions, namely video quality, setting adherence, interaction adherence, consistency, and physics compliance. WBench contains 289 test cases and 1,058 interaction turns, where each case specifies a world setting and a multi-turn interaction sequence, covering diverse scenes, styles, subjects, and both first- and third-person perspectives, together with four interaction types, including navigation, subject action, event editing, and perspective switching. For navigation, WBench unifies text, 6-DoF pose, and discrete-action control, enabling evaluation of models with different native input interfaces. Evaluation uses 22 automatic sub-metrics that combine specialist vision models with large multimodal models, and all metrics are validated against human judgments. Across 20 state-of-the-art models, we find that no single model performs strongly across all dimensions. We provide detailed diagnostic insights into the characteristic strengths, weaknesses, and open challenges of each model. Code and data are available at https://github.com/meituan-longcat/WBench.
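As a rough illustration of the structure the abstract describes, the sketch below models a test case as a world setting plus a sequence of interaction turns. It is not WBench's actual data format: the class names (`BenchCase`, `Turn`), the field names, and the rule that only navigation turns carry a control mode are all assumptions made for this example. Only the five dimensions, the four interaction types, and the three navigation control interfaces come from the abstract. With 289 cases and 1,058 turns, a case averages about 3.7 turns.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional

# The five evaluation dimensions named in the abstract.
DIMENSIONS = (
    "video quality",
    "setting adherence",
    "interaction adherence",
    "consistency",
    "physics compliance",
)


class InteractionType(Enum):
    """The four interaction types named in the abstract."""
    NAVIGATION = "navigation"
    SUBJECT_ACTION = "subject_action"
    EVENT_EDITING = "event_editing"
    PERSPECTIVE_SWITCHING = "perspective_switching"


class NavControl(Enum):
    """The three navigation control interfaces named in the abstract."""
    TEXT = "text"
    POSE_6DOF = "6dof_pose"
    DISCRETE_ACTION = "discrete_action"


@dataclass
class Turn:
    """One interaction turn (hypothetical schema)."""
    kind: InteractionType
    instruction: str
    nav_control: Optional[NavControl] = None

    def __post_init__(self) -> None:
        # Assumption for this sketch: a control mode is set
        # if and only if the turn is a navigation turn.
        is_nav = self.kind is InteractionType.NAVIGATION
        if is_nav != (self.nav_control is not None):
            raise ValueError("nav_control must be set exactly for navigation turns")


@dataclass
class BenchCase:
    """One test case: a world setting and its turns (hypothetical schema)."""
    world_setting: str
    perspective: str  # "first_person" or "third_person"
    turns: List[Turn]


def total_turns(cases: List[BenchCase]) -> int:
    """Count interaction turns across a list of cases."""
    return sum(len(case.turns) for case in cases)
```

A case built this way could hold, for example, a text-controlled navigation turn followed by an event-editing turn; `total_turns` then gives the turn count that the abstract reports in aggregate.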

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Current World Models Lack a Persistent State Core
cs.CV 2026-06 unverdicted novelty 6.0

Current world models fail to evolve internal state when unobserved and instead resume scenes at the last observed state, as diagnosed by the new WRBench benchmark across 23 models and 9600 videos.

Einstein World Models
cs.AI 2026-06 unverdicted novelty 5.0

Einstein World Models integrate visual rollouts from a callable world-module into LLM reasoning traces to support complex thought beyond language.

WorldOlympiad: Can Your World Model Survive a Triathlon?
cs.CV 2026-06 unverdicted novelty 5.0

WorldOlympiad is a new benchmark decomposing world-model evaluation into physical, geometry, and interaction tracks using segmentation, MLLM judges, Gaussian splatting, and action prompts on diverse scenarios.