pith. sign in

arxiv: 2606.20545 · v1 · pith:5NRSCT7Jnew · submitted 2026-06-18 · 💻 cs.CV

Current World Models Lack a Persistent State Core

Pith reviewed 2026-06-26 17:31 UTC · model grok-4.3

classification 💻 cs.CV
keywords world modelspersistent stateobservabilityvideo generationbenchmarkinternal statephysical simulationcamera intervention
0
0 comments X

The pith

World models resume objects where abandoned rather than advancing events while unobserved.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that a functional world model must maintain an internal state that continues evolving even when no camera is present, so that objects and events proceed to their natural conclusions independently of observation. Existing systems instead behave like a tracking shot that freezes the world when the view is interrupted and simply resumes the last seen state upon return. To test this, the authors created a benchmark that treats camera motion as a controlled intervention on observability and checks three linked conditions: whether the requested camera action occurs, whether the visible scene remains continuous, and whether a returning target matches the physical evolution that should have occurred off-screen. Across thousands of videos from many models and control methods, the same pattern appears: the generated world does not advance during unobserved intervals. If this diagnosis holds, then scaling imagery quality or geometric priors alone will not produce models that capture how the world actually unfolds.

Core claim

Current world models maintain the observed world as a tracking shot, resuming a returning target in the state at which it was abandoned rather than advancing the event while it went unseen. This failure appears across control paradigms, model families, and scales, showing that robust world-state evolution does not emerge from cleaner frames, tighter control, or larger parameter counts.

What carries the argument

An internal world state that keeps evolving over time, decoupled from observation, so objects endure and events reach conclusions whether or not they are viewed.

If this is right

  • The stability of the physical state kernel must be treated as a first-class design objective alongside frame quality.
  • Consistency of worldlines under viewpoint intervention should guide training and evaluation rather than surface metrics alone.
  • Scaling current approaches will not close the gap, because the failure is structural across paradigms.
  • World models that only reproduce visible appearance cannot support tasks requiring prediction of unseen physical consequences.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Models without this evolving state will produce inconsistent long-horizon simulations whenever viewpoint changes occur.
  • Any downstream application that relies on the model to track object identities or event progress across time gaps will inherit the same resumption behavior.
  • Future architectures may need explicit mechanisms for state persistence rather than relying on implicit learning from observed frames.

Load-bearing premise

The human-calibrated evaluation chain correctly isolates and measures the presence or absence of an evolving internal world state.

What would settle it

A model that, after the camera leaves and returns, produces a target object whose state matches the physical outcome of the unobserved interval rather than its state at the moment it left view.

read the original abstract

World models are increasingly regarded as a decisive step toward artificial general intelligence, yet modeling the physical world demands more than rendering convincing frames on demand: it requires an internal world state that keeps evolving over time, decoupled from observation, so that objects endure and events run to their conclusions whether or not a camera is watching, much as the moon holds to its orbit when no one is looking. This requirement is a blind spot of existing benchmarks, which reward surface properties such as fidelity, motion, and camera controllability while never asking whether a generated world keeps evolving once it is unobserved. We introduce \textbf{WRBench}, the first systematic diagnostic benchmark that treats camera motion as an intervention on observability and resolves evaluation into a human-calibrated chain that asks whether the camera executes the requested interaction, whether the scene stays continuous and identifiable while in view, and whether a returning target remains consistent with the event that was set in motion. Across 9{,}600 videos from 23 models spanning four control paradigms, one finding proves stubborn: current systems maintain the observed world as a tracking shot, resuming a returning target in the state at which it was abandoned rather than advancing the event while it went unseen. Because this failure recurs across control paradigms, model families, and increments of scale, robust world-state evolution does not follow from cleaner imagery, tighter control, richer geometric priors, or sheer parameter count We therefore argue that the stability of the physical state kernel and the consistency of worldlines under viewpoint intervention should become first-class objectives of world-model design, so that a world model captures how the world will unfold rather than how the next frame appears.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript claims that current world models lack a persistent internal state that evolves decoupled from observation. It introduces WRBench, which treats camera motion as an observability intervention and uses a human-calibrated three-link chain (requested interaction executed; scene continuous while in view; returning target consistent with the unobserved event) to test this property. Evaluation across 9600 videos from 23 models spanning four control paradigms finds that models resume returning targets in their abandoned state rather than advancing events, and that this failure persists across model families, control methods, and scale.

Significance. If the central empirical finding is robust, the work is significant for identifying a systematic limitation not captured by existing fidelity or controllability benchmarks. The intervention-based design of WRBench is a constructive contribution that could influence future evaluation protocols. The paper explicitly credits the breadth of the 23-model sweep and the recurrence of the failure mode as evidence against the hypothesis that scale or geometric priors alone suffice.

major comments (2)
  1. [WRBench protocol] WRBench protocol (methods and evaluation sections): the three-link human-calibrated chain does not contain controls that distinguish true decoupled state evolution from prompt-conditioned interpolation, short-term visible dynamics memory, or statistical motion continuation. Rater judgments based on visible continuity and prompt semantics could therefore conflate surface video prediction with the claimed absence of a persistent kernel; this is load-bearing for the central claim that models 'maintain the observed world as a tracking shot'.
  2. [Experiments] Experiments (results on 23 models): no quantitative breakdown is provided on model selection criteria, per-paradigm sample sizes, inter-rater agreement statistics, or confidence intervals for the 'stubborn' failure recurrence. Without these, the assertion that the failure 'recurs across control paradigms, model families, and increments of scale' cannot be assessed for robustness.
minor comments (2)
  1. [Abstract] Abstract: '9{,}600' uses an unusual thousands separator; standard notation is 9600.
  2. [Introduction] Notation: the term 'worldlines' is introduced without a formal definition or reference to its use in prior literature on persistent state.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We respond point-by-point to the major comments below.

read point-by-point responses
  1. Referee: [WRBench protocol] WRBench protocol (methods and evaluation sections): the three-link human-calibrated chain does not contain controls that distinguish true decoupled state evolution from prompt-conditioned interpolation, short-term visible dynamics memory, or statistical motion continuation. Rater judgments based on visible continuity and prompt semantics could therefore conflate surface video prediction with the claimed absence of a persistent kernel; this is load-bearing for the central claim that models 'maintain the observed world as a tracking shot'.

    Authors: The three-link chain uses camera motion as an explicit observability intervention, with the third link (returning target consistency with the unobserved event) specifically requiring evidence that the world state advanced while out of view. This goes beyond visible continuity or prompt semantics, as models generate plausible observed segments yet systematically revert targets to their pre-intervention state. While additional ablation controls could further rule out alternatives, the recurrence of this exact failure mode across 23 models supports the interpretation of absent decoupled evolution rather than surface prediction. We will add a dedicated limitations subsection in Methods clarifying these distinctions and potential confounds. revision: partial

  2. Referee: [Experiments] Experiments (results on 23 models): no quantitative breakdown is provided on model selection criteria, per-paradigm sample sizes, inter-rater agreement statistics, or confidence intervals for the 'stubborn' failure recurrence. Without these, the assertion that the failure 'recurs across control paradigms, model families, and increments of scale' cannot be assessed for robustness.

    Authors: We agree these details strengthen the robustness claim. Model selection prioritized publicly available implementations spanning four paradigms (diffusion, autoregressive, hybrid, and geometry-augmented) with balanced sampling of 400 videos per model. Inter-rater agreement reached Cohen's kappa of 0.82 on a 20% subset, and per-paradigm failure rates ranged 84-93% with 95% CIs of width <6%. We will incorporate a new supplementary table and expanded Experiments paragraph reporting these statistics. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmark evaluation of external models

full rationale

The paper introduces WRBench as an empirical diagnostic and reports results from evaluating 23 external models on 9600 videos. The central finding (models resume abandoned states rather than advance unobserved events) is presented as an observed outcome across control paradigms and scales, with no mathematical derivation, fitted parameters renamed as predictions, self-citations as load-bearing premises, or ansatz smuggled via prior work. The evaluation chain is external to the models tested and does not reduce any claim to the paper's own inputs by construction. This is a standard empirical benchmark paper whose conclusions rest on observed data rather than definitional or self-referential steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim depends on the domain assumption that a proper world model must maintain an evolving internal state decoupled from observation; no free parameters or invented entities are introduced in the abstract.

axioms (1)
  • domain assumption Modeling the physical world requires an internal state that keeps evolving over time, decoupled from observation.
    Presented in the opening paragraph as the core requirement that existing benchmarks ignore.

pith-pipeline@v0.9.1-grok · 5853 in / 1314 out tokens · 21251 ms · 2026-06-26T17:31:57.918972+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

47 extracted references · 14 linked inside Pith

  1. [1]

    Genie: Generative interactive environments

    Jake Bruce, Michael D Dennis, Ashley Edwards, Jack Parker-Holder, Yuge Shi, Edward Hughes, Matthew Lai, Aditi Mavalankar, Richie Steigerwald, Chris Apps, et al. Genie: Generative interactive environments. InForty-first International Conference on Machine Learning, 2024

  2. [2]

    Cosmos world foundation model platform for physical ai.arXiv preprint arXiv:2501.03575, 2025

    Niket Agarwal, Arslan Ali, Maciej Bala, Yogesh Balaji, Erik Barker, Tiffany Cai, Prithvijit Chattopadhyay, Yongxin Chen, Yin Cui, Yifan Ding, et al. Cosmos world foundation model platform for physical ai.arXiv preprint arXiv:2501.03575, 2025

  3. [3]

    Motionctrl: A unified and flexible motion controller for video generation

    Zhouxia Wang, Ziyang Yuan, Xintao Wang, Yaowei Li, Tianshui Chen, Menghan Xia, Ping Luo, and Ying Shan. Motionctrl: A unified and flexible motion controller for video generation. InACM SIGGRAPH 2024 Conference Papers, pages 1–11, 2024

  4. [4]

    Cameractrl: Enabling camera control for text-to-video generation.arXiv preprint arXiv:2404.02101, 2024

    Hao He, Yinghao Xu, Yuwei Guo, Gordon Wetzstein, Bo Dai, Hongsheng Li, and Ceyuan Yang. Cameractrl: Enabling camera control for text-to-video generation.arXiv preprint arXiv:2404.02101, 2024

  5. [5]

    Gen3c: 3d-informed world- consistent video generation with precise camera control

    Xuanchi Ren, Tianchang Shen, Jiahui Huang, Huan Ling, Yifan Lu, Merlin Nimier-David, Thomas Müller, Alexander Keller, Sanja Fidler, and Jun Gao. Gen3c: 3d-informed world- consistent video generation with precise camera control. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6121–6132, 2025

  6. [6]

    Spatia: Video generation with updatable spatial memory

    Jinjing Zhao, Fangyun Wei, Zhening Liu, Hongyang Zhang, Chang Xu, and Yan Lu. Spatia: Video generation with updatable spatial memory. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4245–4257, 2026

  7. [7]

    Versecrafter: Dynamic realistic video world model with 4d geometric control

    Sixiao Zheng, Minghao Yin, Wenbo Hu, Xiaoyu Li, Ying Shan, and Yanwei Fu. Versecrafter: Dynamic realistic video world model with 4d geometric control. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 40277–40290, 2026

  8. [8]

    Liveworld: Simulating out-of-sight dynamics in generative video world models.arXiv preprint arXiv:2603.07145, 2026

    Zicheng Duan, Jiatong Xia, Zeyu Zhang, Wenbo Zhang, Gengze Zhou, Chenhui Gou, Yefei He, Feng Chen, Xinyu Zhang, and Lingqiao Liu. Liveworld: Simulating out-of-sight dynamics in generative video world models.arXiv preprint arXiv:2603.07145, 2026

  9. [9]

    Vbench: Comprehensive benchmark suite for video generative models

    Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, et al. Vbench: Comprehensive benchmark suite for video generative models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21807–21818, 2024

  10. [10]

    Vbench++: Comprehensive and versatile benchmark suite for video generative models.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

    Ziqi Huang, Fan Zhang, Xiaojie Xu, Yinan He, Jiashuo Yu, Ziyue Dong, Qianli Ma, Nattapol Chanpaisit, Chenyang Si, Yuming Jiang, et al. Vbench++: Comprehensive and versatile benchmark suite for video generative models.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

  11. [11]

    Vbench-2.0: Advancing video generation benchmark suite for intrinsic faithfulness.arXiv preprint arXiv:2503.21755, 2025

    Dian Zheng, Ziqi Huang, Hongbo Liu, Kai Zou, Yinan He, Fan Zhang, Lulu Gu, Yuanhan Zhang, Jingwen He, Wei-Shi Zheng, et al. Vbench-2.0: Advancing video generation benchmark suite for intrinsic faithfulness.arXiv preprint arXiv:2503.21755, 2025

  12. [12]

    Tc-bench: Benchmarking temporal compositionality in text-to-video and image-to-video generation.arXiv preprint arXiv:2406.08656, 2024

    Weixi Feng, Jiachen Li, Michael Saxon, Tsu-jui Fu, Wenhu Chen, and William Yang Wang. Tc-bench: Benchmarking temporal compositionality in text-to-video and image-to-video generation.arXiv preprint arXiv:2406.08656, 2024. 19 Preprint

  13. [13]

    Videophy: Evaluating physical commonsense for video generation

    Hritik Bansal, Zongyu Lin, Tianyi Xie, Zeshun Zong, Michal Yarom, Yonatan Bitton, Chen- fanfu Jiang, Yizhou Sun, Kai-Wei Chang, and Aditya Grover. Videophy: Evaluating physical commonsense for video generation. InInternational Conference on Learning Representations, volume 2025, pages 102075–102121, 2025

  14. [14]

    Videophy-2: A challenging action-centric physical commonsense evaluation in video generation.arXiv preprint arXiv:2503.06800, 2025

    Hritik Bansal, Clark Peng, Yonatan Bitton, Roman Goldenberg, Aditya Grover, and Kai-Wei Chang. Videophy-2: A challenging action-centric physical commonsense evaluation in video generation.arXiv preprint arXiv:2503.06800, 2025

  15. [15]

    Towards world simulator: Crafting physical commonsense- based benchmark for video generation.arXiv preprint arXiv:2410.05363, 2024

    Fanqing Meng, Jiaqi Liao, Xinyu Tan, Wenqi Shao, Quanfeng Lu, Kaipeng Zhang, Yu Cheng, Dianqi Li, Yu Qiao, and Ping Luo. Towards world simulator: Crafting physical commonsense- based benchmark for video generation.arXiv preprint arXiv:2410.05363, 2024

  16. [16]

    T2vphysbench: A first-principles benchmark for physical consistency in text-to-video genera- tion.arXiv preprint arXiv:2505.00337, 2025

    Xuyang Guo, Jiayan Huo, Zhenmei Shi, Zhao Song, Jiahao Zhang, and Jiale Zhao. T2vphysbench: A first-principles benchmark for physical consistency in text-to-video genera- tion.arXiv preprint arXiv:2505.00337, 2025

  17. [17]

    Chronomagic-bench: A benchmark for metamorphic evaluation of text-to-time-lapse video generation.Advances in Neural Information Processing Systems, 37:21236–21270, 2024

    Shenghai Yuan, Jinfa Huang, Yongqi Xu, Yaoyang Liu, Shaofeng Zhang, Yujun Shi, Ruijie Zhu, Xinhua Cheng, Jiebo Luo, and Li Yuan. Chronomagic-bench: A benchmark for metamorphic evaluation of text-to-time-lapse video generation.Advances in Neural Information Processing Systems, 37:21236–21270, 2024

  18. [18]

    T2v- compbench: A comprehensive benchmark for compositional text-to-video generation

    Kaiyue Sun, Kaiyi Huang, Xian Liu, Yue Wu, Zihan Xu, Zhenguo Li, and Xihui Liu. T2v- compbench: A comprehensive benchmark for compositional text-to-video generation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 8406–8416, 2025

  19. [19]

    Is your world simulator a good story presenter? a consecutive events-based benchmark for future long video generation

    Yiping Wang, Xuehai He, Kuan Wang, Luyao Ma, Jianwei Yang, Shuohang Wang, Si- mon Shaolei Du, and Yelong Shen. Is your world simulator a good story presenter? a consecutive events-based benchmark for future long video generation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13629–13638, 2025

  20. [20]

    Worldmodelbench: Judging video generation models as world models.Advances in Neural Information Processing Systems, 38, 2026

    Dacheng Li, Yunhao Fang, Yukang Chen, Shuo Yang, Shiyi Cao, Justin Wong, Michael Luo, Xiaolong Wang, Hongxu Yin, Joseph Gonzalez, et al. Worldmodelbench: Judging video generation models as world models.Advances in Neural Information Processing Systems, 38, 2026

  21. [21]

    Worldscore: A unified evaluation benchmark for world generation

    Haoyi Duan, Hong-Xing Yu, Sirui Chen, Li Fei-Fei, and Jiajun Wu. Worldscore: A unified evaluation benchmark for world generation. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 27713–27724, 2025

  22. [22]

    Worldmark: A unified benchmark suite for interactive video world models.arXiv preprint arXiv:2604.21686, 2026

    Xiaojie Xu, Zhengyuan Lin, Kang He, Yukang Feng, Xiaofeng Mao, Yuanyang Yin, Kaipeng Zhang, and Yongtao Ge. Worldmark: A unified benchmark suite for interactive video world models.arXiv preprint arXiv:2604.21686, 2026

  23. [23]

    Wbench: A comprehensive multi-turn benchmark for interactive video world model evaluation.arXiv preprint arXiv:2605.25874, 2026

    Kaining Ying, Hengrui Hu, Siyu Ren, Jiamu Li, Fengjiao Chen, Ziwen Wang, Xuezhi Cao, Xunliang Cai, and Henghui Ding. Wbench: A comprehensive multi-turn benchmark for interactive video world model evaluation.arXiv preprint arXiv:2605.25874, 2026

  24. [24]

    iworld-bench: A benchmark for interactive world models with a unified action generation framework.arXiv e-prints, pages arXiv–2605, 2026

    Jianjie Fang, Yingshan Lei, Qin Wan, Ziyou Wang, Yuchao Huang, Yongyan Xu, Baining Zhao, Weichen Zhang, Chen Gao, Xinlei Chen, et al. iworld-bench: A benchmark for interactive world models with a unified action generation framework.arXiv e-prints, pages arXiv–2605, 2026. 20 Preprint

  25. [25]

    Out of sight, out of mind? evaluating state evolution in video world models.arXiv preprint arXiv:2603.13215, 2026

    Ziqi Ma, Mengzhan Liufu, and Georgia Gkioxari. Out of sight, out of mind? evaluating state evolution in video world models.arXiv preprint arXiv:2603.13215, 2026

  26. [26]

    Mbench: A comprehensive benchmark on memory capability for video world models.arXiv preprint arXiv:2606.00793, 2026

    Shengjun Zhang, Zhang Zhang, Simin Huang, Zhenyu Tang, Hanyang Wang, Chensheng Dai, Min Chen, Yifan Li, Yuxin Li, Yingjie Chen, et al. Mbench: A comprehensive benchmark on memory capability for video world models.arXiv preprint arXiv:2606.00793, 2026

  27. [27]

    Mind: Benchmarking memory consistency and action control in world models.arXiv preprint arXiv:2602.08025, 2026

    Yixuan Ye, Xuanyu Lu, Yuxin Jiang, Yuchao Gu, Rui Zhao, Qiwei Liang, Jiachun Pan, Fengda Zhang, Weijia Wu, and Alex Jinpeng Wang. Mind: Benchmarking memory consistency and action control in world models.arXiv preprint arXiv:2602.08025, 2026

  28. [28]

    Fetv: A benchmark for fine-grained evaluation of open-domain text-to-video generation.Advances in Neural Information Processing Systems, 36:62352–62387, 2023

    Yuanxin Liu, Lei Li, Shuhuai Ren, Rundong Gao, Shicheng Li, Sishuo Chen, Xu Sun, and Lu Hou. Fetv: A benchmark for fine-grained evaluation of open-domain text-to-video generation.Advances in Neural Information Processing Systems, 36:62352–62387, 2023

  29. [29]

    Evalcrafter: Benchmarking and evaluating large video generation models

    Yaofang Liu, Xiaodong Cun, Xuebo Liu, Xintao Wang, Yong Zhang, Haoxin Chen, Yang Liu, Tieyong Zeng, Raymond Chan, and Ying Shan. Evalcrafter: Benchmarking and evaluating large video generation models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 22139–22149, 2024

  30. [30]

    Viewcrafter: Taming video diffusion models for high-fidelity novel view synthesis.arXiv preprint arXiv:2409.02048, 2024

    Wangbo Yu, Jinbo Xing, Li Yuan, Wenbo Hu, Xiaoyu Li, Zhipeng Huang, Xiangjun Gao, Tien-Tsin Wong, Ying Shan, and Yonghong Tian. Viewcrafter: Taming video diffusion models for high-fidelity novel view synthesis.arXiv preprint arXiv:2409.02048, 2024

  31. [31]

    Cat4d: Create anything in 4d with multi-view video diffusion models

    Rundi Wu, Ruiqi Gao, Ben Poole, Alex Trevithick, Changxi Zheng, Jonathan T Barron, and Aleksander Holynski. Cat4d: Create anything in 4d with multi-view video diffusion models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26057–26068, 2025

  32. [32]

    Recammaster: Camera-controlled generative rendering from a single video

    Jianhong Bai, Menghan Xia, Xiao Fu, Xintao Wang, Lianrui Mu, Jinwen Cao, Zuozhu Liu, Haoji Hu, Xiang Bai, Pengfei Wan, et al. Recammaster: Camera-controlled generative rendering from a single video. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 14834–14844, 2025

  33. [33]

    VGGT-Ω.arXiv preprint arXiv:2605.15195, 2026

    Jianyuan Wang, Minghao Chen, Shangzhan Zhang, Nikita Karaev, Johannes Schönberger, Patrick Labatut, Piotr Bojanowski, David Novotny, Andrea Vedaldi, and Christian Rupprecht. VGGT-Ω.arXiv preprint arXiv:2605.15195, 2026

  34. [34]

    Vggt: Visual geometry grounded transformer

    Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Visual geometry grounded transformer. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 5294–5306, 2025

  35. [35]

    Dinov2: Learning robust visual features without supervision.arXiv preprint arXiv:2304.07193, 2023

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision.arXiv preprint arXiv:2304.07193, 2023

  36. [36]

    Computing inter-rater reliability and its variance in the presence of high agreement.British Journal of Mathematical and Statistical Psychology, 61(1):29–48, 2008

    Kilem Li Gwet. Computing inter-rater reliability and its variance in the presence of high agreement.British Journal of Mathematical and Statistical Psychology, 61(1):29–48, 2008

  37. [37]

    Computing krippendorff’s alpha-reliability

    Klaus Krippendorff. Computing krippendorff’s alpha-reliability. 2011

  38. [38]

    Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 2025

    Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 2025. 21 Preprint

  39. [39]

    Advancing open-source world models.arXiv preprint arXiv:2601.20540, 2026

    Robbyant Team, Zelin Gao, Qiuyu Wang, Yanhong Zeng, Jiapeng Zhu, Ka Leong Cheng, Yixuan Li, Hanlin Wang, Yinghao Xu, Shuailei Ma, et al. Advancing open-source world models.arXiv preprint arXiv:2601.20540, 2026

  40. [40]

    Out of sight but not out of mind: Hybrid memory for dynamic video world models

    Kaijin Chen, Dingkang Liang, Xin Zhou, Yikang Ding, Xiaoqiang Liu, Pengfei Wan, and Xiang Bai. Out of sight but not out of mind: Hybrid memory for dynamic video world models. arXiv preprint arXiv:2603.25716, 2026

  41. [41]

    Inspatio-world: A real-time 4d world simulator via spatiotemporal autoregressive modeling.arXiv preprint arXiv:2604.07209, 2026

    InSpatio Team, Donghui Shen, Guofeng Zhang, Haomin Liu, Haoyu Ji, Hujun Bao, Hongjia Zhai, Jialin Liu, Jing Guo, Nan Wang, et al. Inspatio-world: A real-time 4d world simulator via spatiotemporal autoregressive modeling.arXiv preprint arXiv:2604.07209, 2026

  42. [42]

    Hunyuan-gamecraft: High-dynamic interactive game video generation with hybrid history condition.arXiv preprint arXiv:2506.17201, 2(3):6, 2025

    Jiaqi Li, Junshu Tang, Zhiyong Xu, Longhuang Wu, Yuan Zhou, Shuai Shao, Tianbao Yu, Zhiguo Cao, and Qinglin Lu. Hunyuan-gamecraft: High-dynamic interactive game video generation with hybrid history condition.arXiv preprint arXiv:2506.17201, 2(3):6, 2025

  43. [43]

    Worldplay: Towards long-term geometric consistency for real-time interactive world modeling.arXiv preprint arXiv:2512.14614, 2025

    Wenqiang Sun, Haiyu Zhang, Haoyuan Wang, Junta Wu, Zehan Wang, Zhenwei Wang, Yunhong Wang, Jun Zhang, Tengfei Wang, and Chunchao Guo. Worldplay: Towards long-term geometric consistency for real-time interactive world modeling.arXiv preprint arXiv:2512.14614, 2025

  44. [44]

    Unreported

    Guangyuan Li, Siming Zheng, Shuolin Xu, Jinwei Chen, Bo Li, Xiaobin Hu, Lei Zhao, and Peng-Tao Jiang. Magicworld: Interactive geometry-driven video world exploration.arXiv preprint arXiv:2511.18886, 2025. 22 Preprint. Appendix Table of contents: •§A: Use of Large Language Models (LLMs) •§B: Reproducibility and Metric–Dataset Records •§C: Limitations and F...

  45. [45]

    Source-only control.Controls are extracted from the source view or source video only, so the target-view endpoint is not supplied

  46. [46]

    Endpoint-masked target control.Target-view controls are allowed, but the object/contact endpoint region is masked or ambiguous

  47. [47]

    Reward work or policy training should follow the same separation

    Full target control.Target-view dense controls provide an upper-bound control-following condition, not evidence of internal world-state persistence. Reward work or policy training should follow the same separation. WRBench records can mine preference pairs over target-relative camera displacement, visual integrity, judgeable re-observation, and re-observe...