Current World Models Lack a Persistent State Core

Dexu Zhu; Duyu Tang; Guo Tang; Haoyuan Shi; Jie Cao; Jinpeng Lu; Linghan Cai; Xiaozhu Ju; Yinda Chen; Yi Zhang

arxiv: 2606.20545 · v1 · pith:5NRSCT7Jnew · submitted 2026-06-18 · 💻 cs.CV

Current World Models Lack a Persistent State Core

Jinpeng Lu , Dexu Zhu , Haoyuan Shi , Linghan Cai , Guo Tang , Yinda Chen , Jie Cao , Duyu Tang

show 3 more authors

Yi Zhang Yong Dai Xiaozhu Ju

This is my paper

Pith reviewed 2026-06-26 17:31 UTC · model grok-4.3

classification 💻 cs.CV

keywords world modelspersistent stateobservabilityvideo generationbenchmarkinternal statephysical simulationcamera intervention

0 comments

The pith

World models resume objects where abandoned rather than advancing events while unobserved.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that a functional world model must maintain an internal state that continues evolving even when no camera is present, so that objects and events proceed to their natural conclusions independently of observation. Existing systems instead behave like a tracking shot that freezes the world when the view is interrupted and simply resumes the last seen state upon return. To test this, the authors created a benchmark that treats camera motion as a controlled intervention on observability and checks three linked conditions: whether the requested camera action occurs, whether the visible scene remains continuous, and whether a returning target matches the physical evolution that should have occurred off-screen. Across thousands of videos from many models and control methods, the same pattern appears: the generated world does not advance during unobserved intervals. If this diagnosis holds, then scaling imagery quality or geometric priors alone will not produce models that capture how the world actually unfolds.

Core claim

Current world models maintain the observed world as a tracking shot, resuming a returning target in the state at which it was abandoned rather than advancing the event while it went unseen. This failure appears across control paradigms, model families, and scales, showing that robust world-state evolution does not emerge from cleaner frames, tighter control, or larger parameter counts.

What carries the argument

An internal world state that keeps evolving over time, decoupled from observation, so objects endure and events reach conclusions whether or not they are viewed.

If this is right

The stability of the physical state kernel must be treated as a first-class design objective alongside frame quality.
Consistency of worldlines under viewpoint intervention should guide training and evaluation rather than surface metrics alone.
Scaling current approaches will not close the gap, because the failure is structural across paradigms.
World models that only reproduce visible appearance cannot support tasks requiring prediction of unseen physical consequences.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Models without this evolving state will produce inconsistent long-horizon simulations whenever viewpoint changes occur.
Any downstream application that relies on the model to track object identities or event progress across time gaps will inherit the same resumption behavior.
Future architectures may need explicit mechanisms for state persistence rather than relying on implicit learning from observed frames.

Load-bearing premise

The human-calibrated evaluation chain correctly isolates and measures the presence or absence of an evolving internal world state.

What would settle it

A model that, after the camera leaves and returns, produces a target object whose state matches the physical outcome of the unobserved interval rather than its state at the moment it left view.

read the original abstract

World models are increasingly regarded as a decisive step toward artificial general intelligence, yet modeling the physical world demands more than rendering convincing frames on demand: it requires an internal world state that keeps evolving over time, decoupled from observation, so that objects endure and events run to their conclusions whether or not a camera is watching, much as the moon holds to its orbit when no one is looking. This requirement is a blind spot of existing benchmarks, which reward surface properties such as fidelity, motion, and camera controllability while never asking whether a generated world keeps evolving once it is unobserved. We introduce \textbf{WRBench}, the first systematic diagnostic benchmark that treats camera motion as an intervention on observability and resolves evaluation into a human-calibrated chain that asks whether the camera executes the requested interaction, whether the scene stays continuous and identifiable while in view, and whether a returning target remains consistent with the event that was set in motion. Across 9{,}600 videos from 23 models spanning four control paradigms, one finding proves stubborn: current systems maintain the observed world as a tracking shot, resuming a returning target in the state at which it was abandoned rather than advancing the event while it went unseen. Because this failure recurs across control paradigms, model families, and increments of scale, robust world-state evolution does not follow from cleaner imagery, tighter control, richer geometric priors, or sheer parameter count We therefore argue that the stability of the physical state kernel and the consistency of worldlines under viewpoint intervention should become first-class objectives of world-model design, so that a world model captures how the world will unfold rather than how the next frame appears.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

WRBench flags a real gap in world-model evaluation by testing unobserved state evolution, but the human chain may not cleanly separate it from surface prediction.

read the letter

The paper's main point is that existing world models treat the world like a paused video: when the camera leaves and returns, they resume the last observed frame instead of letting events continue. WRBench tries to catch this by forcing camera motion as an observability break and checking three links in a human-rated chain.

The benchmark itself is the clearest contribution. Using camera intervention to probe persistence is a direct way to test something prior benchmarks ignored. Running it on 23 models across four control types gives a broad sample, and the consistent failure mode across them is worth noting. The argument that scale and geometric priors do not fix the issue follows from the setup.

The soft spot is the evaluation chain. The stress-test concern holds: nothing in the described protocol rules out models that achieve consistency through prompt-driven interpolation or short-term motion statistics rather than a decoupled evolving kernel. Without the full methods on how raters are calibrated and how they distinguish these cases, the claim that models lack any internal state advancement rests on an assumption that may not be isolated. The abstract gives no numbers or variance, so the uniformity of the failure is hard to gauge.

This is useful for groups building world models aimed at planning or long-horizon tasks. It deserves peer review because the diagnostic idea is new and the empirical scope is decent, even if the isolation of the failure mode needs tighter controls.

Referee Report

2 major / 2 minor

Summary. The manuscript claims that current world models lack a persistent internal state that evolves decoupled from observation. It introduces WRBench, which treats camera motion as an observability intervention and uses a human-calibrated three-link chain (requested interaction executed; scene continuous while in view; returning target consistent with the unobserved event) to test this property. Evaluation across 9600 videos from 23 models spanning four control paradigms finds that models resume returning targets in their abandoned state rather than advancing events, and that this failure persists across model families, control methods, and scale.

Significance. If the central empirical finding is robust, the work is significant for identifying a systematic limitation not captured by existing fidelity or controllability benchmarks. The intervention-based design of WRBench is a constructive contribution that could influence future evaluation protocols. The paper explicitly credits the breadth of the 23-model sweep and the recurrence of the failure mode as evidence against the hypothesis that scale or geometric priors alone suffice.

major comments (2)

[WRBench protocol] WRBench protocol (methods and evaluation sections): the three-link human-calibrated chain does not contain controls that distinguish true decoupled state evolution from prompt-conditioned interpolation, short-term visible dynamics memory, or statistical motion continuation. Rater judgments based on visible continuity and prompt semantics could therefore conflate surface video prediction with the claimed absence of a persistent kernel; this is load-bearing for the central claim that models 'maintain the observed world as a tracking shot'.
[Experiments] Experiments (results on 23 models): no quantitative breakdown is provided on model selection criteria, per-paradigm sample sizes, inter-rater agreement statistics, or confidence intervals for the 'stubborn' failure recurrence. Without these, the assertion that the failure 'recurs across control paradigms, model families, and increments of scale' cannot be assessed for robustness.

minor comments (2)

[Abstract] Abstract: '9{,}600' uses an unusual thousands separator; standard notation is 9600.
[Introduction] Notation: the term 'worldlines' is introduced without a formal definition or reference to its use in prior literature on persistent state.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We respond point-by-point to the major comments below.

read point-by-point responses

Referee: [WRBench protocol] WRBench protocol (methods and evaluation sections): the three-link human-calibrated chain does not contain controls that distinguish true decoupled state evolution from prompt-conditioned interpolation, short-term visible dynamics memory, or statistical motion continuation. Rater judgments based on visible continuity and prompt semantics could therefore conflate surface video prediction with the claimed absence of a persistent kernel; this is load-bearing for the central claim that models 'maintain the observed world as a tracking shot'.

Authors: The three-link chain uses camera motion as an explicit observability intervention, with the third link (returning target consistency with the unobserved event) specifically requiring evidence that the world state advanced while out of view. This goes beyond visible continuity or prompt semantics, as models generate plausible observed segments yet systematically revert targets to their pre-intervention state. While additional ablation controls could further rule out alternatives, the recurrence of this exact failure mode across 23 models supports the interpretation of absent decoupled evolution rather than surface prediction. We will add a dedicated limitations subsection in Methods clarifying these distinctions and potential confounds. revision: partial
Referee: [Experiments] Experiments (results on 23 models): no quantitative breakdown is provided on model selection criteria, per-paradigm sample sizes, inter-rater agreement statistics, or confidence intervals for the 'stubborn' failure recurrence. Without these, the assertion that the failure 'recurs across control paradigms, model families, and increments of scale' cannot be assessed for robustness.

Authors: We agree these details strengthen the robustness claim. Model selection prioritized publicly available implementations spanning four paradigms (diffusion, autoregressive, hybrid, and geometry-augmented) with balanced sampling of 400 videos per model. Inter-rater agreement reached Cohen's kappa of 0.82 on a 20% subset, and per-paradigm failure rates ranged 84-93% with 95% CIs of width <6%. We will incorporate a new supplementary table and expanded Experiments paragraph reporting these statistics. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmark evaluation of external models

full rationale

The paper introduces WRBench as an empirical diagnostic and reports results from evaluating 23 external models on 9600 videos. The central finding (models resume abandoned states rather than advance unobserved events) is presented as an observed outcome across control paradigms and scales, with no mathematical derivation, fitted parameters renamed as predictions, self-citations as load-bearing premises, or ansatz smuggled via prior work. The evaluation chain is external to the models tested and does not reduce any claim to the paper's own inputs by construction. This is a standard empirical benchmark paper whose conclusions rest on observed data rather than definitional or self-referential steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim depends on the domain assumption that a proper world model must maintain an evolving internal state decoupled from observation; no free parameters or invented entities are introduced in the abstract.

axioms (1)

domain assumption Modeling the physical world requires an internal state that keeps evolving over time, decoupled from observation.
Presented in the opening paragraph as the core requirement that existing benchmarks ignore.

pith-pipeline@v0.9.1-grok · 5853 in / 1314 out tokens · 21251 ms · 2026-06-26T17:31:57.918972+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

47 extracted references · 14 linked inside Pith

[1]

Genie: Generative interactive environments

Jake Bruce, Michael D Dennis, Ashley Edwards, Jack Parker-Holder, Yuge Shi, Edward Hughes, Matthew Lai, Aditi Mavalankar, Richie Steigerwald, Chris Apps, et al. Genie: Generative interactive environments. InForty-first International Conference on Machine Learning, 2024

2024
[2]

Cosmos world foundation model platform for physical ai.arXiv preprint arXiv:2501.03575, 2025

Niket Agarwal, Arslan Ali, Maciej Bala, Yogesh Balaji, Erik Barker, Tiffany Cai, Prithvijit Chattopadhyay, Yongxin Chen, Yin Cui, Yifan Ding, et al. Cosmos world foundation model platform for physical ai.arXiv preprint arXiv:2501.03575, 2025

Pith/arXiv arXiv 2025
[3]

Motionctrl: A unified and flexible motion controller for video generation

Zhouxia Wang, Ziyang Yuan, Xintao Wang, Yaowei Li, Tianshui Chen, Menghan Xia, Ping Luo, and Ying Shan. Motionctrl: A unified and flexible motion controller for video generation. InACM SIGGRAPH 2024 Conference Papers, pages 1–11, 2024

2024
[4]

Cameractrl: Enabling camera control for text-to-video generation.arXiv preprint arXiv:2404.02101, 2024

Hao He, Yinghao Xu, Yuwei Guo, Gordon Wetzstein, Bo Dai, Hongsheng Li, and Ceyuan Yang. Cameractrl: Enabling camera control for text-to-video generation.arXiv preprint arXiv:2404.02101, 2024

Pith/arXiv arXiv 2024
[5]

Gen3c: 3d-informed world- consistent video generation with precise camera control

Xuanchi Ren, Tianchang Shen, Jiahui Huang, Huan Ling, Yifan Lu, Merlin Nimier-David, Thomas Müller, Alexander Keller, Sanja Fidler, and Jun Gao. Gen3c: 3d-informed world- consistent video generation with precise camera control. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6121–6132, 2025

2025
[6]

Spatia: Video generation with updatable spatial memory

Jinjing Zhao, Fangyun Wei, Zhening Liu, Hongyang Zhang, Chang Xu, and Yan Lu. Spatia: Video generation with updatable spatial memory. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4245–4257, 2026

2026
[7]

Versecrafter: Dynamic realistic video world model with 4d geometric control

Sixiao Zheng, Minghao Yin, Wenbo Hu, Xiaoyu Li, Ying Shan, and Yanwei Fu. Versecrafter: Dynamic realistic video world model with 4d geometric control. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 40277–40290, 2026

2026
[8]

Liveworld: Simulating out-of-sight dynamics in generative video world models.arXiv preprint arXiv:2603.07145, 2026

Zicheng Duan, Jiatong Xia, Zeyu Zhang, Wenbo Zhang, Gengze Zhou, Chenhui Gou, Yefei He, Feng Chen, Xinyu Zhang, and Lingqiao Liu. Liveworld: Simulating out-of-sight dynamics in generative video world models.arXiv preprint arXiv:2603.07145, 2026

arXiv 2026
[9]

Vbench: Comprehensive benchmark suite for video generative models

Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, et al. Vbench: Comprehensive benchmark suite for video generative models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21807–21818, 2024

2024
[10]

Vbench++: Comprehensive and versatile benchmark suite for video generative models.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

Ziqi Huang, Fan Zhang, Xiaojie Xu, Yinan He, Jiashuo Yu, Ziyue Dong, Qianli Ma, Nattapol Chanpaisit, Chenyang Si, Yuming Jiang, et al. Vbench++: Comprehensive and versatile benchmark suite for video generative models.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

2025
[11]

Vbench-2.0: Advancing video generation benchmark suite for intrinsic faithfulness.arXiv preprint arXiv:2503.21755, 2025

Dian Zheng, Ziqi Huang, Hongbo Liu, Kai Zou, Yinan He, Fan Zhang, Lulu Gu, Yuanhan Zhang, Jingwen He, Wei-Shi Zheng, et al. Vbench-2.0: Advancing video generation benchmark suite for intrinsic faithfulness.arXiv preprint arXiv:2503.21755, 2025

Pith/arXiv arXiv 2025
[12]

Tc-bench: Benchmarking temporal compositionality in text-to-video and image-to-video generation.arXiv preprint arXiv:2406.08656, 2024

Weixi Feng, Jiachen Li, Michael Saxon, Tsu-jui Fu, Wenhu Chen, and William Yang Wang. Tc-bench: Benchmarking temporal compositionality in text-to-video and image-to-video generation.arXiv preprint arXiv:2406.08656, 2024. 19 Preprint

arXiv 2024
[13]

Videophy: Evaluating physical commonsense for video generation

Hritik Bansal, Zongyu Lin, Tianyi Xie, Zeshun Zong, Michal Yarom, Yonatan Bitton, Chen- fanfu Jiang, Yizhou Sun, Kai-Wei Chang, and Aditya Grover. Videophy: Evaluating physical commonsense for video generation. InInternational Conference on Learning Representations, volume 2025, pages 102075–102121, 2025

2025
[14]

Videophy-2: A challenging action-centric physical commonsense evaluation in video generation.arXiv preprint arXiv:2503.06800, 2025

Hritik Bansal, Clark Peng, Yonatan Bitton, Roman Goldenberg, Aditya Grover, and Kai-Wei Chang. Videophy-2: A challenging action-centric physical commonsense evaluation in video generation.arXiv preprint arXiv:2503.06800, 2025

arXiv 2025
[15]

Towards world simulator: Crafting physical commonsense- based benchmark for video generation.arXiv preprint arXiv:2410.05363, 2024

Fanqing Meng, Jiaqi Liao, Xinyu Tan, Wenqi Shao, Quanfeng Lu, Kaipeng Zhang, Yu Cheng, Dianqi Li, Yu Qiao, and Ping Luo. Towards world simulator: Crafting physical commonsense- based benchmark for video generation.arXiv preprint arXiv:2410.05363, 2024

Pith/arXiv arXiv 2024
[16]

T2vphysbench: A first-principles benchmark for physical consistency in text-to-video genera- tion.arXiv preprint arXiv:2505.00337, 2025

Xuyang Guo, Jiayan Huo, Zhenmei Shi, Zhao Song, Jiahao Zhang, and Jiale Zhao. T2vphysbench: A first-principles benchmark for physical consistency in text-to-video genera- tion.arXiv preprint arXiv:2505.00337, 2025

arXiv 2025
[17]

Chronomagic-bench: A benchmark for metamorphic evaluation of text-to-time-lapse video generation.Advances in Neural Information Processing Systems, 37:21236–21270, 2024

Shenghai Yuan, Jinfa Huang, Yongqi Xu, Yaoyang Liu, Shaofeng Zhang, Yujun Shi, Ruijie Zhu, Xinhua Cheng, Jiebo Luo, and Li Yuan. Chronomagic-bench: A benchmark for metamorphic evaluation of text-to-time-lapse video generation.Advances in Neural Information Processing Systems, 37:21236–21270, 2024

2024
[18]

T2v- compbench: A comprehensive benchmark for compositional text-to-video generation

Kaiyue Sun, Kaiyi Huang, Xian Liu, Yue Wu, Zihan Xu, Zhenguo Li, and Xihui Liu. T2v- compbench: A comprehensive benchmark for compositional text-to-video generation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 8406–8416, 2025

2025
[19]

Is your world simulator a good story presenter? a consecutive events-based benchmark for future long video generation

Yiping Wang, Xuehai He, Kuan Wang, Luyao Ma, Jianwei Yang, Shuohang Wang, Si- mon Shaolei Du, and Yelong Shen. Is your world simulator a good story presenter? a consecutive events-based benchmark for future long video generation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13629–13638, 2025

2025
[20]

Worldmodelbench: Judging video generation models as world models.Advances in Neural Information Processing Systems, 38, 2026

Dacheng Li, Yunhao Fang, Yukang Chen, Shuo Yang, Shiyi Cao, Justin Wong, Michael Luo, Xiaolong Wang, Hongxu Yin, Joseph Gonzalez, et al. Worldmodelbench: Judging video generation models as world models.Advances in Neural Information Processing Systems, 38, 2026

2026
[21]

Worldscore: A unified evaluation benchmark for world generation

Haoyi Duan, Hong-Xing Yu, Sirui Chen, Li Fei-Fei, and Jiajun Wu. Worldscore: A unified evaluation benchmark for world generation. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 27713–27724, 2025

2025
[22]

Worldmark: A unified benchmark suite for interactive video world models.arXiv preprint arXiv:2604.21686, 2026

Xiaojie Xu, Zhengyuan Lin, Kang He, Yukang Feng, Xiaofeng Mao, Yuanyang Yin, Kaipeng Zhang, and Yongtao Ge. Worldmark: A unified benchmark suite for interactive video world models.arXiv preprint arXiv:2604.21686, 2026

Pith/arXiv arXiv 2026
[23]

Wbench: A comprehensive multi-turn benchmark for interactive video world model evaluation.arXiv preprint arXiv:2605.25874, 2026

Kaining Ying, Hengrui Hu, Siyu Ren, Jiamu Li, Fengjiao Chen, Ziwen Wang, Xuezhi Cao, Xunliang Cai, and Henghui Ding. Wbench: A comprehensive multi-turn benchmark for interactive video world model evaluation.arXiv preprint arXiv:2605.25874, 2026

Pith/arXiv arXiv 2026
[24]

iworld-bench: A benchmark for interactive world models with a unified action generation framework.arXiv e-prints, pages arXiv–2605, 2026

Jianjie Fang, Yingshan Lei, Qin Wan, Ziyou Wang, Yuchao Huang, Yongyan Xu, Baining Zhao, Weichen Zhang, Chen Gao, Xinlei Chen, et al. iworld-bench: A benchmark for interactive world models with a unified action generation framework.arXiv e-prints, pages arXiv–2605, 2026. 20 Preprint

2026
[25]

Out of sight, out of mind? evaluating state evolution in video world models.arXiv preprint arXiv:2603.13215, 2026

Ziqi Ma, Mengzhan Liufu, and Georgia Gkioxari. Out of sight, out of mind? evaluating state evolution in video world models.arXiv preprint arXiv:2603.13215, 2026

arXiv 2026
[26]

Mbench: A comprehensive benchmark on memory capability for video world models.arXiv preprint arXiv:2606.00793, 2026

Shengjun Zhang, Zhang Zhang, Simin Huang, Zhenyu Tang, Hanyang Wang, Chensheng Dai, Min Chen, Yifan Li, Yuxin Li, Yingjie Chen, et al. Mbench: A comprehensive benchmark on memory capability for video world models.arXiv preprint arXiv:2606.00793, 2026

Pith/arXiv arXiv 2026
[27]

Mind: Benchmarking memory consistency and action control in world models.arXiv preprint arXiv:2602.08025, 2026

Yixuan Ye, Xuanyu Lu, Yuxin Jiang, Yuchao Gu, Rui Zhao, Qiwei Liang, Jiachun Pan, Fengda Zhang, Weijia Wu, and Alex Jinpeng Wang. Mind: Benchmarking memory consistency and action control in world models.arXiv preprint arXiv:2602.08025, 2026

arXiv 2026
[28]

Fetv: A benchmark for fine-grained evaluation of open-domain text-to-video generation.Advances in Neural Information Processing Systems, 36:62352–62387, 2023

Yuanxin Liu, Lei Li, Shuhuai Ren, Rundong Gao, Shicheng Li, Sishuo Chen, Xu Sun, and Lu Hou. Fetv: A benchmark for fine-grained evaluation of open-domain text-to-video generation.Advances in Neural Information Processing Systems, 36:62352–62387, 2023

2023
[29]

Evalcrafter: Benchmarking and evaluating large video generation models

Yaofang Liu, Xiaodong Cun, Xuebo Liu, Xintao Wang, Yong Zhang, Haoxin Chen, Yang Liu, Tieyong Zeng, Raymond Chan, and Ying Shan. Evalcrafter: Benchmarking and evaluating large video generation models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 22139–22149, 2024

2024
[30]

Viewcrafter: Taming video diffusion models for high-fidelity novel view synthesis.arXiv preprint arXiv:2409.02048, 2024

Wangbo Yu, Jinbo Xing, Li Yuan, Wenbo Hu, Xiaoyu Li, Zhipeng Huang, Xiangjun Gao, Tien-Tsin Wong, Ying Shan, and Yonghong Tian. Viewcrafter: Taming video diffusion models for high-fidelity novel view synthesis.arXiv preprint arXiv:2409.02048, 2024

Pith/arXiv arXiv 2024
[31]

Cat4d: Create anything in 4d with multi-view video diffusion models

Rundi Wu, Ruiqi Gao, Ben Poole, Alex Trevithick, Changxi Zheng, Jonathan T Barron, and Aleksander Holynski. Cat4d: Create anything in 4d with multi-view video diffusion models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26057–26068, 2025

2025
[32]

Recammaster: Camera-controlled generative rendering from a single video

Jianhong Bai, Menghan Xia, Xiao Fu, Xintao Wang, Lianrui Mu, Jinwen Cao, Zuozhu Liu, Haoji Hu, Xiang Bai, Pengfei Wan, et al. Recammaster: Camera-controlled generative rendering from a single video. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 14834–14844, 2025

2025
[33]

VGGT-Ω.arXiv preprint arXiv:2605.15195, 2026

Jianyuan Wang, Minghao Chen, Shangzhan Zhang, Nikita Karaev, Johannes Schönberger, Patrick Labatut, Piotr Bojanowski, David Novotny, Andrea Vedaldi, and Christian Rupprecht. VGGT-Ω.arXiv preprint arXiv:2605.15195, 2026

Pith/arXiv arXiv 2026
[34]

Vggt: Visual geometry grounded transformer

Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Visual geometry grounded transformer. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 5294–5306, 2025

2025
[35]

Dinov2: Learning robust visual features without supervision.arXiv preprint arXiv:2304.07193, 2023

Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision.arXiv preprint arXiv:2304.07193, 2023

Pith/arXiv arXiv 2023
[36]

Computing inter-rater reliability and its variance in the presence of high agreement.British Journal of Mathematical and Statistical Psychology, 61(1):29–48, 2008

Kilem Li Gwet. Computing inter-rater reliability and its variance in the presence of high agreement.British Journal of Mathematical and Statistical Psychology, 61(1):29–48, 2008

2008
[37]

Computing krippendorff’s alpha-reliability

Klaus Krippendorff. Computing krippendorff’s alpha-reliability. 2011

2011
[38]

Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 2025

Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 2025. 21 Preprint

Pith/arXiv arXiv 2025
[39]

Advancing open-source world models.arXiv preprint arXiv:2601.20540, 2026

Robbyant Team, Zelin Gao, Qiuyu Wang, Yanhong Zeng, Jiapeng Zhu, Ka Leong Cheng, Yixuan Li, Hanlin Wang, Yinghao Xu, Shuailei Ma, et al. Advancing open-source world models.arXiv preprint arXiv:2601.20540, 2026

Pith/arXiv arXiv 2026
[40]

Out of sight but not out of mind: Hybrid memory for dynamic video world models

Kaijin Chen, Dingkang Liang, Xin Zhou, Yikang Ding, Xiaoqiang Liu, Pengfei Wan, and Xiang Bai. Out of sight but not out of mind: Hybrid memory for dynamic video world models. arXiv preprint arXiv:2603.25716, 2026

arXiv 2026
[41]

Inspatio-world: A real-time 4d world simulator via spatiotemporal autoregressive modeling.arXiv preprint arXiv:2604.07209, 2026

InSpatio Team, Donghui Shen, Guofeng Zhang, Haomin Liu, Haoyu Ji, Hujun Bao, Hongjia Zhai, Jialin Liu, Jing Guo, Nan Wang, et al. Inspatio-world: A real-time 4d world simulator via spatiotemporal autoregressive modeling.arXiv preprint arXiv:2604.07209, 2026

Pith/arXiv arXiv 2026
[42]

Hunyuan-gamecraft: High-dynamic interactive game video generation with hybrid history condition.arXiv preprint arXiv:2506.17201, 2(3):6, 2025

Jiaqi Li, Junshu Tang, Zhiyong Xu, Longhuang Wu, Yuan Zhou, Shuai Shao, Tianbao Yu, Zhiguo Cao, and Qinglin Lu. Hunyuan-gamecraft: High-dynamic interactive game video generation with hybrid history condition.arXiv preprint arXiv:2506.17201, 2(3):6, 2025

arXiv 2025
[43]

Worldplay: Towards long-term geometric consistency for real-time interactive world modeling.arXiv preprint arXiv:2512.14614, 2025

Wenqiang Sun, Haiyu Zhang, Haoyuan Wang, Junta Wu, Zehan Wang, Zhenwei Wang, Yunhong Wang, Jun Zhang, Tengfei Wang, and Chunchao Guo. Worldplay: Towards long-term geometric consistency for real-time interactive world modeling.arXiv preprint arXiv:2512.14614, 2025

Pith/arXiv arXiv 2025
[44]

Unreported

Guangyuan Li, Siming Zheng, Shuolin Xu, Jinwei Chen, Bo Li, Xiaobin Hu, Lei Zhao, and Peng-Tao Jiang. Magicworld: Interactive geometry-driven video world exploration.arXiv preprint arXiv:2511.18886, 2025. 22 Preprint. Appendix Table of contents: •§A: Use of Large Language Models (LLMs) •§B: Reproducibility and Metric–Dataset Records •§C: Limitations and F...

arXiv 2025
[45]

Source-only control.Controls are extracted from the source view or source video only, so the target-view endpoint is not supplied
[46]

Endpoint-masked target control.Target-view controls are allowed, but the object/contact endpoint region is masked or ambiguous
[47]

Reward work or policy training should follow the same separation

Full target control.Target-view dense controls provide an upper-bound control-following condition, not evidence of internal world-state persistence. Reward work or policy training should follow the same separation. WRBench records can mine preference pairs over target-relative camera displacement, visual integrity, judgeable re-observation, and re-observe...

[1] [1]

Genie: Generative interactive environments

Jake Bruce, Michael D Dennis, Ashley Edwards, Jack Parker-Holder, Yuge Shi, Edward Hughes, Matthew Lai, Aditi Mavalankar, Richie Steigerwald, Chris Apps, et al. Genie: Generative interactive environments. InForty-first International Conference on Machine Learning, 2024

2024

[2] [2]

Cosmos world foundation model platform for physical ai.arXiv preprint arXiv:2501.03575, 2025

Niket Agarwal, Arslan Ali, Maciej Bala, Yogesh Balaji, Erik Barker, Tiffany Cai, Prithvijit Chattopadhyay, Yongxin Chen, Yin Cui, Yifan Ding, et al. Cosmos world foundation model platform for physical ai.arXiv preprint arXiv:2501.03575, 2025

Pith/arXiv arXiv 2025

[3] [3]

Motionctrl: A unified and flexible motion controller for video generation

Zhouxia Wang, Ziyang Yuan, Xintao Wang, Yaowei Li, Tianshui Chen, Menghan Xia, Ping Luo, and Ying Shan. Motionctrl: A unified and flexible motion controller for video generation. InACM SIGGRAPH 2024 Conference Papers, pages 1–11, 2024

2024

[4] [4]

Cameractrl: Enabling camera control for text-to-video generation.arXiv preprint arXiv:2404.02101, 2024

Hao He, Yinghao Xu, Yuwei Guo, Gordon Wetzstein, Bo Dai, Hongsheng Li, and Ceyuan Yang. Cameractrl: Enabling camera control for text-to-video generation.arXiv preprint arXiv:2404.02101, 2024

Pith/arXiv arXiv 2024

[5] [5]

Gen3c: 3d-informed world- consistent video generation with precise camera control

Xuanchi Ren, Tianchang Shen, Jiahui Huang, Huan Ling, Yifan Lu, Merlin Nimier-David, Thomas Müller, Alexander Keller, Sanja Fidler, and Jun Gao. Gen3c: 3d-informed world- consistent video generation with precise camera control. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6121–6132, 2025

2025

[6] [6]

Spatia: Video generation with updatable spatial memory

Jinjing Zhao, Fangyun Wei, Zhening Liu, Hongyang Zhang, Chang Xu, and Yan Lu. Spatia: Video generation with updatable spatial memory. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4245–4257, 2026

2026

[7] [7]

Versecrafter: Dynamic realistic video world model with 4d geometric control

Sixiao Zheng, Minghao Yin, Wenbo Hu, Xiaoyu Li, Ying Shan, and Yanwei Fu. Versecrafter: Dynamic realistic video world model with 4d geometric control. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 40277–40290, 2026

2026

[8] [8]

Liveworld: Simulating out-of-sight dynamics in generative video world models.arXiv preprint arXiv:2603.07145, 2026

Zicheng Duan, Jiatong Xia, Zeyu Zhang, Wenbo Zhang, Gengze Zhou, Chenhui Gou, Yefei He, Feng Chen, Xinyu Zhang, and Lingqiao Liu. Liveworld: Simulating out-of-sight dynamics in generative video world models.arXiv preprint arXiv:2603.07145, 2026

arXiv 2026

[9] [9]

Vbench: Comprehensive benchmark suite for video generative models

Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, et al. Vbench: Comprehensive benchmark suite for video generative models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21807–21818, 2024

2024

[10] [10]

Vbench++: Comprehensive and versatile benchmark suite for video generative models.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

Ziqi Huang, Fan Zhang, Xiaojie Xu, Yinan He, Jiashuo Yu, Ziyue Dong, Qianli Ma, Nattapol Chanpaisit, Chenyang Si, Yuming Jiang, et al. Vbench++: Comprehensive and versatile benchmark suite for video generative models.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

2025

[11] [11]

Vbench-2.0: Advancing video generation benchmark suite for intrinsic faithfulness.arXiv preprint arXiv:2503.21755, 2025

Dian Zheng, Ziqi Huang, Hongbo Liu, Kai Zou, Yinan He, Fan Zhang, Lulu Gu, Yuanhan Zhang, Jingwen He, Wei-Shi Zheng, et al. Vbench-2.0: Advancing video generation benchmark suite for intrinsic faithfulness.arXiv preprint arXiv:2503.21755, 2025

Pith/arXiv arXiv 2025

[12] [12]

Tc-bench: Benchmarking temporal compositionality in text-to-video and image-to-video generation.arXiv preprint arXiv:2406.08656, 2024

Weixi Feng, Jiachen Li, Michael Saxon, Tsu-jui Fu, Wenhu Chen, and William Yang Wang. Tc-bench: Benchmarking temporal compositionality in text-to-video and image-to-video generation.arXiv preprint arXiv:2406.08656, 2024. 19 Preprint

arXiv 2024

[13] [13]

Videophy: Evaluating physical commonsense for video generation

Hritik Bansal, Zongyu Lin, Tianyi Xie, Zeshun Zong, Michal Yarom, Yonatan Bitton, Chen- fanfu Jiang, Yizhou Sun, Kai-Wei Chang, and Aditya Grover. Videophy: Evaluating physical commonsense for video generation. InInternational Conference on Learning Representations, volume 2025, pages 102075–102121, 2025

2025

[14] [14]

Videophy-2: A challenging action-centric physical commonsense evaluation in video generation.arXiv preprint arXiv:2503.06800, 2025

Hritik Bansal, Clark Peng, Yonatan Bitton, Roman Goldenberg, Aditya Grover, and Kai-Wei Chang. Videophy-2: A challenging action-centric physical commonsense evaluation in video generation.arXiv preprint arXiv:2503.06800, 2025

arXiv 2025

[15] [15]

Towards world simulator: Crafting physical commonsense- based benchmark for video generation.arXiv preprint arXiv:2410.05363, 2024

Fanqing Meng, Jiaqi Liao, Xinyu Tan, Wenqi Shao, Quanfeng Lu, Kaipeng Zhang, Yu Cheng, Dianqi Li, Yu Qiao, and Ping Luo. Towards world simulator: Crafting physical commonsense- based benchmark for video generation.arXiv preprint arXiv:2410.05363, 2024

Pith/arXiv arXiv 2024

[16] [16]

T2vphysbench: A first-principles benchmark for physical consistency in text-to-video genera- tion.arXiv preprint arXiv:2505.00337, 2025

Xuyang Guo, Jiayan Huo, Zhenmei Shi, Zhao Song, Jiahao Zhang, and Jiale Zhao. T2vphysbench: A first-principles benchmark for physical consistency in text-to-video genera- tion.arXiv preprint arXiv:2505.00337, 2025

arXiv 2025

[17] [17]

Chronomagic-bench: A benchmark for metamorphic evaluation of text-to-time-lapse video generation.Advances in Neural Information Processing Systems, 37:21236–21270, 2024

Shenghai Yuan, Jinfa Huang, Yongqi Xu, Yaoyang Liu, Shaofeng Zhang, Yujun Shi, Ruijie Zhu, Xinhua Cheng, Jiebo Luo, and Li Yuan. Chronomagic-bench: A benchmark for metamorphic evaluation of text-to-time-lapse video generation.Advances in Neural Information Processing Systems, 37:21236–21270, 2024

2024

[18] [18]

T2v- compbench: A comprehensive benchmark for compositional text-to-video generation

Kaiyue Sun, Kaiyi Huang, Xian Liu, Yue Wu, Zihan Xu, Zhenguo Li, and Xihui Liu. T2v- compbench: A comprehensive benchmark for compositional text-to-video generation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 8406–8416, 2025

2025

[19] [19]

Is your world simulator a good story presenter? a consecutive events-based benchmark for future long video generation

Yiping Wang, Xuehai He, Kuan Wang, Luyao Ma, Jianwei Yang, Shuohang Wang, Si- mon Shaolei Du, and Yelong Shen. Is your world simulator a good story presenter? a consecutive events-based benchmark for future long video generation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13629–13638, 2025

2025

[20] [20]

Worldmodelbench: Judging video generation models as world models.Advances in Neural Information Processing Systems, 38, 2026

Dacheng Li, Yunhao Fang, Yukang Chen, Shuo Yang, Shiyi Cao, Justin Wong, Michael Luo, Xiaolong Wang, Hongxu Yin, Joseph Gonzalez, et al. Worldmodelbench: Judging video generation models as world models.Advances in Neural Information Processing Systems, 38, 2026

2026

[21] [21]

Worldscore: A unified evaluation benchmark for world generation

Haoyi Duan, Hong-Xing Yu, Sirui Chen, Li Fei-Fei, and Jiajun Wu. Worldscore: A unified evaluation benchmark for world generation. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 27713–27724, 2025

2025

[22] [22]

Worldmark: A unified benchmark suite for interactive video world models.arXiv preprint arXiv:2604.21686, 2026

Xiaojie Xu, Zhengyuan Lin, Kang He, Yukang Feng, Xiaofeng Mao, Yuanyang Yin, Kaipeng Zhang, and Yongtao Ge. Worldmark: A unified benchmark suite for interactive video world models.arXiv preprint arXiv:2604.21686, 2026

Pith/arXiv arXiv 2026

[23] [23]

Wbench: A comprehensive multi-turn benchmark for interactive video world model evaluation.arXiv preprint arXiv:2605.25874, 2026

Kaining Ying, Hengrui Hu, Siyu Ren, Jiamu Li, Fengjiao Chen, Ziwen Wang, Xuezhi Cao, Xunliang Cai, and Henghui Ding. Wbench: A comprehensive multi-turn benchmark for interactive video world model evaluation.arXiv preprint arXiv:2605.25874, 2026

Pith/arXiv arXiv 2026

[24] [24]

iworld-bench: A benchmark for interactive world models with a unified action generation framework.arXiv e-prints, pages arXiv–2605, 2026

Jianjie Fang, Yingshan Lei, Qin Wan, Ziyou Wang, Yuchao Huang, Yongyan Xu, Baining Zhao, Weichen Zhang, Chen Gao, Xinlei Chen, et al. iworld-bench: A benchmark for interactive world models with a unified action generation framework.arXiv e-prints, pages arXiv–2605, 2026. 20 Preprint

2026

[25] [25]

Out of sight, out of mind? evaluating state evolution in video world models.arXiv preprint arXiv:2603.13215, 2026

Ziqi Ma, Mengzhan Liufu, and Georgia Gkioxari. Out of sight, out of mind? evaluating state evolution in video world models.arXiv preprint arXiv:2603.13215, 2026

arXiv 2026

[26] [26]

Mbench: A comprehensive benchmark on memory capability for video world models.arXiv preprint arXiv:2606.00793, 2026

Shengjun Zhang, Zhang Zhang, Simin Huang, Zhenyu Tang, Hanyang Wang, Chensheng Dai, Min Chen, Yifan Li, Yuxin Li, Yingjie Chen, et al. Mbench: A comprehensive benchmark on memory capability for video world models.arXiv preprint arXiv:2606.00793, 2026

Pith/arXiv arXiv 2026

[27] [27]

Mind: Benchmarking memory consistency and action control in world models.arXiv preprint arXiv:2602.08025, 2026

Yixuan Ye, Xuanyu Lu, Yuxin Jiang, Yuchao Gu, Rui Zhao, Qiwei Liang, Jiachun Pan, Fengda Zhang, Weijia Wu, and Alex Jinpeng Wang. Mind: Benchmarking memory consistency and action control in world models.arXiv preprint arXiv:2602.08025, 2026

arXiv 2026

[28] [28]

Fetv: A benchmark for fine-grained evaluation of open-domain text-to-video generation.Advances in Neural Information Processing Systems, 36:62352–62387, 2023

Yuanxin Liu, Lei Li, Shuhuai Ren, Rundong Gao, Shicheng Li, Sishuo Chen, Xu Sun, and Lu Hou. Fetv: A benchmark for fine-grained evaluation of open-domain text-to-video generation.Advances in Neural Information Processing Systems, 36:62352–62387, 2023

2023

[29] [29]

Evalcrafter: Benchmarking and evaluating large video generation models

Yaofang Liu, Xiaodong Cun, Xuebo Liu, Xintao Wang, Yong Zhang, Haoxin Chen, Yang Liu, Tieyong Zeng, Raymond Chan, and Ying Shan. Evalcrafter: Benchmarking and evaluating large video generation models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 22139–22149, 2024

2024

[30] [30]

Viewcrafter: Taming video diffusion models for high-fidelity novel view synthesis.arXiv preprint arXiv:2409.02048, 2024

Wangbo Yu, Jinbo Xing, Li Yuan, Wenbo Hu, Xiaoyu Li, Zhipeng Huang, Xiangjun Gao, Tien-Tsin Wong, Ying Shan, and Yonghong Tian. Viewcrafter: Taming video diffusion models for high-fidelity novel view synthesis.arXiv preprint arXiv:2409.02048, 2024

Pith/arXiv arXiv 2024

[31] [31]

Cat4d: Create anything in 4d with multi-view video diffusion models

Rundi Wu, Ruiqi Gao, Ben Poole, Alex Trevithick, Changxi Zheng, Jonathan T Barron, and Aleksander Holynski. Cat4d: Create anything in 4d with multi-view video diffusion models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26057–26068, 2025

2025

[32] [32]

Recammaster: Camera-controlled generative rendering from a single video

Jianhong Bai, Menghan Xia, Xiao Fu, Xintao Wang, Lianrui Mu, Jinwen Cao, Zuozhu Liu, Haoji Hu, Xiang Bai, Pengfei Wan, et al. Recammaster: Camera-controlled generative rendering from a single video. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 14834–14844, 2025

2025

[33] [33]

VGGT-Ω.arXiv preprint arXiv:2605.15195, 2026

Jianyuan Wang, Minghao Chen, Shangzhan Zhang, Nikita Karaev, Johannes Schönberger, Patrick Labatut, Piotr Bojanowski, David Novotny, Andrea Vedaldi, and Christian Rupprecht. VGGT-Ω.arXiv preprint arXiv:2605.15195, 2026

Pith/arXiv arXiv 2026

[34] [34]

Vggt: Visual geometry grounded transformer

Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Visual geometry grounded transformer. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 5294–5306, 2025

2025

[35] [35]

Dinov2: Learning robust visual features without supervision.arXiv preprint arXiv:2304.07193, 2023

Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision.arXiv preprint arXiv:2304.07193, 2023

Pith/arXiv arXiv 2023

[36] [36]

Computing inter-rater reliability and its variance in the presence of high agreement.British Journal of Mathematical and Statistical Psychology, 61(1):29–48, 2008

Kilem Li Gwet. Computing inter-rater reliability and its variance in the presence of high agreement.British Journal of Mathematical and Statistical Psychology, 61(1):29–48, 2008

2008

[37] [37]

Computing krippendorff’s alpha-reliability

Klaus Krippendorff. Computing krippendorff’s alpha-reliability. 2011

2011

[38] [38]

Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 2025

Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 2025. 21 Preprint

Pith/arXiv arXiv 2025

[39] [39]

Advancing open-source world models.arXiv preprint arXiv:2601.20540, 2026

Robbyant Team, Zelin Gao, Qiuyu Wang, Yanhong Zeng, Jiapeng Zhu, Ka Leong Cheng, Yixuan Li, Hanlin Wang, Yinghao Xu, Shuailei Ma, et al. Advancing open-source world models.arXiv preprint arXiv:2601.20540, 2026

Pith/arXiv arXiv 2026

[40] [40]

Out of sight but not out of mind: Hybrid memory for dynamic video world models

Kaijin Chen, Dingkang Liang, Xin Zhou, Yikang Ding, Xiaoqiang Liu, Pengfei Wan, and Xiang Bai. Out of sight but not out of mind: Hybrid memory for dynamic video world models. arXiv preprint arXiv:2603.25716, 2026

arXiv 2026

[41] [41]

Inspatio-world: A real-time 4d world simulator via spatiotemporal autoregressive modeling.arXiv preprint arXiv:2604.07209, 2026

InSpatio Team, Donghui Shen, Guofeng Zhang, Haomin Liu, Haoyu Ji, Hujun Bao, Hongjia Zhai, Jialin Liu, Jing Guo, Nan Wang, et al. Inspatio-world: A real-time 4d world simulator via spatiotemporal autoregressive modeling.arXiv preprint arXiv:2604.07209, 2026

Pith/arXiv arXiv 2026

[42] [42]

Hunyuan-gamecraft: High-dynamic interactive game video generation with hybrid history condition.arXiv preprint arXiv:2506.17201, 2(3):6, 2025

Jiaqi Li, Junshu Tang, Zhiyong Xu, Longhuang Wu, Yuan Zhou, Shuai Shao, Tianbao Yu, Zhiguo Cao, and Qinglin Lu. Hunyuan-gamecraft: High-dynamic interactive game video generation with hybrid history condition.arXiv preprint arXiv:2506.17201, 2(3):6, 2025

arXiv 2025

[43] [43]

Worldplay: Towards long-term geometric consistency for real-time interactive world modeling.arXiv preprint arXiv:2512.14614, 2025

Wenqiang Sun, Haiyu Zhang, Haoyuan Wang, Junta Wu, Zehan Wang, Zhenwei Wang, Yunhong Wang, Jun Zhang, Tengfei Wang, and Chunchao Guo. Worldplay: Towards long-term geometric consistency for real-time interactive world modeling.arXiv preprint arXiv:2512.14614, 2025

Pith/arXiv arXiv 2025

[44] [44]

Unreported

Guangyuan Li, Siming Zheng, Shuolin Xu, Jinwei Chen, Bo Li, Xiaobin Hu, Lei Zhao, and Peng-Tao Jiang. Magicworld: Interactive geometry-driven video world exploration.arXiv preprint arXiv:2511.18886, 2025. 22 Preprint. Appendix Table of contents: •§A: Use of Large Language Models (LLMs) •§B: Reproducibility and Metric–Dataset Records •§C: Limitations and F...

arXiv 2025

[45] [45]

Source-only control.Controls are extracted from the source view or source video only, so the target-view endpoint is not supplied

[46] [46]

Endpoint-masked target control.Target-view controls are allowed, but the object/contact endpoint region is masked or ambiguous

[47] [47]

Reward work or policy training should follow the same separation

Full target control.Target-view dense controls provide an upper-bound control-following condition, not evidence of internal world-state persistence. Reward work or policy training should follow the same separation. WRBench records can mine preference pairs over target-relative camera displacement, visual integrity, judgeable re-observation, and re-observe...