Current World Models Lack a Persistent State Core
Pith reviewed 2026-06-26 17:31 UTC · model grok-4.3
The pith
World models resume objects where abandoned rather than advancing events while unobserved.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Current world models maintain the observed world as a tracking shot, resuming a returning target in the state at which it was abandoned rather than advancing the event while it went unseen. This failure appears across control paradigms, model families, and scales, showing that robust world-state evolution does not emerge from cleaner frames, tighter control, or larger parameter counts.
What carries the argument
An internal world state that keeps evolving over time, decoupled from observation, so objects endure and events reach conclusions whether or not they are viewed.
If this is right
- The stability of the physical state kernel must be treated as a first-class design objective alongside frame quality.
- Consistency of worldlines under viewpoint intervention should guide training and evaluation rather than surface metrics alone.
- Scaling current approaches will not close the gap, because the failure is structural across paradigms.
- World models that only reproduce visible appearance cannot support tasks requiring prediction of unseen physical consequences.
Where Pith is reading between the lines
- Models without this evolving state will produce inconsistent long-horizon simulations whenever viewpoint changes occur.
- Any downstream application that relies on the model to track object identities or event progress across time gaps will inherit the same resumption behavior.
- Future architectures may need explicit mechanisms for state persistence rather than relying on implicit learning from observed frames.
Load-bearing premise
The human-calibrated evaluation chain correctly isolates and measures the presence or absence of an evolving internal world state.
What would settle it
A model that, after the camera leaves and returns, produces a target object whose state matches the physical outcome of the unobserved interval rather than its state at the moment it left view.
read the original abstract
World models are increasingly regarded as a decisive step toward artificial general intelligence, yet modeling the physical world demands more than rendering convincing frames on demand: it requires an internal world state that keeps evolving over time, decoupled from observation, so that objects endure and events run to their conclusions whether or not a camera is watching, much as the moon holds to its orbit when no one is looking. This requirement is a blind spot of existing benchmarks, which reward surface properties such as fidelity, motion, and camera controllability while never asking whether a generated world keeps evolving once it is unobserved. We introduce \textbf{WRBench}, the first systematic diagnostic benchmark that treats camera motion as an intervention on observability and resolves evaluation into a human-calibrated chain that asks whether the camera executes the requested interaction, whether the scene stays continuous and identifiable while in view, and whether a returning target remains consistent with the event that was set in motion. Across 9{,}600 videos from 23 models spanning four control paradigms, one finding proves stubborn: current systems maintain the observed world as a tracking shot, resuming a returning target in the state at which it was abandoned rather than advancing the event while it went unseen. Because this failure recurs across control paradigms, model families, and increments of scale, robust world-state evolution does not follow from cleaner imagery, tighter control, richer geometric priors, or sheer parameter count We therefore argue that the stability of the physical state kernel and the consistency of worldlines under viewpoint intervention should become first-class objectives of world-model design, so that a world model captures how the world will unfold rather than how the next frame appears.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that current world models lack a persistent internal state that evolves decoupled from observation. It introduces WRBench, which treats camera motion as an observability intervention and uses a human-calibrated three-link chain (requested interaction executed; scene continuous while in view; returning target consistent with the unobserved event) to test this property. Evaluation across 9600 videos from 23 models spanning four control paradigms finds that models resume returning targets in their abandoned state rather than advancing events, and that this failure persists across model families, control methods, and scale.
Significance. If the central empirical finding is robust, the work is significant for identifying a systematic limitation not captured by existing fidelity or controllability benchmarks. The intervention-based design of WRBench is a constructive contribution that could influence future evaluation protocols. The paper explicitly credits the breadth of the 23-model sweep and the recurrence of the failure mode as evidence against the hypothesis that scale or geometric priors alone suffice.
major comments (2)
- [WRBench protocol] WRBench protocol (methods and evaluation sections): the three-link human-calibrated chain does not contain controls that distinguish true decoupled state evolution from prompt-conditioned interpolation, short-term visible dynamics memory, or statistical motion continuation. Rater judgments based on visible continuity and prompt semantics could therefore conflate surface video prediction with the claimed absence of a persistent kernel; this is load-bearing for the central claim that models 'maintain the observed world as a tracking shot'.
- [Experiments] Experiments (results on 23 models): no quantitative breakdown is provided on model selection criteria, per-paradigm sample sizes, inter-rater agreement statistics, or confidence intervals for the 'stubborn' failure recurrence. Without these, the assertion that the failure 'recurs across control paradigms, model families, and increments of scale' cannot be assessed for robustness.
minor comments (2)
- [Abstract] Abstract: '9{,}600' uses an unusual thousands separator; standard notation is 9600.
- [Introduction] Notation: the term 'worldlines' is introduced without a formal definition or reference to its use in prior literature on persistent state.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We respond point-by-point to the major comments below.
read point-by-point responses
-
Referee: [WRBench protocol] WRBench protocol (methods and evaluation sections): the three-link human-calibrated chain does not contain controls that distinguish true decoupled state evolution from prompt-conditioned interpolation, short-term visible dynamics memory, or statistical motion continuation. Rater judgments based on visible continuity and prompt semantics could therefore conflate surface video prediction with the claimed absence of a persistent kernel; this is load-bearing for the central claim that models 'maintain the observed world as a tracking shot'.
Authors: The three-link chain uses camera motion as an explicit observability intervention, with the third link (returning target consistency with the unobserved event) specifically requiring evidence that the world state advanced while out of view. This goes beyond visible continuity or prompt semantics, as models generate plausible observed segments yet systematically revert targets to their pre-intervention state. While additional ablation controls could further rule out alternatives, the recurrence of this exact failure mode across 23 models supports the interpretation of absent decoupled evolution rather than surface prediction. We will add a dedicated limitations subsection in Methods clarifying these distinctions and potential confounds. revision: partial
-
Referee: [Experiments] Experiments (results on 23 models): no quantitative breakdown is provided on model selection criteria, per-paradigm sample sizes, inter-rater agreement statistics, or confidence intervals for the 'stubborn' failure recurrence. Without these, the assertion that the failure 'recurs across control paradigms, model families, and increments of scale' cannot be assessed for robustness.
Authors: We agree these details strengthen the robustness claim. Model selection prioritized publicly available implementations spanning four paradigms (diffusion, autoregressive, hybrid, and geometry-augmented) with balanced sampling of 400 videos per model. Inter-rater agreement reached Cohen's kappa of 0.82 on a 20% subset, and per-paradigm failure rates ranged 84-93% with 95% CIs of width <6%. We will incorporate a new supplementary table and expanded Experiments paragraph reporting these statistics. revision: yes
Circularity Check
No circularity: empirical benchmark evaluation of external models
full rationale
The paper introduces WRBench as an empirical diagnostic and reports results from evaluating 23 external models on 9600 videos. The central finding (models resume abandoned states rather than advance unobserved events) is presented as an observed outcome across control paradigms and scales, with no mathematical derivation, fitted parameters renamed as predictions, self-citations as load-bearing premises, or ansatz smuggled via prior work. The evaluation chain is external to the models tested and does not reduce any claim to the paper's own inputs by construction. This is a standard empirical benchmark paper whose conclusions rest on observed data rather than definitional or self-referential steps.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Modeling the physical world requires an internal state that keeps evolving over time, decoupled from observation.
Reference graph
Works this paper leans on
-
[1]
Genie: Generative interactive environments
Jake Bruce, Michael D Dennis, Ashley Edwards, Jack Parker-Holder, Yuge Shi, Edward Hughes, Matthew Lai, Aditi Mavalankar, Richie Steigerwald, Chris Apps, et al. Genie: Generative interactive environments. InForty-first International Conference on Machine Learning, 2024
2024
-
[2]
Cosmos world foundation model platform for physical ai.arXiv preprint arXiv:2501.03575, 2025
Niket Agarwal, Arslan Ali, Maciej Bala, Yogesh Balaji, Erik Barker, Tiffany Cai, Prithvijit Chattopadhyay, Yongxin Chen, Yin Cui, Yifan Ding, et al. Cosmos world foundation model platform for physical ai.arXiv preprint arXiv:2501.03575, 2025
Pith/arXiv arXiv 2025
-
[3]
Motionctrl: A unified and flexible motion controller for video generation
Zhouxia Wang, Ziyang Yuan, Xintao Wang, Yaowei Li, Tianshui Chen, Menghan Xia, Ping Luo, and Ying Shan. Motionctrl: A unified and flexible motion controller for video generation. InACM SIGGRAPH 2024 Conference Papers, pages 1–11, 2024
2024
-
[4]
Hao He, Yinghao Xu, Yuwei Guo, Gordon Wetzstein, Bo Dai, Hongsheng Li, and Ceyuan Yang. Cameractrl: Enabling camera control for text-to-video generation.arXiv preprint arXiv:2404.02101, 2024
Pith/arXiv arXiv 2024
-
[5]
Gen3c: 3d-informed world- consistent video generation with precise camera control
Xuanchi Ren, Tianchang Shen, Jiahui Huang, Huan Ling, Yifan Lu, Merlin Nimier-David, Thomas Müller, Alexander Keller, Sanja Fidler, and Jun Gao. Gen3c: 3d-informed world- consistent video generation with precise camera control. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6121–6132, 2025
2025
-
[6]
Spatia: Video generation with updatable spatial memory
Jinjing Zhao, Fangyun Wei, Zhening Liu, Hongyang Zhang, Chang Xu, and Yan Lu. Spatia: Video generation with updatable spatial memory. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4245–4257, 2026
2026
-
[7]
Versecrafter: Dynamic realistic video world model with 4d geometric control
Sixiao Zheng, Minghao Yin, Wenbo Hu, Xiaoyu Li, Ying Shan, and Yanwei Fu. Versecrafter: Dynamic realistic video world model with 4d geometric control. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 40277–40290, 2026
2026
-
[8]
Zicheng Duan, Jiatong Xia, Zeyu Zhang, Wenbo Zhang, Gengze Zhou, Chenhui Gou, Yefei He, Feng Chen, Xinyu Zhang, and Lingqiao Liu. Liveworld: Simulating out-of-sight dynamics in generative video world models.arXiv preprint arXiv:2603.07145, 2026
arXiv 2026
-
[9]
Vbench: Comprehensive benchmark suite for video generative models
Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, et al. Vbench: Comprehensive benchmark suite for video generative models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21807–21818, 2024
2024
-
[10]
Vbench++: Comprehensive and versatile benchmark suite for video generative models.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025
Ziqi Huang, Fan Zhang, Xiaojie Xu, Yinan He, Jiashuo Yu, Ziyue Dong, Qianli Ma, Nattapol Chanpaisit, Chenyang Si, Yuming Jiang, et al. Vbench++: Comprehensive and versatile benchmark suite for video generative models.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025
2025
-
[11]
Dian Zheng, Ziqi Huang, Hongbo Liu, Kai Zou, Yinan He, Fan Zhang, Lulu Gu, Yuanhan Zhang, Jingwen He, Wei-Shi Zheng, et al. Vbench-2.0: Advancing video generation benchmark suite for intrinsic faithfulness.arXiv preprint arXiv:2503.21755, 2025
Pith/arXiv arXiv 2025
-
[12]
Weixi Feng, Jiachen Li, Michael Saxon, Tsu-jui Fu, Wenhu Chen, and William Yang Wang. Tc-bench: Benchmarking temporal compositionality in text-to-video and image-to-video generation.arXiv preprint arXiv:2406.08656, 2024. 19 Preprint
arXiv 2024
-
[13]
Videophy: Evaluating physical commonsense for video generation
Hritik Bansal, Zongyu Lin, Tianyi Xie, Zeshun Zong, Michal Yarom, Yonatan Bitton, Chen- fanfu Jiang, Yizhou Sun, Kai-Wei Chang, and Aditya Grover. Videophy: Evaluating physical commonsense for video generation. InInternational Conference on Learning Representations, volume 2025, pages 102075–102121, 2025
2025
-
[14]
Hritik Bansal, Clark Peng, Yonatan Bitton, Roman Goldenberg, Aditya Grover, and Kai-Wei Chang. Videophy-2: A challenging action-centric physical commonsense evaluation in video generation.arXiv preprint arXiv:2503.06800, 2025
arXiv 2025
-
[15]
Fanqing Meng, Jiaqi Liao, Xinyu Tan, Wenqi Shao, Quanfeng Lu, Kaipeng Zhang, Yu Cheng, Dianqi Li, Yu Qiao, and Ping Luo. Towards world simulator: Crafting physical commonsense- based benchmark for video generation.arXiv preprint arXiv:2410.05363, 2024
Pith/arXiv arXiv 2024
-
[16]
Xuyang Guo, Jiayan Huo, Zhenmei Shi, Zhao Song, Jiahao Zhang, and Jiale Zhao. T2vphysbench: A first-principles benchmark for physical consistency in text-to-video genera- tion.arXiv preprint arXiv:2505.00337, 2025
arXiv 2025
-
[17]
Chronomagic-bench: A benchmark for metamorphic evaluation of text-to-time-lapse video generation.Advances in Neural Information Processing Systems, 37:21236–21270, 2024
Shenghai Yuan, Jinfa Huang, Yongqi Xu, Yaoyang Liu, Shaofeng Zhang, Yujun Shi, Ruijie Zhu, Xinhua Cheng, Jiebo Luo, and Li Yuan. Chronomagic-bench: A benchmark for metamorphic evaluation of text-to-time-lapse video generation.Advances in Neural Information Processing Systems, 37:21236–21270, 2024
2024
-
[18]
T2v- compbench: A comprehensive benchmark for compositional text-to-video generation
Kaiyue Sun, Kaiyi Huang, Xian Liu, Yue Wu, Zihan Xu, Zhenguo Li, and Xihui Liu. T2v- compbench: A comprehensive benchmark for compositional text-to-video generation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 8406–8416, 2025
2025
-
[19]
Is your world simulator a good story presenter? a consecutive events-based benchmark for future long video generation
Yiping Wang, Xuehai He, Kuan Wang, Luyao Ma, Jianwei Yang, Shuohang Wang, Si- mon Shaolei Du, and Yelong Shen. Is your world simulator a good story presenter? a consecutive events-based benchmark for future long video generation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13629–13638, 2025
2025
-
[20]
Worldmodelbench: Judging video generation models as world models.Advances in Neural Information Processing Systems, 38, 2026
Dacheng Li, Yunhao Fang, Yukang Chen, Shuo Yang, Shiyi Cao, Justin Wong, Michael Luo, Xiaolong Wang, Hongxu Yin, Joseph Gonzalez, et al. Worldmodelbench: Judging video generation models as world models.Advances in Neural Information Processing Systems, 38, 2026
2026
-
[21]
Worldscore: A unified evaluation benchmark for world generation
Haoyi Duan, Hong-Xing Yu, Sirui Chen, Li Fei-Fei, and Jiajun Wu. Worldscore: A unified evaluation benchmark for world generation. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 27713–27724, 2025
2025
-
[22]
Xiaojie Xu, Zhengyuan Lin, Kang He, Yukang Feng, Xiaofeng Mao, Yuanyang Yin, Kaipeng Zhang, and Yongtao Ge. Worldmark: A unified benchmark suite for interactive video world models.arXiv preprint arXiv:2604.21686, 2026
Pith/arXiv arXiv 2026
-
[23]
Kaining Ying, Hengrui Hu, Siyu Ren, Jiamu Li, Fengjiao Chen, Ziwen Wang, Xuezhi Cao, Xunliang Cai, and Henghui Ding. Wbench: A comprehensive multi-turn benchmark for interactive video world model evaluation.arXiv preprint arXiv:2605.25874, 2026
Pith/arXiv arXiv 2026
-
[24]
iworld-bench: A benchmark for interactive world models with a unified action generation framework.arXiv e-prints, pages arXiv–2605, 2026
Jianjie Fang, Yingshan Lei, Qin Wan, Ziyou Wang, Yuchao Huang, Yongyan Xu, Baining Zhao, Weichen Zhang, Chen Gao, Xinlei Chen, et al. iworld-bench: A benchmark for interactive world models with a unified action generation framework.arXiv e-prints, pages arXiv–2605, 2026. 20 Preprint
2026
-
[25]
Ziqi Ma, Mengzhan Liufu, and Georgia Gkioxari. Out of sight, out of mind? evaluating state evolution in video world models.arXiv preprint arXiv:2603.13215, 2026
arXiv 2026
-
[26]
Shengjun Zhang, Zhang Zhang, Simin Huang, Zhenyu Tang, Hanyang Wang, Chensheng Dai, Min Chen, Yifan Li, Yuxin Li, Yingjie Chen, et al. Mbench: A comprehensive benchmark on memory capability for video world models.arXiv preprint arXiv:2606.00793, 2026
Pith/arXiv arXiv 2026
-
[27]
Yixuan Ye, Xuanyu Lu, Yuxin Jiang, Yuchao Gu, Rui Zhao, Qiwei Liang, Jiachun Pan, Fengda Zhang, Weijia Wu, and Alex Jinpeng Wang. Mind: Benchmarking memory consistency and action control in world models.arXiv preprint arXiv:2602.08025, 2026
arXiv 2026
-
[28]
Fetv: A benchmark for fine-grained evaluation of open-domain text-to-video generation.Advances in Neural Information Processing Systems, 36:62352–62387, 2023
Yuanxin Liu, Lei Li, Shuhuai Ren, Rundong Gao, Shicheng Li, Sishuo Chen, Xu Sun, and Lu Hou. Fetv: A benchmark for fine-grained evaluation of open-domain text-to-video generation.Advances in Neural Information Processing Systems, 36:62352–62387, 2023
2023
-
[29]
Evalcrafter: Benchmarking and evaluating large video generation models
Yaofang Liu, Xiaodong Cun, Xuebo Liu, Xintao Wang, Yong Zhang, Haoxin Chen, Yang Liu, Tieyong Zeng, Raymond Chan, and Ying Shan. Evalcrafter: Benchmarking and evaluating large video generation models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 22139–22149, 2024
2024
-
[30]
Wangbo Yu, Jinbo Xing, Li Yuan, Wenbo Hu, Xiaoyu Li, Zhipeng Huang, Xiangjun Gao, Tien-Tsin Wong, Ying Shan, and Yonghong Tian. Viewcrafter: Taming video diffusion models for high-fidelity novel view synthesis.arXiv preprint arXiv:2409.02048, 2024
Pith/arXiv arXiv 2024
-
[31]
Cat4d: Create anything in 4d with multi-view video diffusion models
Rundi Wu, Ruiqi Gao, Ben Poole, Alex Trevithick, Changxi Zheng, Jonathan T Barron, and Aleksander Holynski. Cat4d: Create anything in 4d with multi-view video diffusion models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26057–26068, 2025
2025
-
[32]
Recammaster: Camera-controlled generative rendering from a single video
Jianhong Bai, Menghan Xia, Xiao Fu, Xintao Wang, Lianrui Mu, Jinwen Cao, Zuozhu Liu, Haoji Hu, Xiang Bai, Pengfei Wan, et al. Recammaster: Camera-controlled generative rendering from a single video. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 14834–14844, 2025
2025
-
[33]
VGGT-Ω.arXiv preprint arXiv:2605.15195, 2026
Jianyuan Wang, Minghao Chen, Shangzhan Zhang, Nikita Karaev, Johannes Schönberger, Patrick Labatut, Piotr Bojanowski, David Novotny, Andrea Vedaldi, and Christian Rupprecht. VGGT-Ω.arXiv preprint arXiv:2605.15195, 2026
Pith/arXiv arXiv 2026
-
[34]
Vggt: Visual geometry grounded transformer
Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Visual geometry grounded transformer. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 5294–5306, 2025
2025
-
[35]
Dinov2: Learning robust visual features without supervision.arXiv preprint arXiv:2304.07193, 2023
Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision.arXiv preprint arXiv:2304.07193, 2023
Pith/arXiv arXiv 2023
-
[36]
Computing inter-rater reliability and its variance in the presence of high agreement.British Journal of Mathematical and Statistical Psychology, 61(1):29–48, 2008
Kilem Li Gwet. Computing inter-rater reliability and its variance in the presence of high agreement.British Journal of Mathematical and Statistical Psychology, 61(1):29–48, 2008
2008
-
[37]
Computing krippendorff’s alpha-reliability
Klaus Krippendorff. Computing krippendorff’s alpha-reliability. 2011
2011
-
[38]
Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 2025
Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 2025. 21 Preprint
Pith/arXiv arXiv 2025
-
[39]
Advancing open-source world models.arXiv preprint arXiv:2601.20540, 2026
Robbyant Team, Zelin Gao, Qiuyu Wang, Yanhong Zeng, Jiapeng Zhu, Ka Leong Cheng, Yixuan Li, Hanlin Wang, Yinghao Xu, Shuailei Ma, et al. Advancing open-source world models.arXiv preprint arXiv:2601.20540, 2026
Pith/arXiv arXiv 2026
-
[40]
Out of sight but not out of mind: Hybrid memory for dynamic video world models
Kaijin Chen, Dingkang Liang, Xin Zhou, Yikang Ding, Xiaoqiang Liu, Pengfei Wan, and Xiang Bai. Out of sight but not out of mind: Hybrid memory for dynamic video world models. arXiv preprint arXiv:2603.25716, 2026
arXiv 2026
-
[41]
InSpatio Team, Donghui Shen, Guofeng Zhang, Haomin Liu, Haoyu Ji, Hujun Bao, Hongjia Zhai, Jialin Liu, Jing Guo, Nan Wang, et al. Inspatio-world: A real-time 4d world simulator via spatiotemporal autoregressive modeling.arXiv preprint arXiv:2604.07209, 2026
Pith/arXiv arXiv 2026
-
[42]
Jiaqi Li, Junshu Tang, Zhiyong Xu, Longhuang Wu, Yuan Zhou, Shuai Shao, Tianbao Yu, Zhiguo Cao, and Qinglin Lu. Hunyuan-gamecraft: High-dynamic interactive game video generation with hybrid history condition.arXiv preprint arXiv:2506.17201, 2(3):6, 2025
arXiv 2025
-
[43]
Wenqiang Sun, Haiyu Zhang, Haoyuan Wang, Junta Wu, Zehan Wang, Zhenwei Wang, Yunhong Wang, Jun Zhang, Tengfei Wang, and Chunchao Guo. Worldplay: Towards long-term geometric consistency for real-time interactive world modeling.arXiv preprint arXiv:2512.14614, 2025
Pith/arXiv arXiv 2025
-
[44]
Guangyuan Li, Siming Zheng, Shuolin Xu, Jinwei Chen, Bo Li, Xiaobin Hu, Lei Zhao, and Peng-Tao Jiang. Magicworld: Interactive geometry-driven video world exploration.arXiv preprint arXiv:2511.18886, 2025. 22 Preprint. Appendix Table of contents: •§A: Use of Large Language Models (LLMs) •§B: Reproducibility and Metric–Dataset Records •§C: Limitations and F...
arXiv 2025
-
[45]
Source-only control.Controls are extracted from the source view or source video only, so the target-view endpoint is not supplied
-
[46]
Endpoint-masked target control.Target-view controls are allowed, but the object/contact endpoint region is masked or ambiguous
-
[47]
Reward work or policy training should follow the same separation
Full target control.Target-view dense controls provide an upper-bound control-following condition, not evidence of internal world-state persistence. Reward work or policy training should follow the same separation. WRBench records can mine preference pairs over target-relative camera displacement, visual integrity, judgeable re-observation, and re-observe...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.