arxiv: 2606.31734 · v1 · pith:7TBGVBXCnew · submitted 2026-06-30 · 💻 cs.CV

MemLearner: Learning to Query Context Memory for Video World Models

Pith reviewed 2026-07-01 05:28 UTC · model grok-4.3

classification 💻 cs.CV
keywords video world models · context memory · query tokens · scene consistency · occlusion handling · video generation · adaptive context retrieval · multi-dataset training

The pith

Learned query tokens let video world models maintain scene consistency over long sequences with occlusions and motion.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a method called MemLearner to solve the memory problem in video world models, where generated scenes become inconsistent after many frames. It replaces rule-based frame retrieval with query tokens that learn to pull relevant past context directly from the pre-trained generation model. This uses the model's existing visual knowledge without adding new modules trained from scratch and trains on a mix of rendered and real videos that include occlusions and moving objects. A reader would care because reliable long-horizon video prediction matters for interactive simulation and planning tasks.

Core claim

MemLearner is a learning-based adaptive context query method that inserts query tokens to bridge stored context frames and newly predicted tokens. By letting the video generation model itself perform the querying, the approach exploits pre-trained visual priors and avoids training extra modules from scratch. The authors collect a dataset of long videos with occlusions and dynamic objects, paired with camera pose annotations, and train with a multi-dataset strategy that combines annotated rendered sequences with unannotated real-world footage. Experiments are reported to show stronger scene consistency and memory than prior rule-based retrieval, especially when objects are occluded or moving.

What carries the argument

Query tokens inserted to bridge context frames and predicted tokens, allowing the pre-trained model to learn adaptive retrieval of relevant history.
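The bridging role of the query tokens can be pictured as a block attention mask over the concatenated C (context), Q (query), and P (predicted) tokens named in the Figure 2 caption. The sketch below is illustrative only: the visibility rule (C sees C, Q sees C and Q, P sees Q and P) is an assumption chosen to show P reaching context solely through Q, not the pattern the paper reports, and `build_mask` is a hypothetical helper.

```python
from typing import Dict, List, Set


def build_mask(n_c: int, n_q: int, n_p: int) -> List[List[bool]]:
    """Return mask[i][j] == True when token i (as query) may attend to token j (as key)."""
    # Token order follows the C, Q, P concatenation.
    groups = ["C"] * n_c + ["Q"] * n_q + ["P"] * n_p
    # ASSUMED visibility rule, not taken from the paper.
    allowed: Dict[str, Set[str]] = {
        "C": {"C"},        # context tokens only see each other
        "Q": {"C", "Q"},   # query tokens read from context
        "P": {"Q", "P"},   # predicted tokens reach context only through Q
    }
    return [[key in allowed[query] for key in groups] for query in groups]


mask = build_mask(n_c=2, n_q=1, n_p=2)
for row in mask:
    print("".join("x" if visible else "." for visible in row))
# xx...
# xx...
# xxx..
# ..xxx
# ..xxx
```

In a real Diffusion Transformer such a mask would be applied per attention layer; the ablations referenced in Figures 8 and 12 suggest the authors compare several attention scopes, which this single rule does not cover.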

If this is right

  • Scene consistency improves in videos with frequent occlusions and dynamic objects.
  • Training succeeds on mixed datasets of annotated rendered videos and unannotated real videos.
  • No new modules need training from scratch because the existing generation model supplies the priors.
  • Efficient training and inference strategies become possible once query tokens replace manual retrieval rules.
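The Figure 4 caption describes one efficiency strategy: shallow Query Layers attend over C, Q, and P tokens, while deep Generative Layers use only Q and P. A rough way to see why this helps is to count attention pairs per layer. The sketch below assumes dense self-attention with cost proportional to the squared token count; the 5 + 23 layer split is borrowed from the setting labels in the Figure 12 table under the assumption that they denote layer counts, and the token counts are made up for illustration.

```python
def split_attention_pairs(n_c: int, n_q: int, n_p: int,
                          query_layers: int, generative_layers: int) -> int:
    """Attention pairs when only the shallow layers see the context tokens."""
    full = (n_c + n_q + n_p) ** 2   # Query Layers: C, Q, P all present
    reduced = (n_q + n_p) ** 2      # Generative Layers: Q and P only
    return query_layers * full + generative_layers * reduced


def dense_attention_pairs(n_c: int, n_q: int, n_p: int, total_layers: int) -> int:
    """Attention pairs when every layer attends over C, Q, and P."""
    return total_layers * (n_c + n_q + n_p) ** 2


# Hypothetical sizes: 8 context units, 1 query unit, 1 predicted unit.
split_cost = split_attention_pairs(8, 1, 1, query_layers=5, generative_layers=23)
dense_cost = dense_attention_pairs(8, 1, 1, total_layers=28)
print(split_cost, dense_cost)  # 592 2800
```

The second strategy in the caption (removing unnecessary attention computation) is not modelled here, since the source does not say which computations are dropped.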

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same token-based querying idea could transfer to other sequence models that need long-term memory without extra retrieval networks.
  • Performance on very long horizons beyond the training lengths would test whether the learned queries scale without drift.
  • Combining the query mechanism with explicit action inputs might further stabilize predictions in interactive settings.
  • The multi-dataset strategy suggests similar gains are possible in other domains that mix synthetic and real sequential data.

Load-bearing premise

That the pre-trained video generation model already holds enough visual knowledge for its own query tokens to retrieve useful context more effectively than any fixed rule-based method.

What would settle it

Run both MemLearner and a strong rule-based baseline on a held-out set of long videos containing repeated full occlusions of moving objects and measure whether object identities and layouts remain consistent for more than 30 seconds.
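One hedged way to score such a test is to compare each revisited frame against the earlier frame of the same view using PSNR, a metric the Figure 12 table also reports. The sketch below is a minimal pure-Python version; `revisit_consistency`, the flattened-frame representation, and the pairing of frame indices are assumptions for illustration, not the paper's evaluation protocol.

```python
import math
from typing import List, Sequence, Tuple


def psnr(a: Sequence[float], b: Sequence[float], max_val: float = 1.0) -> float:
    """Peak signal-to-noise ratio between two equally sized flattened frames."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("frames must be non-empty and of equal length")
    mse = sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)
    if mse == 0:
        return math.inf  # identical frames
    return 10.0 * math.log10(max_val ** 2 / mse)


def revisit_consistency(frames: List[Sequence[float]],
                        revisit_pairs: List[Tuple[int, int]]) -> float:
    """Mean PSNR over (earlier, later) index pairs assumed to show the same view."""
    if not revisit_pairs:
        raise ValueError("need at least one revisit pair")
    scores = [psnr(frames[i], frames[j]) for i, j in revisit_pairs]
    return sum(scores) / len(scores)


# Toy example: frame 2 revisits the view of frame 0 with a small error.
frames = [[0.0, 0.5, 1.0], [0.2, 0.2, 0.2], [0.0, 0.5, 0.9]]
score = revisit_consistency(frames, [(0, 2)])
```

Higher values mean the revisited view matches the earlier one more closely; a perceptual metric such as LPIPS would need a learned network and is left out of this sketch.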

Figures

Figures reproduced from arXiv: 2606.31734 by Jianhong Bai, Jianxiong Gao, Jiwen Yu, Kaiyi Huang, Kun Gai, Pengfei Wan, Quande Liu, Xihui Liu, Xintao Wang, Yiran Qin.

Figure 1: Teaser Demonstration. We propose a novel memory mechanism for video world models through learning-based adaptive context querying. Compared to prior rule-based context retrieval methods [36, 58, 66], our approach handles scenes with occlusions and dynamic objects. This figure highlights dynamic objects, representative occluders, video generation trajectory, and key frames.
Figure 2: Architecture clarification. (a) Interaction mechanism among C, Q, and P tokens; (b) & (c) Two designs for context querying, where the alternative design in (b) fails in experiments (Sec. 5.4), and the adopted design in (c) leverages the prior knowledge of the video generation model itself and performs effectively.
Figure 3: Model Architecture. Our video generation model adopts a Diffusion Transformer. We concatenate C, Q, P tokens for context-conditioned long video generation. An optional camera encoder supports interactive control.
2.3 Learnable Query Tokens in Various Domains. Learnable query tokens have been widely adopted for compressing visual context across various domains: In multimodal language models, Perceiver Resa…
Figure 4: Efficiency Strategies. (a) Strategy #1: Context querying in shallow Query Layers with C, Q, P tokens; deep Generative Layers use only Q, P tokens. (b) Strategy #2: Remove unnecessary attention computation for improved efficiency.
where θ denotes all learnable parameters. Note that supervision is applied only to the noise predicted on predicted tokens. We concatenate C, Q and P along the frame dimension and …
Figure 5: Qualitative Results. MemLearner effectively handles both indoor (a, b) and outdoor (c, …
Figure 6: Real-World Qualitative Results. By incorporating real-world videos into training dataset, MemLearner generalizes to real-world scenes.
MemLearner consistently outperforms baselines across all three evaluation settings. We further provide a user study in Appendix C.4, in which MemLearner is preferred over other SOTAs in terms of visual quality and scene consistency. 5.3 Qualitative Results. We present quali…
Figure 7: Qualitative Comparison. MemLearner achieves optimal memory and visual quality, demonstrating learning-based adaptive context query effectiveness. Other methods show inconsistent memory performance.
Figure 8: Attention computation settings for ablation study in Tab. 4. Performance impact of attention scope when Q tokens act as queries.
Figure 9: Overview of the base text-to-video generation model.
C Supplementary Experimental Results. C.1 Results on Open-Source Video Models. To verify that MemLearner generalizes across different models, we apply our method to Wan 2.1 (T2V-1.3B) [52], an open-source video Diffusion Transformer [43, 44]. Wan 2.1 shares a similar architecture with our internal model: each Transformer block contains a 3D attention modu…
Figure 10: Pseudocode for 3D attention computation. Generative Layers process P and Q tokens, while Query Layers process all C, Q, P tokens with different attention patterns.
Adaptive Attention Across Frames and Timesteps. As shown in the left half of …
Figure 11: Attention visualization results.
C.3 Evaluation on Real-World SpatialVID Dataset. To evaluate generalization to real-world scenarios, we conduct quantitative experiments on the SpatialVID dataset [53]. We select SpatialVID for two reasons: (1) it contains long videos with sequences up to 900 frames, and (2) it provides per-frame camera pose annotations, enabling controlled evaluation. We randomly sample …
Figure 12: Additional attention computation settings. Horizontal tokens serve as queries; vertical tokens serve as keys/values.

                  GT Comp.            Revisit Comp.
  Setting         PSNR↑    LPIPS↓     PSNR↑    LPIPS↓     Speed (fps)↑
  (a)*5+(b)*23    21.23    0.2904     18.57    0.3230     0.54
  (c)*5+(b)*23    21.15    0.2976     18.43    0.3291     0.46
  (c)*28          21.27    0.2914     18.62    0.3212     0.24
  (d)*28          21.34    0.2895     18.67    0.3227     0.28
Figure 13: Additional Dataset Ablation Results. “Rendered” denotes results trained only on rendered videos, while “+Real” indicates results after adding real-world videos to the training set.
Figure 14: Additional Qualitative Comparison Results.
Original abstract

Video World Models are interactive video generation models that predict future world states based on user actions and history video frames. A critical challenge in video world models is the lack of memory, causing inconsistent generated scenes over extended durations. Previous methods explored rule-based context frame retrieval as memory, but they fail to generalize in scenarios with scene occlusions and dynamic objects. We propose MemLearner, a learning-based adaptive context query method using query tokens to bridge context and predicted tokens. By leveraging the video generation model itself for context querying, MemLearner exploits pre-trained visual priors without training additional modules from scratch, and incorporates efficient strategies for training and inference. We collect a dataset of long videos with scene occlusions and dynamic objects, paired with camera pose annotations, and propose a multi-dataset training strategy leveraging both annotated rendered and unannotated real-world videos. Extensive experiments demonstrate that MemLearner significantly outperforms prior video world models in terms of scene consistency and memory, particularly under challenging occlusion and dynamic scenarios.
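The abstract states that training mixes annotated rendered videos with unannotated real-world videos but does not say how batches are formed. A minimal sketch of one plausible scheme follows: each sample is drawn from the real pool with probability `p_real`, and clips without camera poses carry `None` so that pose conditioning can be skipped for them. The `Clip` type, `sample_batch`, and `p_real` are hypothetical names introduced here, and the mixing rule is an assumption rather than the authors' recipe.

```python
import random
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Clip:
    name: str
    # Per-frame camera poses; None marks an unannotated real-world clip.
    camera_poses: Optional[List[List[float]]]


def sample_batch(rendered: List[Clip], real: List[Clip], batch_size: int,
                 p_real: float, rng: random.Random) -> List[Clip]:
    """Draw a mixed batch; each slot comes from `real` with probability p_real."""
    batch = []
    for _ in range(batch_size):
        pool = real if rng.random() < p_real else rendered
        batch.append(rng.choice(pool))
    return batch


def has_pose(clip: Clip) -> bool:
    """True when camera conditioning can be applied to this clip."""
    return clip.camera_poses is not None


rendered_pool = [Clip("rendered_0", [[0.0, 0.0, 0.0]]), Clip("rendered_1", [[1.0, 0.0, 0.0]])]
real_pool = [Clip("real_0", None), Clip("real_1", None)]
batch = sample_batch(rendered_pool, real_pool, batch_size=8, p_real=0.5,
                     rng=random.Random(0))
pose_flags = [has_pose(clip) for clip in batch]
```

The `pose_flags` list is the piece a training loop would use to decide, per sample, whether to feed the optional camera encoder mentioned in the Figure 3 caption.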

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper proposes MemLearner, a learning-based method that inserts query tokens into a pre-trained video generator to perform adaptive context retrieval as memory for video world models. It collects a dataset of long videos with occlusions and dynamic objects (with camera poses), uses a multi-dataset training regime on both rendered and real videos, and claims this yields better scene consistency than rule-based retrieval without training new modules from scratch.

Significance. If the empirical claims hold, the approach would demonstrate that query-token mechanisms can exploit pre-trained visual priors for memory in long-horizon video generation, offering a lightweight alternative to rule-based or fully retrained memory modules under occlusion and dynamics.

major comments (1)
  1. [Abstract] Abstract: the central claim that MemLearner 'significantly outperforms prior video world models in terms of scene consistency and memory' is asserted without any metrics, baselines, ablation results, or experimental details, so the primary empirical contribution cannot be assessed.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their feedback on the abstract. We address the comment below.

  1. Referee: [Abstract] Abstract: the central claim that MemLearner 'significantly outperforms prior video world models in terms of scene consistency and memory' is asserted without any metrics, baselines, ablation results, or experimental details, so the primary empirical contribution cannot be assessed.

    Authors: We agree that the abstract presents the performance claim without supporting quantitative details, which limits immediate assessment of the empirical contribution. The full manuscript contains the requested elements (metrics, baselines, ablations) in the Experiments section. We will revise the abstract to include key quantitative results and baseline comparisons that substantiate the claim. revision: yes

Circularity Check

0 steps flagged

No significant circularity

The paper describes an empirical ML architecture (query tokens inserted into a pre-trained video generator) trained on a newly collected multi-dataset regime of long occluded/dynamic videos. No derivation chain, equations, or first-principles result is presented that reduces a claimed prediction to a quantity defined by the method's own fitted parameters. Performance claims rest on experimental comparisons rather than self-referential definitions or load-bearing self-citations. The construction is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

Information is limited to the abstract; no equations or implementation details are available to identify fitted parameters or additional entities.

axioms (2)
  • domain assumption: Video world models lack memory, causing inconsistent generated scenes over extended durations.
    Explicitly stated as the critical challenge in the abstract.
  • domain assumption: Rule-based context frame retrieval fails to generalize in scenarios with scene occlusions and dynamic objects.
    Stated as the limitation of previous methods in the abstract.
invented entities (1)
  • query tokens (no independent evidence)
    purpose: To bridge context and predicted tokens for adaptive context querying.
    Introduced as the core mechanism of MemLearner in the abstract.

pith-pipeline@v0.9.1-grok · 5730 in / 1243 out tokens · 31421 ms · 2026-07-01T05:28:58.700454+00:00 · methodology


Reference graph

Works this paper leans on

75 extracted references · 46 canonical work pages · 23 internal anchors

  1. [1]

    ai, S., Teng, H., Jia, H., Sun, L., Li, L., Li, M., Tang, M., Han, S., Zhang, T., Zhang, W.Q., Luo, W., Kang, X., Sun, Y., Cao, Y., Huang, Y., Lin, Y., Fang, Y., Tao, Z., Zhang, Z., Wang, Z., Liu, Z., Shi, D., Su, G., Sun, H., Pan, H., Wang, J., Sheng, J., Cui, M., Hu, M., Yan, M., Yin, S., Zhang, S., Liu, T., Yin, X., Yang, X., Song, X., Hu, X., Zhang, Y...

  2. [2]

    Alayrac, J.B., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson, Y., Lenc, K., Mensch, A., Millican, K., Reynolds, M., et al.: Flamingo: a visual language model for few-shot learning. Advances in neural information processing systems 35, 23716–23736 (2022)

  3. [3]

    Arnab, A., Iscen, A., Caron, M., Fathi, A., Schmid, C.: Temporal chain of thought: Long-video understanding by thinking in frames. arXiv preprint arXiv:2507.02001 (2025)

  4. [4]

    V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

    Assran, M., Bardes, A., Fan, D., Garrido, Q., Howes, R., Muckley, M., Rizvi, A., Roberts, C., Sinha, K., Zholus, A., et al.: V-jepa 2: Self-supervised video models enable understanding, prediction and planning. arXiv preprint arXiv:2506.09985 (2025)

  5. [5]

    Bai, J., Xia, M., Fu, X., Wang, X., Mu, L., Cao, J., Liu, Z., Hu, H., Bai, X., Wan, P., et al.: Recammaster: Camera-controlled generative rendering from a single video. arXiv preprint arXiv:2503.11647 (2025)

  6. [6]

    Ball, P.J., Bauer, J., Belletti, F., Brownfield, B., Ephrat, A., Fruchter, S., Gupta, A., Holsheimer, K., Holynski, A., Hron, J., Kaplanis, C., Limont, M., McGill, M., Oliveira, Y., Parker-Holder, J., Perbet, F., Scully, G., Shar, J., Spencer, S., Tov, O., Villegas, R., Wang, E., Yung, J., Baetu, C., Berbel, J., Bridson, D., Bruce, J., Buttimore, G., Chak...

  7. [7]

    Bao, F., Xiang, C., Yue, G., He, G., Zhu, H., Zheng, K., Zhao, M., Liu, S., Wang, Y., Zhu, J.: Vidu: a highly consistent, dynamic and skilled text-to-video generator with diffusion models. arXiv preprint arXiv:2405.04233 (2024)

  8. [8]

    Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020)

  9. [9]

    Buch, S., Nagrani, A., Arnab, A., Schmid, C.: Flexible frame selection for efficient video reasoning. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 29071–29082 (2025)

  10. [10]

    Cai, S., Yang, C., Zhang, L., Guo, Y., Xiao, J., Yang, Z., Xu, Y., Yang, Z., Yuille, A., Guibas, L., Agrawala, M., Jiang, L., Wetzstein, G.: Mixture of contexts for long video generation. In: arXiv (2025)

  11. [11]

    Chen, B., Monso, D.M., Du, Y., Simchowitz, M., Tedrake, R., Sitzmann, V.: Diffusion forcing: Next-token prediction meets full-sequence diffusion. arXiv preprint arXiv:2407.01392 (2024)

  12. [12]

    VRAG: Learning World Models for Interactive Video Generation

    Chen, T., Hu, X., Ding, Z., Jin, C.: Learning world models for interactive video generation. arXiv preprint arXiv:2505.21996 (2025)

  13. [13]

    Self-Forcing++: Towards Minute-Scale High-Quality Video Generation

    Cui, J., Wu, J., Li, M., Yang, T., Li, X., Wang, R., Bai, A., Ban, Y., Hsieh, C.J.: Self-forcing++: Towards minute-scale high-quality video generation. arXiv preprint arXiv:2510.02283 (2025)

  14. [14]

    Cui, Y., Chen, H., Deng, H., Huang, X., Li, X., Liu, J., Liu, Y., Luo, Z., Wang, J., Wang, W., Wang, Y., Wang, C., Zhang, F., Zhao, Y., Pan, T., Li, X., Hao, Z., Ma, W., Chen, Z., Ao, Y., Huang, T., Wang, Z., Wang, X.: Emu3.5: Native multimodal models are world learners (2025), https://arxiv.org/abs/2510.26583

  15. [15]

    Damen, D., Doughty, H., Farinella, G.M., Fidler, S., Furnari, A., Kazakos, E., Moltisanti, D., Munro, J., Perrett, T., Price, W., Wray, M.: Scaling egocentric vision: The epic-kitchens dataset. In: Proceedings of the European Conference on Computer Vision (ECCV) (2018)

  16. [16]

    Decart, E.: Oasis: A universe in a transformer. https://oasis-model.github.io/ (2024)

  17. [17]

    DeepMind, G.: Veo 2: Our state-of-the-art video generation model. https://deepmind.google/technologies/veo/veo-2/ (2024)

  18. [18]

    Autoregressive Video Generation without Vector Quantization

    Deng, H., Pan, T., Diao, H., Luo, Z., Cui, Y., Lu, H., Shan, S., Qi, Y., Wang, X.: Autoregressive video generation without vector quantization. arXiv preprint arXiv:2412.14169 (2024)

  19. [19]

    Fu, X., Liu, X., Wang, X., Peng, S., Xia, M., Shi, X., Yuan, Z., Wan, P., Zhang, D., Lin, D.: 3dtrajmaster: Mastering 3d trajectory for multi-entity motion in video generation. In: ICLR (2025)

  20. [20]

    Long-Context Autoregressive Video Modeling with Next-Frame Prediction

    Gu, Y., Mao, W., Shou, M.Z.: Long-context autoregressive video modeling with next-frame prediction. arXiv preprint arXiv:2503.19325 (2025)

  21. [21]

    Guo, Y., Yang, C., Yang, Z., Ma, Z., Lin, Z., Yang, Z., Lin, D., Jiang, L.: Long context tuning for video generation. arXiv preprint arXiv:2503.10589 (2025)

  22. [22]

    Ha, D., Schmidhuber, J.: Recurrent world models facilitate policy evolution. Advances in neural information processing systems 31 (2018)

  23. [23]

    He, H., Xu, Y., Guo, Y., Wetzstein, G., Dai, B., Li, H., Yang, C.: Cameractrl: Enabling camera control for text-to-video generation. arXiv preprint arXiv:2404.02101 (2024)

  24. [24]

    Classifier-Free Diffusion Guidance

    Ho, J., Salimans, T.: Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598 (2022)

  25. [25]

    Relic: Interactive video world model with long-horizon memory

    Hong, Y., Mei, Y., Ge, C., Xu, Y., Zhou, Y., Bi, S., Hold-Geoffroy, Y., Roberts, M., Fisher, M., Shechtman, E., et al.: Relic: Interactive video world model with long-horizon memory. arXiv preprint arXiv:2512.04040 (2025)

  26. [26]

    Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion

    Huang, X., Li, Z., He, G., Zhou, M., Shechtman, E.: Self forcing: Bridging the train-test gap in autoregressive video diffusion. arXiv preprint arXiv:2506.08009 (2025)

  27. [27]

    Huang, Z., He, Y., Yu, J., Zhang, F., Si, C., Jiang, Y., Zhang, Y., Wu, T., Jin, Q., Chanpaisit, N., Wang, Y., Chen, X., Wang, L., Lin, D., Qiao, Y., Liu, Z.: Vbench: Comprehensive benchmark suite for video generative models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2024)

  28. [28]

    Kanervisto, A., Bignell, D., Wen, L.Y., Grayson, M., Georgescu, R., Valcarcel Macua, S., Tan, S.Z., Rashid, T., Pearce, T., Cao, Y., et al.: World and human action models towards gameplay ideation. Nature 638(8051), 656–663 (2025)

  29. [29]

    Kim, S.W., Zhou, Y., Philion, J., Torralba, A., Fidler, S.: Learning to simulate dynamic environments with gamegan. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 1231–1240 (2020)

  30. [30]

    Kingma, D.P., Welling, M., et al.: Auto-encoding variational bayes (2013)

  31. [31]

    Kling: Kling ai: Next-generation ai creative studio. https://app.klingai.com/ (2024)

  32. [32]

    VideoPoet: A Large Language Model for Zero-Shot Video Generation

    Kondratyuk, D., Yu, L., Gu, X., Lezama, J., Huang, J., Schindler, G., Hornung, R., Birodkar, V., Yan, J., Chiu, M.C., et al.: Videopoet: A large language model for zero-shot video generation. arXiv preprint arXiv:2312.14125 (2023)

  33. [33]

    HunyuanVideo: A Systematic Framework For Large Video Generative Models

    Kong, W., Tian, Q., Zhang, Z., Min, R., Dai, Z., Zhou, J., Xiong, J., Li, X., Wu, B., Zhang, J., et al.: Hunyuanvideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603 (2024)

  34. [34]

    Labs, W.: Generating worlds. https://www.worldlabs.ai/blog/generating-worlds (2024)

  35. [35]

    Li, J., Li, D., Savarese, S., Hoi, S.: Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In: International conference on machine learning. pp. 19730–19742. PMLR (2023)

  36. [36]

    Li, R., Torr, P., Vedaldi, A., Jakab, T.: Vmem: Consistent interactive video scene generation with surfel-indexed view memory. arXiv preprint arXiv:2506.18903 (2025)

  37. [37]

    Li, T., Tian, Y., Li, H., Deng, M., He, K.: Autoregressive image generation without vector quantization. arXiv preprint arXiv:2406.11838 (2024)

  38. [38]

    Stable video infinity: Infinite-length video generation with error recycling

    Li, W., Pan, W., Luan, P.C., Gao, Y., Alahi, A.: Stable video infinity: Infinite-length video generation with error recycling. arXiv preprint arXiv:2510.09212 (2025)

  39. [39]

    Li, Z., Li, C., Mao, X., Lin, S., Li, M., Zhao, S., Xu, Z., Li, X., Feng, Y., Sun, J., Li, Z., Zhang, F., Ai, J., Wang, Z., Wu, Y., He, T., Pang, J., Qiao, Y., Jia, Y., Zhang, K.: Sekai: A video dataset towards world exploration. arXiv preprint arXiv:2506.15675 (2025)

  40. [40]

    Li, Z., Yu, H.X., Liu, W., Yang, Y., Herrmann, C., Wetzstein, G., Wu, J.: Wonderplay: Dynamic 3d scene generation from a single image and actions. In: Proceedings of the IEEE/CVF international conference on computer vision (2025)

  41. [41]

    Rolling Forcing: Autoregressive Long Video Diffusion in Real Time

    Liu, K., Hu, W., Xu, J., Shan, Y., Lu, S.: Rolling forcing: Autoregressive long video diffusion in real time. arXiv preprint arXiv:2509.25161 (2025)

  42. [42]

    Ma, B., Gao, H., Deng, H., Luo, Z., Huang, T., Tang, L., Wang, X.: You see it, you got it: Learning 3d creation on pose-free videos at scale. arXiv preprint arXiv:2412.06699 (2024)

  43. [43]

    OpenAI: Creating video from text. https://openai.com/index/sora/ (2024)

  44. [44]

    Peebles, W., Xie, S.: Scalable diffusion models with transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (2023)

  45. [45]

    Po, R., Nitzan, Y., Zhang, R., Chen, B., Dao, T., Shechtman, E., Wetzstein, G., Huang, X.: Long-context state-space video world models (2025), https://arxiv.org/abs/2505.20171

  46. [46]

    Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International conference on machine learning (2021)

  47. [47]

    Ren, X., Shen, T., Huang, J., Ling, H., Lu, Y., Nimier-David, M., Müller, T., Keller, A., Fidler, S., Gao, J.: Gen3c: 3d-informed world-consistent video generation with precise camera control. arXiv preprint arXiv:2503.03751 (2025)

  48. [48]

    Runway: Runway: Tools for human imagination. https://runwayml.com/ (2024)

  49. [49]

    History-Guided Video Diffusion

    Song, K., Chen, B., Simchowitz, M., Du, Y., Tedrake, R., Sitzmann, V.: History-guided video diffusion. arXiv preprint arXiv:2502.06764 (2025)

  50. [50]

    WorldPlay: Towards Long-Term Geometric Consistency for Real-Time Interactive World Modeling

    Sun, W., Zhang, H., Wang, H., Wu, J., Wang, Z., Wang, Z., Wang, Y., Zhang, J., Wang, T., Guo, C.: Worldplay: Towards long-term geometric consistency for real-time interactive world modeling. arXiv preprint arXiv:2512.14614 (2025)

  51. [51]

    Diffusion Models Are Real-Time Game Engines

    Valevski, D., Leviathan, Y., Arar, M., Fruchter, S.: Diffusion models are real-time game engines. arXiv preprint arXiv:2408.14837 (2024)

  52. [52]

    Wan: Open and Advanced Large-Scale Video Generative Models

    Wang, A., Ai, B., Wen, B., Mao, C., Xie, C.W., Chen, D., Yu, F., Zhao, H., Yang, J., Zeng, J., et al.: Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314 (2025)

  53. [53]

    Wang, J., Yuan, Y., Zheng, R., Lin, Y., Gao, J., Chen, L.Z., Bao, Y., Zhang, Y., Zeng, C., Zhou, Y., Long, X., Zhu, H., Zhang, Z., Cao, X., Yao, Y.: Spatialvid: A large-scale video dataset with spatial annotations (2025), https://arxiv.org/abs/2509.09676

  54. [54]

    Emu3: Next-Token Prediction is All You Need

    Wang, X., Zhang, X., Luo, Z., Sun, Q., Cui, Y., Wang, J., Zhang, F., Wang, Y., Li, Z., Yu, Q., et al.: Emu3: Next-token prediction is all you need. arXiv preprint arXiv:2409.18869 (2024)

  55. [55]

    Wang, Z., Yuan, Z., Wang, X., Li, Y., Chen, T., Xia, M., Luo, P., Shan, Y.: Motionctrl: A unified and flexible motion controller for video generation. In: ACM SIGGRAPH 2024 Conference Papers (2024)

  56. [56]

    Wu, T., Yang, S., Po, R., Xu, Y., Liu, Z., Lin, D., Wetzstein, G.: Video world models with long-term spatial memory (2025), https://arxiv.org/abs/2506.05284

  57. [57]

    Wu, Z., Xiong, C., Ma, C.Y., Socher, R., Davis, L.S.: Adaframe: Adaptive frame selection for fast video recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 1278–1287 (2019)

  58. [58]

    Xiao, Z., Lan, Y., Zhou, Y., Ouyang, W., Yang, S., Zeng, Y., Pan, X.: Worldmem: Long-term consistent world simulation with memory. arXiv preprint arXiv:2504.12369 (2025)

  59. [59]

    VideoGPT: Video Generation using VQ-VAE and Transformers

    Yan, W., Zhang, Y., Abbeel, P., Srinivas, A.: Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:2104.10157 (2021)

  60. [60]

    Yang, S., Walker, J.C., Parker-Holder, J., Du, Y., Bruce, J., Barreto, A., Abbeel, P., Schuurmans, D.: Position: Video as the new language for real-world decision making. In: Proceedings of the 41st International Conference on Machine Learning (2024)

  61. [61]

    Yang, S., Huang, W., Chu, R., Xiao, Y., Zhao, Y., Wang, X., Li, M., Xie, E., Chen, Y., Lu, Y., Chen, S.H.Y.: Longlive: Real-time interactive long video generation (2025)

  62. [62]

    CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

    Yang, Z., Teng, J., Zheng, W., Ding, M., Huang, S., Xu, J., Yang, Y., Hong, W., Zhang, X., Feng, G., et al.: Cogvideox: Text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072 (2024)

  63. [63]

    MiniCPM-V: A GPT-4V Level MLLM on Your Phone

    Yao, Y., Yu, T., Zhang, A., Wang, C., Cui, J., Zhu, H., Cai, T., Li, H., Zhao, W., He, Z., et al.: Minicpm-v: A gpt-4v level mllm on your phone. arXiv preprint arXiv:2408.01800 (2024)

  64. [64]

    Yu, H.X., Duan, H., Herrmann, C., Freeman, W.T., Wu, J.: Wonderworld: Interactive 3d scene generation from a single image. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 5916–5926 (June 2025)

  65. [65]

    Yu, H.X., Duan, H., Hur, J., Sargent, K., Rubinstein, M., Freeman, W.T., Cole, F., Sun, D., Snavely, N., Wu, J., et al.: Wonderjourney: Going from anywhere to everywhere. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 6658–6667 (2024)

  66. [66]

    Context as memory: Scene-consistent interactive long video generation with memory retrieval

    Yu, J., Bai, J., Qin, Y., Liu, Q., Wang, X., Wan, P., Zhang, D., Liu, X.: Context as memory: Scene-consistent interactive long video generation with memory retrieval. arXiv preprint arXiv:2506.03141 (2025)

  67. [67]

    A survey of interactive generative video

    Yu, J., Qin, Y., Che, H., Liu, Q., Wang, X., Wan, P., Zhang, D., Gai, K., Chen, H., Liu, X.: A survey of interactive generative video. arXiv preprint arXiv:2504.21853 (2025)

  68. [68]

    Yu, J., Qin, Y., Che, H., Liu, Q., Wang, X., Wan, P., Zhang, D., Liu, X.: Position: Interactive generative video as next-generation game engine. arXiv preprint arXiv:2503.17359 (2025)

  69. [69]

    Yu, J., Qin, Y., Wang, X., Wan, P., Zhang, D., Liu, X.: Gamefactory: Creating new games with generative interactive videos (2025)

  70. [70]

    ViewCrafter: Taming Video Diffusion Models for High-fidelity Novel View Synthesis

    Yu, W., Xing, J., Yuan, L., Hu, W., Li, X., Huang, Z., Gao, X., Wong, T.T., Shan, Y., Tian, Y.: Viewcrafter: Taming video diffusion models for high-fidelity novel view synthesis. arXiv preprint arXiv:2409.02048 (2024)

  71. [71]

    Packing input frame context in next-frame prediction models for video generation

    Zhang, L., Agrawala, M.: Packing input frame context in next-frame prediction models for video generation. arXiv preprint arXiv:2504.12626 (2025)

  72. [72]

    DINO-WM: World Models on Pre-trained Visual Features enable Zero-shot Planning

    Zhou, G., Pan, H., LeCun, Y., Pinto, L.: Dino-wm: World models on pre-trained visual features enable zero-shot planning. arXiv preprint arXiv:2411.04983 (2024)

  73. [73]

    Learning 3d persistent embodied world models

    Zhou, S., Du, Y., Yang, Y., Han, L., Chen, P., Yeung, D.Y., Gan, C.: Learning 3d persistent embodied world models. arXiv preprint arXiv:2505.05495 (2025)

  74. [74]

    Zhou, Y., Wang, Y., Zhou, J., Chang, W., Guo, H., Li, Z., Ma, K., Li, X., Wang, Y., Zhu, H., Liu, M., Liu, D., Yang, J., Fu, Z., Chen, J., Shen, C., Pang, J., Zhang, K., He, T.: Omniworld: A multi-domain and multi-modal dataset for 4d world modeling (2025), https://arxiv.org/abs/2509.12201

  75. [75]

    Irasim: Learning interactive real-robot action simulators

    Zhu, F., Wu, H., Guo, S., Liu, Y., Cheang, C., Kong, T.: Irasim: Learning interactive real-robot action simulators. arXiv preprint arXiv:2406.14540 (2024)

A Details of Collected Dataset

3D Scenes and Dynamic Objects. We collect 13 diverse 3D scene assets from Fab.com. To minimize the domain gap between rendered data and real-world videos...