MemLearner: Learning to Query Context Memory for Video World Models

Jianhong Bai; Jianxiong Gao; Jiwen Yu; Kaiyi Huang; Kun Gai; Pengfei Wan; Quande Liu; Xihui Liu; Xintao Wang; Yiran Qin

arxiv: 2606.31734 · v1 · pith:7TBGVBXCnew · submitted 2026-06-30 · 💻 cs.CV


Pith reviewed 2026-07-01 05:28 UTC · model grok-4.3

classification 💻 cs.CV

keywords: video world models, context memory, query tokens, scene consistency, occlusion handling, video generation, adaptive context retrieval, multi-dataset training


The pith

Learned query tokens let video world models maintain scene consistency over long sequences with occlusions and motion.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MemLearner to address the memory problem in video world models, where generated scenes become inconsistent after many frames. It replaces rule-based frame retrieval with query tokens that learn to pull relevant past context from stored frames, using the pre-trained generation model itself to do the querying. This reuses the model's existing visual knowledge instead of adding new modules trained from scratch, and it trains on a mix of rendered and real videos that include occlusions and moving objects. A reader would care because reliable long-horizon video prediction matters for interactive simulation and planning tasks.

Core claim

MemLearner is a learning-based adaptive context query method that inserts query tokens to bridge stored context frames and newly predicted tokens. By letting the video generation model itself perform the querying, the approach exploits pre-trained visual priors and avoids training extra modules from scratch. Training follows a multi-dataset strategy that combines rendered long videos carrying camera pose annotations with unannotated real-world footage. Experiments show this yields stronger scene consistency and memory than prior rule-based retrieval, especially when objects are occluded or moving.

What carries the argument

Query tokens inserted to bridge context frames and predicted tokens, allowing the pre-trained model to learn adaptive retrieval of relevant history.
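One way to picture the bridging role of the query tokens is as an attention visibility mask over the three token groups named in the paper's figures (C for context, Q for query, P for predicted). The sketch below is illustrative only: the function name `build_cqp_mask` and the specific visibility rules (P reaches history only through Q) are assumptions made for this example, not the paper's confirmed attention pattern.

```python
def build_cqp_mask(n_c, n_q, n_p):
    """Return mask[i][j] = True if token i may attend to token j.

    Tokens are ordered [C..., Q..., P...]. The rules in `allowed`
    are a hypothetical pattern chosen to illustrate "bridging".
    """
    kinds = ["C"] * n_c + ["Q"] * n_q + ["P"] * n_p
    allowed = {
        "C": {"C"},        # context tokens only see other context tokens
        "Q": {"C", "Q"},   # query tokens read from the stored context
        "P": {"Q", "P"},   # predicted tokens see history only via Q
    }
    return [[kj in allowed[ki] for kj in kinds] for ki in kinds]


# 4 context tokens, 2 query tokens, 3 predicted tokens
mask = build_cqp_mask(n_c=4, n_q=2, n_p=3)
for row in mask:
    print("".join("x" if visible else "." for visible in row))
```

Under these assumed rules the context never has to be visible to the predicted tokens directly, which is what makes it possible to drop C from later layers.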

If this is right

Scene consistency improves in videos with frequent occlusions and dynamic objects.
Training succeeds on mixed datasets of annotated rendered videos and unannotated real videos.
No new modules need training from scratch because the existing generation model supplies the priors.
Efficient training and inference strategies become possible once query tokens replace manual retrieval rules.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same token-based querying idea could transfer to other sequence models that need long-term memory without extra retrieval networks.
Performance on very long horizons beyond the training lengths would test whether the learned queries scale without drift.
Combining the query mechanism with explicit action inputs might further stabilize predictions in interactive settings.
The multi-dataset strategy suggests similar gains are possible in other domains that mix synthetic and real sequential data.

Load-bearing premise

That the pre-trained video generation model already holds enough visual knowledge for its own query tokens to retrieve useful context more effectively than any fixed rule-based method.

What would settle it

Run both MemLearner and a strong rule-based baseline on a held-out set of long videos containing repeated full occlusions of moving objects and measure whether object identities and layouts remain consistent for more than 30 seconds.
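A revisit check of this kind needs a per-frame similarity score. The paper's tables report PSNR and LPIPS; the sketch below shows only PSNR, computed between a frame from the first visit to a viewpoint and the frame generated when the camera returns. The frames here are tiny made-up grayscale grids, and the helper name `psnr` is chosen for this example.

```python
import math


def psnr(frame_a, frame_b, max_value=255.0):
    """Peak signal-to-noise ratio between two equally sized 2D frames."""
    flat_a = [v for row in frame_a for v in row]
    flat_b = [v for row in frame_b for v in row]
    if len(flat_a) != len(flat_b):
        raise ValueError("frames must have the same number of pixels")
    mse = sum((a - b) ** 2 for a, b in zip(flat_a, flat_b)) / len(flat_a)
    if mse == 0:
        return math.inf  # identical frames
    return 10.0 * math.log10(max_value ** 2 / mse)


# Hypothetical 2x2 frames: first visit vs. the generated revisit
first_visit = [[100, 100], [100, 100]]
revisit = [[110, 90], [100, 100]]
score = psnr(first_visit, revisit)
print(f"revisit PSNR: {score:.2f} dB")
```

Higher values mean the revisited view matches the earlier one more closely; comparing this score across methods over long rollouts is one way to run the test described above.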

Figures

Figures reproduced from arXiv: 2606.31734 by Jianhong Bai, Jianxiong Gao, Jiwen Yu, Kaiyi Huang, Kun Gai, Pengfei Wan, Quande Liu, Xihui Liu, Xintao Wang, Yiran Qin.

**Figure 1.** Teaser Demonstration. We propose a novel memory mechanism for video world models through learning-based adaptive context querying. Compared to prior rule-based context retrieval methods [36, 58, 66], our approach handles scenes with occlusions and dynamic objects. This figure highlights dynamic objects, representative occluders, video generation trajectory, and key frames.

**Figure 2.** Architecture clarification. (a) Interaction mechanism among C, Q, and P tokens; (b)&(c) Two designs for context querying, where the alternative design in (b) fails in experiments (Sec. 5.4), and the adopted design in (c) leverages the prior knowledge of the video generation model itself and performs effectively.

**Figure 3.** Model Architecture. Our video generation model adopts a Diffusion Transformer. We concatenate C, Q, P tokens for context-conditioned long video generation. An optional camera encoder supports interactive control.

2.3 Learnable Query Tokens in Various Domains. Learnable query tokens have been widely adopted for compressing visual context across various domains: In multimodal language models, Perceiver Resa…

**Figure 4.** Efficiency Strategies. (a) Strategy #1: Context querying in shallow Query Layers with C, Q, P tokens; deep Generative Layers use only Q, P tokens. (b) Strategy #2: Remove unnecessary attention computation for improved efficiency.

… where θ denotes all learnable parameters. Note that supervision is applied only to the noise predicted on predicted tokens. We concatenate C, Q and P along the frame dimension and …
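The note that supervision is applied only to the noise predicted on predicted tokens can be illustrated with a masked loss. The excerpt does not show the paper's actual objective, so the sketch below assumes a plain mean squared error on noise predictions; the function name `masked_noise_mse` and the toy values are hypothetical.

```python
def masked_noise_mse(pred_noise, true_noise, kinds):
    """Mean squared error over tokens tagged 'P'; C and Q tokens are ignored.

    pred_noise / true_noise: one list of floats per token.
    kinds: one of 'C', 'Q', 'P' per token.
    """
    total, count = 0.0, 0
    for pred, target, kind in zip(pred_noise, true_noise, kinds):
        if kind != "P":
            continue  # no supervision on context or query tokens
        for p, t in zip(pred, target):
            total += (p - t) ** 2
            count += 1
    if count == 0:
        raise ValueError("no predicted tokens to supervise")
    return total / count


kinds = ["C", "Q", "P", "P"]
pred = [[9.0, 9.0], [9.0, 9.0], [1.0, 2.0], [0.0, 0.0]]
true = [[0.0, 0.0], [0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
loss = masked_noise_mse(pred, true, kinds)
print(loss)
```

The large errors placed on the C and Q rows do not affect the result, which is the point of the mask: only the P tokens contribute gradient signal in this assumed setup.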

**Figure 5.** Qualitative Results. MemLearner effectively handles both indoor (a, b) and outdoor (c, …

**Figure 6.** Real-World Qualitative Results. By incorporating real-world videos into training dataset, MemLearner generalizes to real-world scenes.

MemLearner consistently outperforms baselines across all three evaluation settings. We further provide a user study in Appendix C.4, in which MemLearner is preferred over other SOTAs in terms of visual quality and scene consistency. 5.3 Qualitative Results. We present quali…

**Figure 7.** Qualitative Comparison. MemLearner achieves optimal memory and visual quality, demonstrating learning-based adaptive context query effectiveness. Other methods show inconsistent memory performance

**Figure 8.** Attention computation settings for ablation study in Tab. 4. Performance impact of attention scope when Q tokens act as queries

**Figure 9.** Overview of the base text-to-video generation model.

C Supplementary Experimental Results. C.1 Results on Open-Source Video Models. To verify that MemLearner generalizes across different models, we apply our method to Wan 2.1 (T2V-1.3B) [52], an open-source video Diffusion Transformer [43, 44]. Wan 2.1 shares a similar architecture with our internal model: each Transformer block contains a 3D attention modu…

**Figure 10.** Pseudocode for 3D attention computation. Generative Layers process P and Q tokens, while Query Layers process all C, Q, P tokens with different attention patterns.

Adaptive Attention Across Frames and Timesteps. As shown in the left half of …
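The split between Query Layers (all C, Q, P tokens) and Generative Layers (only Q and P) suggests a simple cost estimate. The sketch below counts dense query-key pairs per layer and ignores any masking or kernel-level effects. The token counts are invented, and the 5 + 23 layer split is an assumption inspired by the "(a)*5+(b)*23" setting label in the ablation table, not a confirmed configuration.

```python
def attention_pairs(n_c, n_q, n_p, n_query_layers, n_generative_layers):
    """Count dense query-key pairs summed over all layers.

    Query Layers attend over C + Q + P tokens; Generative Layers
    attend over Q + P tokens only.
    """
    full = (n_c + n_q + n_p) ** 2
    reduced = (n_q + n_p) ** 2
    return n_query_layers * full + n_generative_layers * reduced


# Hypothetical sizes: a long context, few query tokens, one predicted chunk
baseline = attention_pairs(800, 100, 100, n_query_layers=28, n_generative_layers=0)
split = attention_pairs(800, 100, 100, n_query_layers=5, n_generative_layers=23)
print(f"pairs with context in every layer: {baseline}")
print(f"pairs with context in 5 shallow layers: {split}")
print(f"ratio: {split / baseline:.3f}")
```

With these made-up sizes the split schedule needs roughly a fifth of the attention pairs, which shows why keeping the context out of the deep layers is attractive when the stored history is much longer than the chunk being generated.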

**Figure 11.** Attention visualization results.

C.3 Evaluation on Real-World SpatialVID Dataset. To evaluate generalization to real-world scenarios, we conduct quantitative experiments on the SpatialVID dataset [53]. We select SpatialVID for two reasons: (1) it contains long videos with sequences up to 900 frames, and (2) it provides per-frame camera pose annotations, enabling controlled evaluation. We randomly sample …

**Figure 12.** Additional attention computation settings. Horizontal tokens serve as queries; vertical tokens serve as keys/values.

| Setting | GT Comp. PSNR↑ | GT Comp. LPIPS↓ | Revisit Comp. PSNR↑ | Revisit Comp. LPIPS↓ | Speed (fps)↑ |
|---|---|---|---|---|---|
| (a)*5+(b)*23 | 21.23 | 0.2904 | 18.57 | 0.3230 | 0.54 |
| (c)*5+(b)*23 | 21.15 | 0.2976 | 18.43 | 0.3291 | 0.46 |
| (c)*28 | 21.27 | 0.2914 | 18.62 | 0.3212 | 0.24 |
| (d)*28 | 21.34 | 0.2895 | 18.67 | 0.3227 | 0.28 |
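The speed and quality trade-off in these numbers is easier to read as ratios and differences. The short sketch below copies the PSNR and fps values for each setting and compares one setting against another; the helper name `compare` and the dictionary layout are choices made for this example.

```python
# PSNR and speed values copied from the Figure 12 settings table
rows = {
    "(a)*5+(b)*23": {"gt_psnr": 21.23, "revisit_psnr": 18.57, "fps": 0.54},
    "(c)*5+(b)*23": {"gt_psnr": 21.15, "revisit_psnr": 18.43, "fps": 0.46},
    "(c)*28": {"gt_psnr": 21.27, "revisit_psnr": 18.62, "fps": 0.24},
    "(d)*28": {"gt_psnr": 21.34, "revisit_psnr": 18.67, "fps": 0.28},
}


def compare(setting, reference):
    """Speedup and PSNR differences of `setting` relative to `reference`."""
    s, r = rows[setting], rows[reference]
    return {
        "speedup": round(s["fps"] / r["fps"], 2),
        "gt_psnr_delta": round(s["gt_psnr"] - r["gt_psnr"], 2),
        "revisit_psnr_delta": round(s["revisit_psnr"] - r["revisit_psnr"], 2),
    }


result = compare("(a)*5+(b)*23", "(c)*28")
print(result)
```

For this pair the first setting runs at 2.25 times the frame rate of the second while its two PSNR values are lower by 0.04 dB and 0.05 dB.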

**Figure 13.** Additional Dataset Ablation Results. “Rendered” denotes results trained only on rendered videos, while “+Real” indicates results after adding real-world videos to the training set

**Figure 14.** Additional Qualitative Comparison Results

Original abstract

Video World Models are interactive video generation models that predict future world states based on user actions and history video frames. A critical challenge in video world models is the lack of memory, causing inconsistent generated scenes over extended durations. Previous methods explored rule-based context frame retrieval as memory, but they fail to generalize in scenarios with scene occlusions and dynamic objects. We propose MemLearner, a learning-based adaptive context query method using query tokens to bridge context and predicted tokens. By leveraging the video generation model itself for context querying, MemLearner exploits pre-trained visual priors without training additional modules from scratch, and incorporates efficient strategies for training and inference. We collect a dataset of long videos with scene occlusions and dynamic objects, paired with camera pose annotations, and propose a multi-dataset training strategy leveraging both annotated rendered and unannotated real-world videos. Extensive experiments demonstrate that MemLearner significantly outperforms prior video world models in terms of scene consistency and memory, particularly under challenging occlusion and dynamic scenarios.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

MemLearner adds learnable query tokens for adaptive memory in video world models, reusing pre-trained priors on a multi-dataset mix, but the outperformance claim needs the actual numbers and ablations to evaluate.


The paper's main move is replacing rule-based context retrieval with a set of learnable query tokens that let a frozen video generator pull relevant history on the fly. This targets the consistency drop-off in long rollouts, especially with occlusions and moving objects, by training the queries on a combination of pose-annotated rendered videos and unannotated real ones.

What stands out is the decision to keep the base model untouched and train only the query mechanism plus efficient strategies. That avoids the cost of new modules and directly uses whatever visual priors the generator already has. The multi-dataset regime is a reasonable practical choice for getting long sequences with dynamics without full annotation everywhere.

The soft spot is the strength of the empirical case. The abstract states clear wins on scene consistency and memory under hard conditions, but without seeing the specific metrics, baselines, or ablation tables it is difficult to tell how large the gains are or whether they survive stronger controls. If the full experiments show consistent, reproducible improvements with proper controls, the contribution is straightforward and useful. If the margins are small or the comparisons limited, it stays incremental.

This is aimed at people building interactive video predictors for simulation or robotics. The architecture is concrete enough that a referee could check the implementation details and run the numbers. I would send it for review rather than desk reject; the problem is real and the proposed fix is testable.

Referee Report

1 major / 0 minor

Summary. The paper proposes MemLearner, a learning-based method that inserts query tokens into a frozen pre-trained video generator to perform adaptive context retrieval as memory for video world models. It collects a dataset of long videos with occlusions and dynamic objects (with camera poses), uses a multi-dataset training regime on both rendered and real videos, and claims this yields better scene consistency than rule-based retrieval without training new modules from scratch.

Significance. If the empirical claims hold, the approach would demonstrate that query-token mechanisms can exploit pre-trained visual priors for memory in long-horizon video generation, offering a lightweight alternative to rule-based or fully retrained memory modules under occlusion and dynamics.

major comments (1)

[Abstract] The central claim that MemLearner 'significantly outperforms prior video world models in terms of scene consistency and memory' is asserted without any metrics, baselines, ablation results, or experimental details, so the primary empirical contribution cannot be assessed.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their feedback on the abstract. We address the comment below.


Referee: [Abstract] The central claim that MemLearner 'significantly outperforms prior video world models in terms of scene consistency and memory' is asserted without any metrics, baselines, ablation results, or experimental details, so the primary empirical contribution cannot be assessed.

Authors: We agree that the abstract presents the performance claim without supporting quantitative details, which limits immediate assessment of the empirical contribution. The full manuscript contains the requested elements (metrics, baselines, ablations) in the Experiments section. We will revise the abstract to include key quantitative results and baseline comparisons that substantiate the claim.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity


The paper describes an empirical ML architecture (query tokens inserted into a frozen pre-trained video generator) trained on a newly collected multi-dataset regime of long occluded/dynamic videos. No derivation chain, equations, or first-principles result is presented that reduces a claimed prediction to a quantity defined by the method's own fitted parameters. Performance claims rest on experimental comparisons rather than self-referential definitions or load-bearing self-citations. The construction is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

Information is limited to the abstract; no equations or implementation details are available to identify fitted parameters or additional entities.

axioms (2)

Domain assumption: video world models lack memory, causing inconsistent generated scenes over extended durations.
Explicitly stated as the critical challenge in the abstract.
Domain assumption: rule-based context frame retrieval fails to generalize in scenarios with scene occlusions and dynamic objects.
Stated as the limitation of previous methods in the abstract.

invented entity (1)

Query tokens (no independent evidence)
Purpose: to bridge context and predicted tokens for adaptive context querying.
Introduced as the core mechanism of MemLearner in the abstract.

pith-pipeline@v0.9.1-grok · 5730 in / 1243 out tokens · 31421 ms · 2026-07-01T05:28:58.700454+00:00 · methodology


Reference graph

Works this paper leans on

75 extracted references · 46 canonical work pages · 23 internal anchors

[1] ai, S., Teng, H., Jia, H., Sun, L., Li, L., Li, M., Tang, M., Han, S., Zhang, T., Zhang, W.Q., Luo, W., Kang, X., Sun, Y., Cao, Y., Huang, Y., Lin, Y., Fang, Y., Tao, Z., Zhang, Z., Wang, Z., Liu, Z., Shi, D., Su, G., Sun, H., Pan, H., Wang, J., Sheng, J., Cui, M., Hu, M., Yan, M., Yin, S., Zhang, S., Liu, T., Yin, X., Yang, X., Song, X., Hu, X., Zhang, Y...
[2] Alayrac, J.B., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson, Y., Lenc, K., Mensch, A., Millican, K., Reynolds, M., et al.: Flamingo: a visual language model for few-shot learning. Advances in neural information processing systems 35, 23716–23736 (2022)
[3] Arnab, A., Iscen, A., Caron, M., Fathi, A., Schmid, C.: Temporal chain of thought: Long-video understanding by thinking in frames. arXiv preprint arXiv:2507.02001 (2025)
[4] Assran, M., Bardes, A., Fan, D., Garrido, Q., Howes, R., Muckley, M., Rizvi, A., Roberts, C., Sinha, K., Zholus, A., et al.: V-jepa 2: Self-supervised video models enable understanding, prediction and planning. arXiv preprint arXiv:2506.09985 (2025)
[5] Bai, J., Xia, M., Fu, X., Wang, X., Mu, L., Cao, J., Liu, Z., Hu, H., Bai, X., Wan, P., et al.: Recammaster: Camera-controlled generative rendering from a single video. arXiv preprint arXiv:2503.11647 (2025)
[6] Ball, P.J., Bauer, J., Belletti, F., Brownfield, B., Ephrat, A., Fruchter, S., Gupta, A., Holsheimer, K., Holynski, A., Hron, J., Kaplanis, C., Limont, M., McGill, M., Oliveira, Y., Parker-Holder, J., Perbet, F., Scully, G., Shar, J., Spencer, S., Tov, O., Villegas, R., Wang, E., Yung, J., Baetu, C., Berbel, J., Bridson, D., Bruce, J., Buttimore, G., Chak... (2025)
[7] Bao, F., Xiang, C., Yue, G., He, G., Zhu, H., Zheng, K., Zhao, M., Liu, S., Wang, Y., Zhu, J.: Vidu: a highly consistent, dynamic and skilled text-to-video generator with diffusion models. arXiv preprint arXiv:2405.04233 (2024)
[8] Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020)
[9] Buch, S., Nagrani, A., Arnab, A., Schmid, C.: Flexible frame selection for efficient video reasoning. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 29071–29082 (2025)
[10] Cai, S., Yang, C., Zhang, L., Guo, Y., Xiao, J., Yang, Z., Xu, Y., Yang, Z., Yuille, A., Guibas, L., Agrawala, M., Jiang, L., Wetzstein, G.: Mixture of contexts for long video generation. In: arXiv (2025)
[11] Chen, B., Monso, D.M., Du, Y., Simchowitz, M., Tedrake, R., Sitzmann, V.: Diffusion forcing: Next-token prediction meets full-sequence diffusion. arXiv preprint arXiv:2407.01392 (2024)
[12] Chen, T., Hu, X., Ding, Z., Jin, C.: Learning world models for interactive video generation. arXiv preprint arXiv:2505.21996 (2025)
[13] Cui, J., Wu, J., Li, M., Yang, T., Li, X., Wang, R., Bai, A., Ban, Y., Hsieh, C.J.: Self-forcing++: Towards minute-scale high-quality video generation. arXiv preprint arXiv:2510.02283 (2025)
[14] Cui, Y., Chen, H., Deng, H., Huang, X., Li, X., Liu, J., Liu, Y., Luo, Z., Wang, J., Wang, W., Wang, Y., Wang, C., Zhang, F., Zhao, Y., Pan, T., Li, X., Hao, Z., Ma, W., Chen, Z., Ao, Y., Huang, T., Wang, Z., Wang, X.: Emu3.5: Native multimodal models are world learners (2025), https://arxiv.org/abs/2510.26583
[15] Damen, D., Doughty, H., Farinella, G.M., Fidler, S., Furnari, A., Kazakos, E., Moltisanti, D., Munro, J., Perrett, T., Price, W., Wray, M.: Scaling egocentric vision: The epic-kitchens dataset. In: Proceedings of the European Conference on Computer Vision (ECCV) (2018)
[16] Decart, E.: Oasis: A universe in a transformer. https://oasis-model.github.io/ (2024)
[17] DeepMind, G.: Veo 2: Our state-of-the-art video generation model. https://deepmind.google/technologies/veo/veo-2/ (2024)
[18] Deng, H., Pan, T., Diao, H., Luo, Z., Cui, Y., Lu, H., Shan, S., Qi, Y., Wang, X.: Autoregressive video generation without vector quantization. arXiv preprint arXiv:2412.14169 (2024)
[19] Fu, X., Liu, X., Wang, X., Peng, S., Xia, M., Shi, X., Yuan, Z., Wan, P., Zhang, D., Lin, D.: 3dtrajmaster: Mastering 3d trajectory for multi-entity motion in video generation. In: ICLR (2025)
[20] Gu, Y., Mao, W., Shou, M.Z.: Long-context autoregressive video modeling with next-frame prediction. arXiv preprint arXiv:2503.19325 (2025)
[21] Guo, Y., Yang, C., Yang, Z., Ma, Z., Lin, Z., Yang, Z., Lin, D., Jiang, L.: Long context tuning for video generation. arXiv preprint arXiv:2503.10589 (2025)
[22] Ha, D., Schmidhuber, J.: Recurrent world models facilitate policy evolution 31 (2018)
[23] He, H., Xu, Y., Guo, Y., Wetzstein, G., Dai, B., Li, H., Yang, C.: Cameractrl: Enabling camera control for text-to-video generation. arXiv preprint arXiv:2404.02101 (2024)
[24] Ho, J., Salimans, T.: Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598 (2022)
[25] Hong, Y., Mei, Y., Ge, C., Xu, Y., Zhou, Y., Bi, S., Hold-Geoffroy, Y., Roberts, M., Fisher, M., Shechtman, E., et al.: Relic: Interactive video world model with long-horizon memory. arXiv preprint arXiv:2512.04040 (2025)
[26] Huang, X., Li, Z., He, G., Zhou, M., Shechtman, E.: Self forcing: Bridging the train-test gap in autoregressive video diffusion. arXiv preprint arXiv:2506.08009 (2025)
[27] Huang, Z., He, Y., Yu, J., Zhang, F., Si, C., Jiang, Y., Zhang, Y., Wu, T., Jin, Q., Chanpaisit, N., Wang, Y., Chen, X., Wang, L., Lin, D., Qiao, Y., Liu, Z.: Vbench: Comprehensive benchmark suite for video generative models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2024)
[28] Kanervisto, A., Bignell, D., Wen, L.Y., Grayson, M., Georgescu, R., Valcarcel Macua, S., Tan, S.Z., Rashid, T., Pearce, T., Cao, Y., et al.: World and human action models towards gameplay ideation. Nature 638(8051), 656–663 (2025)
[29] Kim, S.W., Zhou, Y., Philion, J., Torralba, A., Fidler, S.: Learning to simulate dynamic environments with gamegan. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 1231–1240 (2020)
[30] Kingma, D.P., Welling, M., et al.: Auto-encoding variational bayes (2013)
[31] Kling: Kling ai: Next-generation ai creative studio. https://app.klingai.com/ (2024)
[32] Kondratyuk, D., Yu, L., Gu, X., Lezama, J., Huang, J., Schindler, G., Hornung, R., Birodkar, V., Yan, J., Chiu, M.C., et al.: Videopoet: A large language model for zero-shot video generation. arXiv preprint arXiv:2312.14125 (2023)
[33] Kong, W., Tian, Q., Zhang, Z., Min, R., Dai, Z., Zhou, J., Xiong, J., Li, X., Wu, B., Zhang, J., et al.: Hunyuanvideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603 (2024)
[34] Labs, W.: Generating worlds. https://www.worldlabs.ai/blog/generating-worlds (2024)
[35] Li, J., Li, D., Savarese, S., Hoi, S.: Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In: International conference on machine learning. pp. 19730–19742. PMLR (2023)
[36] Li, R., Torr, P., Vedaldi, A., Jakab, T.: Vmem: Consistent interactive video scene generation with surfel-indexed view memory. arXiv preprint arXiv:2506.18903 (2025)
[37] Li, T., Tian, Y., Li, H., Deng, M., He, K.: Autoregressive image generation without vector quantization. arXiv preprint arXiv:2406.11838 (2024)
[38] Li, W., Pan, W., Luan, P.C., Gao, Y., Alahi, A.: Stable video infinity: Infinite-length video generation with error recycling. arXiv preprint arXiv:2510.09212 (2025)
[39] Li, Z., Li, C., Mao, X., Lin, S., Li, M., Zhao, S., Xu, Z., Li, X., Feng, Y., Sun, J., Li, Z., Zhang, F., Ai, J., Wang, Z., Wu, Y., He, T., Pang, J., Qiao, Y., Jia, Y., Zhang, K.: Sekai: A video dataset towards world exploration. arXiv preprint arXiv:2506.15675 (2025)
[40] Li, Z., Yu, H.X., Liu, W., Yang, Y., Herrmann, C., Wetzstein, G., Wu, J.: Wonderplay: Dynamic 3d scene generation from a single image and actions. In: Proceedings of the IEEE/CVF international conference on computer vision (2025)
[41] Liu, K., Hu, W., Xu, J., Shan, Y., Lu, S.: Rolling forcing: Autoregressive long video diffusion in real time. arXiv preprint arXiv:2509.25161 (2025)
[42] Ma, B., Gao, H., Deng, H., Luo, Z., Huang, T., Tang, L., Wang, X.: You see it, you got it: Learning 3d creation on pose-free videos at scale. arXiv preprint arXiv:2412.06699 (2024)
[43] OpenAI: Creating video from text. https://openai.com/index/sora/ (2024)
[44] Peebles, W., Xie, S.: Scalable diffusion models with transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (2023)
[45] Po, R., Nitzan, Y., Zhang, R., Chen, B., Dao, T., Shechtman, E., Wetzstein, G., Huang, X.: Long-context state-space video world models (2025), https://arxiv.org/abs/2505.20171
[46] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International conference on machine learning (2021)
[47] Ren, X., Shen, T., Huang, J., Ling, H., Lu, Y., Nimier-David, M., Müller, T., Keller, A., Fidler, S., Gao, J.: Gen3c: 3d-informed world-consistent video generation with precise camera control. arXiv preprint arXiv:2503.03751 (2025)
[48] Runway: Runway: Tools for human imagination. https://runwayml.com/ (2024)
[49] Song, K., Chen, B., Simchowitz, M., Du, Y., Tedrake, R., Sitzmann, V.: History-guided video diffusion. arXiv preprint arXiv:2502.06764 (2025)
[50] Sun, W., Zhang, H., Wang, H., Wu, J., Wang, Z., Wang, Z., Wang, Y., Zhang, J., Wang, T., Guo, C.: Worldplay: Towards long-term geometric consistency for real-time interactive world modeling. arXiv preprint arXiv:2512.14614 (2025)
[51] Valevski, D., Leviathan, Y., Arar, M., Fruchter, S.: Diffusion models are real-time game engines. arXiv preprint arXiv:2408.14837 (2024)
[52] Wang, A., Ai, B., Wen, B., Mao, C., Xie, C.W., Chen, D., Yu, F., Zhao, H., Yang, J., Zeng, J., et al.: Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314 (2025)
[53] Wang, J., Yuan, Y., Zheng, R., Lin, Y., Gao, J., Chen, L.Z., Bao, Y., Zhang, Y., Zeng, C., Zhou, Y., Long, X., Zhu, H., Zhang, Z., Cao, X., Yao, Y.: Spatialvid: A large-scale video dataset with spatial annotations (2025), https://arxiv.org/abs/2509.09676
[54] Wang, X., Zhang, X., Luo, Z., Sun, Q., Cui, Y., Wang, J., Zhang, F., Wang, Y., Li, Z., Yu, Q., et al.: Emu3: Next-token prediction is all you need. arXiv preprint arXiv:2409.18869 (2024)
[55] Wang, Z., Yuan, Z., Wang, X., Li, Y., Chen, T., Xia, M., Luo, P., Shan, Y.: Motionctrl: A unified and flexible motion controller for video generation. In: ACM SIGGRAPH 2024 Conference Papers (2024)
[56] Wu, T., Yang, S., Po, R., Xu, Y., Liu, Z., Lin, D., Wetzstein, G.: Video world models with long-term spatial memory (2025), https://arxiv.org/abs/2506.05284
[57] Wu, Z., Xiong, C., Ma, C.Y., Socher, R., Davis, L.S.: Adaframe: Adaptive frame selection for fast video recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 1278–1287 (2019)
[58]

arXiv preprint arXiv:2504.12369 (2025)

Xiao, Z., Lan, Y., Zhou, Y., Ouyang, W., Yang, S., Zeng, Y., Pan, X.: Worldmem: Long-term consistent world simulation with memory. arXiv preprint arXiv:2504.12369 (2025)

work page arXiv 2025
[59]

VideoGPT: Video Generation using VQ-VAE and Transformers

Yan, W., Zhang, Y., Abbeel, P., Srinivas, A.: Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:2104.10157 (2021)

work page internal anchor Pith review Pith/arXiv arXiv 2021
[60]

Yang, S., Walker, J.C., Parker-Holder, J., Du, Y., Bruce, J., Barreto, A., Abbeel, P., Schuurmans, D.: Position: Video as the new language for real-world decision making. In: Proceedings of the 41st International Conference on Machine Learning (2024)

[61]

Yang, S., Huang, W., Chu, R., Xiao, Y., Zhao, Y., Wang, X., Li, M., Xie, E., Chen, Y., Lu, Y., Chen, S.H.Y.: Longlive: Real-time interactive long video generation (2025)

[62]

Yang, Z., Teng, J., Zheng, W., Ding, M., Huang, S., Xu, J., Yang, Y., Hong, W., Zhang, X., Feng, G., et al.: Cogvideox: Text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072 (2024)

[63]

Yao, Y., Yu, T., Zhang, A., Wang, C., Cui, J., Zhu, H., Cai, T., Li, H., Zhao, W., He, Z., et al.: Minicpm-v: A gpt-4v level mllm on your phone. arXiv preprint arXiv:2408.01800 (2024)

[64]

Yu, H.X., Duan, H., Herrmann, C., Freeman, W.T., Wu, J.: Wonderworld: Interactive 3d scene generation from a single image. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 5916–5926 (June 2025)

[65]

Yu, H.X., Duan, H., Hur, J., Sargent, K., Rubinstein, M., Freeman, W.T., Cole, F., Sun, D., Snavely, N., Wu, J., et al.: Wonderjourney: Going from anywhere to everywhere. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 6658–6667 (2024)

[66]

Yu, J., Bai, J., Qin, Y., Liu, Q., Wang, X., Wan, P., Zhang, D., Liu, X.: Context as memory: Scene-consistent interactive long video generation with memory retrieval. arXiv preprint arXiv:2506.03141 (2025)

[67]

Yu, J., Qin, Y., Che, H., Liu, Q., Wang, X., Wan, P., Zhang, D., Gai, K., Chen, H., Liu, X.: A survey of interactive generative video. arXiv preprint arXiv:2504.21853 (2025)

[68]

Yu, J., Qin, Y., Che, H., Liu, Q., Wang, X., Wan, P., Zhang, D., Liu, X.: Position: Interactive generative video as next-generation game engine. arXiv preprint arXiv:2503.17359 (2025)

[69]

Yu, J., Qin, Y., Wang, X., Wan, P., Zhang, D., Liu, X.: Gamefactory: Creating new games with generative interactive videos (2025)

[70]

Yu, W., Xing, J., Yuan, L., Hu, W., Li, X., Huang, Z., Gao, X., Wong, T.T., Shan, Y., Tian, Y.: Viewcrafter: Taming video diffusion models for high-fidelity novel view synthesis. arXiv preprint arXiv:2409.02048 (2024)

[71]

Zhang, L., Agrawala, M.: Packing input frame context in next-frame prediction models for video generation. arXiv preprint arXiv:2504.12626 (2025)

[72]

Zhou, G., Pan, H., LeCun, Y., Pinto, L.: Dino-wm: World models on pre-trained visual features enable zero-shot planning. arXiv preprint arXiv:2411.04983 (2024)

[73]

Zhou, S., Du, Y., Yang, Y., Han, L., Chen, P., Yeung, D.Y., Gan, C.: Learning 3d persistent embodied world models. arXiv preprint arXiv:2505.05495 (2025)

[74]

Zhou, Y., Wang, Y., Zhou, J., Chang, W., Guo, H., Li, Z., Ma, K., Li, X., Wang, Y., Zhu, H., Liu, M., Liu, D., Yang, J., Fu, Z., Chen, J., Shen, C., Pang, J., Zhang, K., He, T.: Omniworld: A multi-domain and multi-modal dataset for 4d world modeling (2025), https://arxiv.org/abs/2509.12201

[75]

Zhu, F., Wu, H., Guo, S., Liu, Y., Cheang, C., Kong, T.: Irasim: Learning interactive real-robot action simulators. arXiv preprint arXiv:2406.14540 (2024)

A Details of Collected Dataset

3D Scenes and Dynamic Objects. We collect 13 diverse 3D scene assets from Fab.com (footnote 5). To minimize the domain gap between rendered data and real-world videos...

