Recognition: 2 theorem links
Matrix-Game 3.0: Real-Time and Streaming Interactive World Model with Long-Horizon Memory
Pith reviewed 2026-05-10 18:20 UTC · model grok-4.3
The pith
Matrix-Game 3.0 generates 720p interactive video in real time at 40 frames per second while holding memory consistency over minute-long sequences.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By combining an upgraded infinite data engine, residual modeling with re-injection of imperfect frames for self-correction, camera-aware memory retrieval and injection, and DMD-based multi-segment autoregressive distillation with quantization and pruning, Matrix-Game 3.0 achieves up to 40 FPS real-time 720p generation using a 5B model and maintains stable memory consistency over minute-long sequences.
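As a back-of-envelope check on the headline numbers (the rate and resolution come from the abstract; the constant-rate assumption is ours):

```python
# Budget implied by the paper's headline claim: 40 FPS at 720p,
# sustained over minute-long sequences (constant rate assumed).
FPS = 40
WIDTH, HEIGHT = 1280, 720  # 720p

frame_budget_ms = 1000 / FPS           # latency budget per generated frame
minute_frames = 60 * FPS               # frames a minute-long rollout must stay consistent over
pixels_per_second = WIDTH * HEIGHT * FPS

print(f"per-frame budget: {frame_budget_ms:.1f} ms")            # 25.0 ms
print(f"frames per minute-long rollout: {minute_frames}")       # 2400
print(f"pixel throughput: {pixels_per_second / 1e6:.1f} M px/s")  # 36.9 M px/s
```

The point of the arithmetic: "real time" here means the full pipeline (memory retrieval, denoising, VAE decode) must fit in a 25 ms per-frame budget, and "memory consistency over minute-long sequences" means roughly 2,400 autoregressive frames without drift.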
What carries the argument
Camera-aware memory retrieval and injection combined with residual prediction modeling that re-injects generated frames during training to enable self-correction.
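A minimal sketch of the re-injection idea, built from the error-buffer equations quoted in the Lean-theorem section below (δ = x̂ᵢ − xᵢ, x̃ᵢ = xᵢ + γδ); the function and variable names are ours, not the paper's:

```python
import numpy as np

def reinject(clean_frames: np.ndarray, generated_frames: np.ndarray, gamma: float) -> np.ndarray:
    """Blend the model's own (imperfect) outputs back into the training input.

    delta = x_hat - x is the prediction residual; the conditioning frames become
    x_tilde = x + gamma * delta, so the model trains on inputs carrying its own
    errors and must learn to correct them.
    """
    delta = generated_frames - clean_frames
    return clean_frames + gamma * delta

# Toy usage: gamma = 0 recovers clean teacher forcing, gamma = 1 conditions
# fully on the model's own outputs.
x = np.zeros((4, 8, 8, 3))               # 4 clean conditioning frames
x_hat = x + 0.2                          # model outputs with a constant error
x_tilde = reinject(x, x_hat, gamma=0.5)
print(x_tilde.max())                     # 0.1: half the residual is re-injected
```

On this reading, γ interpolates between the train-time distribution (clean frames) and the test-time distribution (self-generated frames), which is exactly the train-test gap the self-correction objective targets.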
If this is right
- Interactive applications can sustain long-form video generation at real-time speeds without resets or loss of consistency.
- Larger models trained with the same residual and memory methods show improved dynamics and generalization.
- The approach supplies a direct route to deployable industrial-scale world models for simulation and gaming.
- Real-time high-resolution output becomes practical for streaming interactive scenarios.
Where Pith is reading between the lines
- The residual re-injection method may transfer to other video diffusion models to lengthen their reliable generation horizon without extra supervision.
- Camera-aware memory retrieval indicates that explicit viewpoint conditioning is key to preventing drift in 3D-consistent world models.
- If self-correction from noisy self-generated data works reliably, training loops could increasingly rely on the model's own outputs rather than only clean ground truth.
Load-bearing premise
Re-injecting imperfect generated frames during training plus camera-aware memory retrieval will produce long-horizon spatiotemporal consistency without visible drift or compounding errors once the model leaves the training distribution.
What would settle it
Generating minute-long interactive sequences on novel out-of-distribution actions or environments and measuring whether object positions, camera trajectories, and visual details remain consistent without accumulated artifacts or drift.
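One concrete form such a test could take (a hypothetical harness, not the paper's evaluation code): track the predicted camera trajectory against a reference over a minute-long rollout and report where drift first exceeds a tolerance.

```python
import numpy as np

def cumulative_drift(pred_poses: np.ndarray, ref_poses: np.ndarray) -> np.ndarray:
    """Per-frame camera-position error for an (N, 3) rollout vs. reference."""
    return np.linalg.norm(pred_poses - ref_poses, axis=1)

def first_drift_frame(pred_poses: np.ndarray, ref_poses: np.ndarray, tol: float) -> int:
    """Index of the first frame whose error exceeds tol, or -1 if none does."""
    err = cumulative_drift(pred_poses, ref_poses)
    over = np.nonzero(err > tol)[0]
    return int(over[0]) if over.size else -1

# Toy rollout: error grows by 1e-3 per frame over 2400 frames (one minute at 40 FPS).
ref = np.zeros((2400, 3))
pred = np.stack([np.arange(2400) * 1e-3, np.zeros(2400), np.zeros(2400)], axis=1)
print(first_drift_frame(pred, ref, tol=1.0))  # 1001: first frame beyond tolerance
```

The same skeleton extends to object positions or feature embeddings; the decisive comparison is this curve on out-of-distribution actions versus in-distribution ones.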
Original abstract
With the advancement of interactive video generation, diffusion models have increasingly demonstrated their potential as world models. However, existing approaches still struggle to simultaneously achieve memory-enabled long-term temporal consistency and high-resolution real-time generation, limiting their applicability in real-world scenarios. To address this, we present Matrix-Game 3.0, a memory-augmented interactive world model designed for 720p real-time longform video generation. Building upon Matrix-Game 2.0, we introduce systematic improvements across data, model, and inference. First, we develop an upgraded industrial-scale infinite data engine that integrates Unreal Engine-based synthetic data, large-scale automated collection from AAA games, and real-world video augmentation to produce high-quality Video-Pose-Action-Prompt quadruplet data at scale. Second, we propose a training framework for long-horizon consistency: by modeling prediction residuals and re-injecting imperfect generated frames during training, the base model learns self-correction; meanwhile, camera-aware memory retrieval and injection enable the base model to achieve long horizon spatiotemporal consistency. Third, we design a multi-segment autoregressive distillation strategy based on Distribution Matching Distillation (DMD), combined with model quantization and VAE decoder pruning, to achieve efficient real-time inference. Experimental results show that Matrix-Game 3.0 achieves up to 40 FPS real-time generation at 720p resolution with a 5B model, while maintaining stable memory consistency over minute-long sequences. Scaling up to a 2x14B model further improves generation quality, dynamics, and generalization. Our approach provides a practical pathway toward industrial-scale deployable world models.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents Matrix-Game 3.0, a memory-augmented interactive world model extending Matrix-Game 2.0 for 720p real-time long-form video generation. It introduces an industrial-scale data engine generating Video-Pose-Action-Prompt quadruplets from synthetic Unreal Engine data, AAA game collection, and real-world augmentation; a training framework using residual prediction with imperfect-frame re-injection plus camera-aware memory retrieval/injection for long-horizon spatiotemporal consistency; and multi-segment DMD distillation combined with quantization and VAE pruning for efficient inference. The central claim is that the 5B model achieves up to 40 FPS real-time generation while maintaining stable memory consistency over minute-long sequences, with further gains from scaling to a 2x14B model.
Significance. If independently verified, the combination of real-time high-resolution inference with demonstrated minute-scale consistency would constitute a practical advance for deployable world models in interactive applications. The explicit engineering of self-correction via residual modeling and camera-aware retrieval, together with the large-scale quadruplet data pipeline, supplies a concrete recipe that could be adopted or extended by others working on streaming video generation.
Major comments (3)
- [Abstract and §4] Abstract and §4 (Experimental results): The headline performance figures (40 FPS at 720p with the 5B model and stable consistency over minute-long sequences) are reported without any quantitative baseline comparisons, error bars, ablation tables, or failure-case analysis. This absence makes it impossible to isolate the contribution of the residual-prediction objective, the camera-aware memory module, or the DMD distillation from the overall pipeline.
- [§3.2] §3.2 (Training framework for long-horizon consistency): The claim that re-injecting imperfect generated frames produces a robust self-correction attractor rests on the assumption that the residual objective generalizes outside the synthetic/game quadruplet distribution. No experiments or analysis are provided that test for compounding spatiotemporal drift when camera poses or scene dynamics deviate from the training data, which directly bears on the minute-scale consistency claim.
- [§3.1 and §4] §3.1 (Data engine) and §4: The Video-Pose-Action-Prompt quadruplet engine is described as the foundation for both training and evaluation, yet no quantitative metrics (e.g., diversity statistics, pose-estimation accuracy, or distribution-shift measures) are supplied to show how it differs from prior game or synthetic datasets, nor are any cross-dataset generalization results reported.
Minor comments (2)
- [Abstract] The notation “2x14B model” is ambiguous; clarify whether this denotes an ensemble, a mixture-of-experts architecture, or simply two independent 14B models.
- [Figures] Figure captions and the inference pipeline diagram should explicitly label the memory retrieval/injection points and the DMD segment boundaries to improve readability.
Simulated Author's Rebuttal
We thank the referee for the thorough and constructive review. The comments identify key areas where additional evidence would strengthen the manuscript. We address each major comment below and will incorporate the suggested revisions in the next version.
Point-by-point responses
-
Referee: [Abstract and §4] Abstract and §4 (Experimental results): The headline performance figures (40 FPS at 720p with the 5B model and stable consistency over minute-long sequences) are reported without any quantitative baseline comparisons, error bars, ablation tables, or failure-case analysis. This absence makes it impossible to isolate the contribution of the residual-prediction objective, the camera-aware memory module, or the DMD distillation from the overall pipeline.
Authors: We agree that the absence of direct quantitative baselines and component ablations limits the ability to isolate contributions. In the revised manuscript we will add a dedicated comparison table against Matrix-Game 2.0 and other published real-time video generation methods, reporting both speed and long-horizon consistency metrics. We will also include ablation tables that separately disable residual self-correction, camera-aware memory retrieval, and the multi-segment DMD stage, together with error bars computed over multiple evaluation seeds. A short failure-case analysis with representative drift examples will be added to the experimental section. revision: yes
-
Referee: [§3.2] §3.2 (Training framework for long-horizon consistency): The claim that re-injecting imperfect generated frames produces a robust self-correction attractor rests on the assumption that the residual objective generalizes outside the synthetic/game quadruplet distribution. No experiments or analysis are provided that test for compounding spatiotemporal drift when camera poses or scene dynamics deviate from the training data, which directly bears on the minute-scale consistency claim.
Authors: The residual objective is trained on imperfect frames produced by the model itself within the quadruplet distribution, which already contains substantial variation in pose and dynamics. Nevertheless, we acknowledge the lack of explicit out-of-distribution tests. In the revision we will add controlled experiments that perturb camera trajectories and introduce scene elements outside the training distribution, then measure spatiotemporal drift over minute-scale rollouts. These results will be reported alongside the existing consistency metrics to directly address generalization of the self-correction mechanism. revision: yes
-
Referee: [§3.1 and §4] §3.1 (Data engine) and §4: The Video-Pose-Action-Prompt quadruplet engine is described as the foundation for both training and evaluation, yet no quantitative metrics (e.g., diversity statistics, pose-estimation accuracy, or distribution-shift measures) are supplied to show how it differs from prior game or synthetic datasets, nor are any cross-dataset generalization results reported.
Authors: We agree that quantitative characterization of the data pipeline would help readers assess its novelty and coverage. In the revised manuscript we will report diversity statistics (scene category coverage, action entropy, camera trajectory variance), pose-estimation accuracy on a held-out validation set, and distribution-shift metrics (e.g., Fréchet video distance) relative to prior game and synthetic datasets. We will also include cross-dataset generalization results by evaluating the trained model on external real-world video sequences without additional fine-tuning. revision: yes
Circularity Check
No significant circularity in claimed results or methods
Full rationale
The manuscript describes an empirical engineering pipeline (data engine, residual modeling with imperfect-frame re-injection, camera-aware retrieval, and DMD-based distillation) and reports measured outcomes (40 FPS at 720p, minute-scale consistency) from running that pipeline on its own data and models. No mathematical derivation, equation, or theorem is presented that reduces by construction to its own inputs; no load-bearing self-citations or uniqueness theorems are invoked; performance figures are direct experimental measurements rather than independent predictions. The work is therefore self-contained as a system report.
Axiom & Free-Parameter Ledger
free parameters (2)
- 5B and 2x14B model sizes
- DMD distillation segments and quantization bits
axioms (2)
- domain assumption: Re-injecting imperfect frames during training teaches reliable self-correction
- domain assumption: Camera-aware memory retrieval preserves spatiotemporal consistency over minutes
invented entities (2)
- Video-Pose-Action-Prompt quadruplet data engine · no independent evidence
- camera-aware memory retrieval and injection module · no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · relevance unclear · "by modeling prediction residuals and re-injecting imperfect generated frames during training, the base model learns self-correction; meanwhile, camera-aware memory retrieval and injection enable the base model to achieve long horizon spatiotemporal consistency"
- IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · embed_injective · relevance unclear · "Error Buffer ... δ = x̂ᵢ − xᵢ ... x̃ᵢ = xᵢ + γδ"
Forward citations
Cited by 3 Pith papers
-
HorizonDrive: Self-Corrective Autoregressive World Model for Long-horizon Driving Simulation
HorizonDrive enables stable long-horizon autoregressive driving simulation via anti-drifting teacher training with scheduled rollout recovery and teacher rollout distillation.
-
Towards Generalist Game Players: An Investigation of Foundation Models in the Game Multiverse
The paper organizes research on generalist game AI into Dataset, Model, Harness, and Benchmark pillars and charts a five-level progression from single-game mastery to agents that create and live inside game multiverses.
-
Towards Generalist Game Players: An Investigation of Foundation Models in the Game Multiverse
This work traces four eras of generalist game players across dataset, model, harness, and benchmark pillars and charts a five-level roadmap ending in agents that create and evolve within game multiverses.
Reference graph
Works this paper leans on
-
[1]
V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning
Mido Assran, Adrien Bardes, David Fan, Quentin Garrido, Russell Howes, Matthew Muckley, Ammar Rizvi, Claire Roberts, Koustuv Sinha, Artem Zholus, et al. V-jepa 2: Self-supervised video models enable understanding, prediction and planning. arXiv preprint arXiv:2506.09985, 2025
-
[2]
Philip J. Ball, Jakob Bauer, Frank Belletti, Bethanie Brownfield, Ariel Ephrat, Shlomi Fruchter, Agrim Gupta, Kristian Holsheimer, Aleksander Holynski, Jiri Hron, Christos Kaplanis, Marjorie Limont, Matt McGill, Yanko Oliveira, Jack Parker-Holder, Frank Perbet, Guy Scully, Jeremy Shar, Stephen Spencer, Omer Tov, Ruben Villegas, Emma Wang, Jessica Yung, Ci...
2025
-
[3]
Genie: Generative interactive environments
Jake Bruce, Michael D Dennis, Ashley Edwards, Jack Parker-Holder, Yuge Shi, Edward Hughes, Matthew Lai, Aditi Mavalankar, Richie Steigerwald, Chris Apps, et al. Genie: Generative interactive environments. InInternational Conference on Machine Learning, 2024
2024
-
[4]
Mixture of contexts for long video generation
Shengqu Cai, Ceyuan Yang, Lvmin Zhang, Yuwei Guo, Junfei Xiao, Ziyan Yang, Yinghao Xu, Zhenheng Yang, Alan L. Yuille, Leonidas J. Guibas, Maneesh Agrawala, Lu Jiang, and Gordon Wetzstein. Mixture of contexts for long video generation. InInternational Conference on Learning Representations (ICLR), 2026
2026
-
[5]
Diffusion forcing: Next-token prediction meets full-sequence diffusion
Boyuan Chen, Diego Martí Monsó, Yilun Du, Max Simchowitz, Russ Tedrake, and Vincent Sitzmann. Diffusion forcing: Next-token prediction meets full-sequence diffusion.Advances in Neural Information Processing Systems, 37:24081–24125, 2024
2024
-
[6]
Deepverse: 4d autoregressive video generation as a world model
Junyi Chen, Haoyi Zhu, Xianglong He, Yifan Wang, Jianjun Zhou, Wenzheng Chang, Yang Zhou, Zizun Li, Zhoujie Fu, Jiangmiao Pang, et al. Deepverse: 4d autoregressive video generation as a world model. arXiv preprint arXiv:2506.01103, 2025
-
[7]
Lightx2v: Light video generation inference framework, 2025
LightX2V Contributors. Lightx2v: Light video generation inference framework, 2025. GitHub repository
2025
-
[8]
Self-forcing++: Towards minute-scale high-quality video generation
Justin Cui, Jie Wu, Ming Li, Tao Yang, Xiaojie Li, Rui Wang, Andrew Bai, Yuanhao Ban, and Cho-Jui Hsieh. Self-forcing++: Towards minute-scale high-quality video generation. arXiv preprint arXiv:2510.02283, 2025
-
[9]
Lol: Longer than longer, scaling video generation to hour, 2026
Justin Cui, Jie Wu, Ming Li, Tao Yang, Xiaojie Li, Rui Wang, Andrew Bai, Yuanhao Ban, and Cho-Jui Hsieh. Lol: Longer than longer, scaling video generation to hour, 2026
2026
-
[10]
Oasis: A universe in a transformer
Decart. Oasis: A universe in a transformer. 2024
2024
-
[11]
The matrix: Infinite-horizon world generation with real-time moving control
Ruili Feng, Han Zhang, Zhantao Yang, Jie Xiao, Zhilei Shu, Zhiheng Liu, Andy Zheng, Yukun Huang, Yu Liu, and Hongyang Zhang. The matrix: Infinite-horizon world generation with real-time moving control. arXiv preprint arXiv:2412.03568, 2024
-
[12]
Dreamdojo: A generalist robot world model from large-scale human videos
Shenyuan Gao, William Liang, Kaiyuan Zheng, Ayaan Malik, Seonghyeon Ye, Sihyun Yu, Wei-Cheng Tseng, Yuzhu Dong, Kaichun Mo, Chen-Hsuan Lin, et al. Dreamdojo: A generalist robot world model from large-scale human videos.arXiv preprint arXiv:2602.06949, 2026
-
[13]
LTX-2: Efficient Joint Audio-Visual Foundation Model
Yoav HaCohen, Benny Brazowski, Nisan Chiprut, Yaki Bitterman, Andrew Kvochko, Avishai Berkowitz, Daniel Shalem, Daphna Lifschitz, Dudu Moshe, Eitan Porat, et al. Ltx-2: Efficient joint audio-visual foundation model.arXiv preprint arXiv:2601.03233, 2026
-
[14]
Training agents inside of scalable world models
Danijar Hafner, Wilson Yan, and Timothy Lillicrap. Training agents inside of scalable world models.arXiv preprint arXiv:2509.24527, 2025
-
[15]
Matrix-game 2.0: An open-source real-time and streaming interactive world model
Xianglong He, Chunli Peng, Zexiang Liu, Boyang Wang, Yifan Zhang, Qi Cui, Fei Kang, Biao Jiang, Mengyin An, Yangyang Ren, et al. Matrix-game 2.0: An open-source real-time and streaming interactive world model.arXiv preprint arXiv:2508.13009, 2025
-
[16]
Relic: Interactive video world model with long-horizon memory
Yicong Hong, Yiqun Mei, Chongjian Ge, Yiran Xu, Yang Zhou, Sai Bi, Yannick Hold-Geoffroy, Mike Roberts, Matthew Fisher, Eli Shechtman, et al. Relic: Interactive video world model with long-horizon memory.arXiv preprint arXiv:2512.04040, 2025
-
[17]
AstraNav-World: World Model for Foresight Control and Consistency
Junjun Hu, Jintao Chen, Haochen Bai, Minghua Luo, Shichao Xie, Ziyi Chen, Fei Liu, Zedong Chu, Xinda Xue, Botao Ren, et al. Astranav-world: World model for foresight control and consistency.arXiv preprint arXiv:2512.21714, 2025
-
[18]
Vipe: Video pose engine for 3d geometric perception
Jiahui Huang, Qunjie Zhou, Hesam Rabeti, Aleksandr Korovko, Huan Ling, Xuanchi Ren, Tianchang Shen, Jun Gao, Dmitry Slepichev, Chen-Hsuan Lin, et al. Vipe: Video pose engine for 3d geometric perception. arXiv preprint arXiv:2508.10934, 2025
-
[19]
Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion
Xun Huang, Zhengqi Li, Guande He, Mingyuan Zhou, and Eli Shechtman. Self forcing: Bridging the train-test gap in autoregressive video diffusion. arXiv preprint arXiv:2506.08009, 2025
-
[20]
Cosmos Policy: Fine-Tuning Video Models for Visuomotor Control and Planning
Moo Jin Kim, Yihuai Gao, Tsung-Yi Lin, Yen-Chen Lin, Yunhao Ge, Grace Lam, Percy Liang, Shuran Song, Ming-Yu Liu, Chelsea Finn, et al. Cosmos policy: Fine-tuning video models for visuomotor control and planning.arXiv preprint arXiv:2601.16163, 2026
-
[21]
HunyuanVideo: A Systematic Framework For Large Video Generative Models
Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. Hunyuanvideo: A systematic framework for large video generative models.arXiv preprint arXiv:2412.03603, 2024
-
[22]
Marble
World Labs. Marble. https://www.worldlabs.ai/blog/marble-world-model, 2025. Accessed: 2026-03-27
2025
-
[23]
Vmem: Consistent interactive video scene generation with surfel-indexed view memory
Runjia Li, Philip Torr, Andrea Vedaldi, and Tomas Jakab. Vmem: Consistent interactive video scene generation with surfel-indexed view memory. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 25690–25699, 2025
2025
-
[24]
Stable video infinity: Infinite-length video generation with error recycling
Wuyang Li, Wentao Pan, Po-Chien Luan, Yang Gao, and Alexandre Alahi. Stable video infinity: Infinite-length video generation with error recycling. arXiv preprint arXiv:2510.09212, 2025
-
[25]
Dl3dv-10k: A large-scale scene dataset for deep learning-based 3d vision
Lu Ling, Yichen Sheng, Zhi Tu, Wentian Zhao, Cheng Xin, Kun Wan, Lantao Yu, Qianyu Guo, Zixun Yu, Yawen Lu, et al. Dl3dv-10k: A large-scale scene dataset for deep learning-based 3d vision. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22160–22169, 2024
2024
-
[26]
Dl3dv-10k: A large-scale scene dataset for deep learning-based 3d vision, 2023
Lu Ling, Yichen Sheng, Zhi Tu, Wentian Zhao, Cheng Xin, Kun Wan, Lantao Yu, Qianyu Guo, Zixun Yu, Yawen Lu, Xuanmao Li, Xingpeng Sun, Rohan Ashok, Aniruddha Mukherjee, Hao Kang, Xiangrui Kong, Gang Hua, Tianyi Zhang, Bedrich Benes, and Aniket Bera. Dl3dv-10k: A large-scale scene dataset for deep learning-based 3d vision, 2023
2023
-
[27]
Yume-1.5: A text-controlled interactive world generation model
Xiaofeng Mao, Zhen Li, Chuanhao Li, Xiaojie Xu, Kaining Ying, Tong He, Jiangmiao Pang, Yu Qiao, and Kaipeng Zhang. Yume-1.5: A text-controlled interactive world generation model.arXiv preprint arXiv:2512.22096, 2025
-
[28]
Video generation models in robotics: applications, research challenges, future directions
Zhiting Mei, Tenny Yin, Ola Shorinwa, Apurva Badithela, Zhonghe Zheng, Joseph Bruno, Madison Bland, Lihan Zha, Asher Hancock, Jaime Fernández Fisac, et al. Video generation models in robotics-applications, research challenges, future directions.arXiv preprint arXiv:2601.07823, 2026
-
[29]
Sora: Video generation models as world simulators
OpenAI. Sora: Video generation models as world simulators. https://openai.com/index/ video-generation-models-as-world-simulators/, 2024
2024
-
[30]
Genie 2: A large-scale foundation world model
J Parker-Holder, P Ball, J Bruce, V Dasagi, K Holsheimer, C Kaplanis, A Moufarek, G Scully, J Shar, J Shi, et al. Genie 2: A large-scale foundation world model. URL: https://deepmind.google/discover/blog/genie-2-a-large-scale-foundation-world-model, 2024
2024
-
[31]
Scalable diffusion models with transformers
William Peebles and Saining Xie. Scalable diffusion models with transformers. InInternational Conference on Computer Vision, pages 4195–4205, 2023
2023
-
[32]
Worldplay: Towards long-term geometric consistency for real-time interactive world modeling
Wenqiang Sun, Haiyu Zhang, Haoyuan Wang, Junta Wu, Zehan Wang, Zhenwei Wang, Yunhong Wang, Jun Zhang, Tengfei Wang, and Chunchao Guo. Worldplay: Towards long-term geometric consistency for real-time interactive world modeling.arXiv preprint arXiv:2512.14614, 2025
-
[34]
Hunyuan-gamecraft-2: Instruction-following interactive game world model
Junshu Tang, Jiacheng Liu, Jiaqi Li, Longhuang Wu, Haoyu Yang, Penghao Zhao, Siruis Gong, Xiang Yuan, Shuai Shao, Linfeng Zhang, et al. Hunyuan-gamecraft-2: Instruction-following interactive game world model.arXiv preprint arXiv:2511.23429, 2025
-
[35]
Hunyuanworld 1.0: Generating immersive, explorable, and interactive 3d worlds from words or pixels
HunyuanWorld Team, Zhenwei Wang, Yuhao Liu, Junta Wu, Zixiao Gu, Haoyuan Wang, Xuhui Zuo, Tianyu Huang, Wenhuan Li, Sheng Zhang, et al. Hunyuanworld 1.0: Generating immersive, explorable, and interactive 3d worlds from words or pixels. arXiv preprint arXiv:2507.21809, 2025
-
[36]
Kling-omni technical report
Kling Team, Jialu Chen, Yuanzheng Ci, Xiangyu Du, Zipeng Feng, Kun Gai, Sainan Guo, Feng Han, Jingbin He, Kang He, et al. Kling-omni technical report.arXiv preprint arXiv:2512.16776, 2025
-
[37]
Advancing open-source world models, 2026
Robbyant Team, Zelin Gao, Qiuyu Wang, Yanhong Zeng, Jiapeng Zhu, Ka Leong Cheng, Yixuan Li, Hanlin Wang, Yinghao Xu, Shuailei Ma, Yihang Chen, Jie Liu, Yansong Cheng, Yao Yao, Jiayi Zhu, Yihao Meng, Kecheng Zheng, Qingyan Bai, Jingye Chen, Zehong Shen, Yue Yu, Xing Zhu, Yujun Shen, and Hao Ouyang. Advancing open-source world models, 2026
2026
-
[38]
Advancing open-source world models
Robbyant Team, Zelin Gao, Qiuyu Wang, Yanhong Zeng, Jiapeng Zhu, Ka Leong Cheng, Yixuan Li, Hanlin Wang, Yinghao Xu, Shuailei Ma, et al. Advancing open-source world models.arXiv preprint arXiv:2601.20540, 2026
-
[39]
Deep patch visual odometry, 2023
Zachary Teed, Lahav Lipson, and Jia Deng. Deep patch visual odometry, 2023
2023
-
[40]
MAGI-1: Autoregressive Video Generation at Scale
Hansi Teng, Hongyu Jia, Lei Sun, Lingzhi Li, Maolin Li, Mingqiu Tang, Shuai Han, Tianning Zhang, WQ Zhang, Weifeng Luo, et al. Magi-1: Autoregressive video generation at scale. arXiv preprint arXiv:2505.13211, 2025
-
[41]
Wan: Open and Advanced Large-Scale Video Generative Models
Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 2025
-
[42]
Spatialvid: A large-scale video dataset with spatial annotations
Jiahao Wang, Yufeng Yuan, Rujie Zheng, Youtian Lin, Jian Gao, Lin-Zhuo Chen, Yajie Bao, Yi Zhang, Chang Zeng, Yanxi Zhou, et al. Spatialvid: A large-scale video dataset with spatial annotations.arXiv preprint arXiv:2509.09676, 2025
-
[43]
InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. Internvl3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency.arXiv preprint arXiv:2508.18265, 2025
-
[44]
Worldcompass: Reinforcement learning for long-horizon world models
Zehan Wang, Tengfei Wang, Haiyu Zhang, Xuhui Zuo, Junta Wu, Haoyuan Wang, Wenqiang Sun, Zhenwei Wang, Chenjie Cao, Hengshuang Zhao, et al. Worldcompass: Reinforcement learning for long-horizon world models. arXiv preprint arXiv:2602.09022, 2026
-
[45]
Worldmem: Long-term consistent world simulation with memory
Zeqi Xiao, Yushi Lan, Yifan Zhou, Wenqi Ouyang, Shuai Yang, Yanhong Zeng, and Xingang Pan. Worldmem: Long-term consistent world simulation with memory.arXiv preprint arXiv:2504.12369, 2025
-
[46]
Matrix-3d: Omnidirectional explorable 3d world generation
Zhongqi Yang, Wenhang Ge, Yuqi Li, Jiaqi Chen, Haoyuan Li, Mengyin An, Fei Kang, Hua Xue, Baixin Xu, Yuyang Yin, et al. Matrix-3d: Omnidirectional explorable 3d world generation.arXiv preprint arXiv:2508.08086, 2025
-
[47]
World Action Models are Zero-shot Policies
Seonghyeon Ye, Yunhao Ge, Kaiyuan Zheng, Shenyuan Gao, Sihyun Yu, George Kurian, Suneel Indupuru, You Liang Tan, Chuning Zhu, Jiannan Xiang, et al. World action models are zero-shot policies.arXiv preprint arXiv:2602.15922, 2026
-
[48]
One-step diffusion with distribution matching distillation
Tianwei Yin, Michaël Gharbi, Richard Zhang, Eli Shechtman, Fredo Durand, William T Freeman, and Taesung Park. One-step diffusion with distribution matching distillation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6613–6623, 2024
2024
-
[49]
From slow bidirectional to fast autoregressive video diffusion models
Tianwei Yin, Qiang Zhang, Richard Zhang, William T Freeman, Fredo Durand, Eli Shechtman, and Xun Huang. From slow bidirectional to fast autoregressive video diffusion models. 2025
2025
-
[50]
Context as memory: Scene-consistent interactive long video generation with memory retrieval
Jiwen Yu, Jianhong Bai, Yiran Qin, Quande Liu, Xintao Wang, Pengfei Wan, Di Zhang, and Xihui Liu. Context as memory: Scene-consistent interactive long video generation with memory retrieval. In Proceedings of the SIGGRAPH Asia 2025 Conference Papers, pages 1–11, 2025
2025
-
[51]
Context as memory: Scene-consistent interactive long video generation with memory retrieval
Jiwen Yu, Jianhong Bai, Yiran Qin, Quande Liu, Xintao Wang, Pengfei Wan, Di Zhang, and Xihui Liu. Context as memory: Scene-consistent interactive long video generation with memory retrieval. In SIGGRAPH Asia 2025 Conference Papers, pages 19:1–19:11, 2025
2025
-
[52]
Gamefactory: Creating new games with generative interactive videos
Jiwen Yu, Yiran Qin, Xintao Wang, Pengfei Wan, Di Zhang, and Xihui Liu. Gamefactory: Creating new games with generative interactive videos. InInternational Conference on Computer Vision, 2025
2025
-
[53]
Mosaicmem: Hybrid spatial memory for controllable video world models
Wei Yu, Runjia Qian, Yumeng Li, Liquan Wang, Songheng Yin, Dennis Anthony, Yang Ye, Yidi Li, Weiwei Wan, Animesh Garg, et al. Mosaicmem: Hybrid spatial memory for controllable video world models. arXiv preprint arXiv:2603.17117, 2026
-
[54]
World-in-world: World models in a closed-loop world
Jiahan Zhang, Muqing Jiang, Nanru Dai, Taiming Lu, Arda Uzunoglu, Shunchi Zhang, Yana Wei, Jiahao Wang, Vishal M Patel, Paul Pu Liang, et al. World-in-world: World models in a closed-loop world.arXiv preprint arXiv:2510.18135, 2025
-
[55]
Astrolabe: Steering forward-process reinforcement learning for distilled autoregressive video models
Songchun Zhang, Zeyue Xue, Siming Fu, Jie Huang, Xianghao Kong, Y Ma, Haoyang Huang, Nan Duan, and Anyi Rao. Astrolabe: Steering forward-process reinforcement learning for distilled autoregressive video models. arXiv preprint arXiv:2603.17051, 2026
-
[56]
Matrix-game: Interactive world foundation model
Yifan Zhang, Chunli Peng, Boyang Wang, Puyi Wang, Qingcheng Zhu, Fei Kang, Biao Jiang, Zedong Gao, Eric Li, Yang Liu, and Yahui Zhou. Matrix-game: Interactive world foundation model.arXiv preprint arXiv:2506.18701, 2025
-
[57]
Realcam-vid: High-resolution video dataset with dynamic scenes and metric-scale camera movements, 2025
Guangcong Zheng, Teng Li, Xianpan Zhou, and Xi Li. Realcam-vid: High-resolution video dataset with dynamic scenes and metric-scale camera movements, 2025
2025
-
[58]
Stereo Magnification: Learning View Synthesis using Multiplane Images
Tinghui Zhou, Richard Tucker, John Flynn, Graham Fyffe, and Noah Snavely. Stereo magnification: Learning view synthesis using multiplane images.arXiv preprint arXiv:1805.09817, 2018
-
[59]
Omniworld: A multi-domain and multi-modal dataset for 4d world modeling
Yang Zhou, Yifan Wang, Jianjun Zhou, Wenzheng Chang, Haoyu Guo, Zizun Li, Kaijing Ma, Xinyue Li, Yating Wang, Haoyi Zhu, et al. Omniworld: A multi-domain and multi-modal dataset for 4d world modeling. arXiv preprint arXiv:2509.12201, 2025
-
[60]
Causal forcing: Autoregressive diffusion distillation done right for high-quality real-time interactive video generation
Hongzhou Zhu, Min Zhao, Guande He, Hang Su, Chongxuan Li, and Jun Zhu. Causal forcing: Autoregressive diffusion distillation done right for high-quality real-time interactive video generation. arXiv preprint arXiv:2602.02214, 2026
-
[61]
Turbo-vaed: Fast and stable transfer of video-vaes to mobile devices
Ya Zou, Jingfeng Yao, Siyuan Yu, Shuai Zhang, Wenyu Liu, and Xinggang Wang. Turbo-vaed: Fast and stable transfer of video-vaes to mobile devices. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 14086–14094, 2026
2026