Recognition: 2 theorem links
Matrix-Game 3.0: Real-Time and Streaming Interactive World Model with Long-Horizon Memory
Pith reviewed 2026-05-10 18:20 UTC · model grok-4.3
The pith
Matrix-Game 3.0 generates 720p interactive video in real time at 40 frames per second while holding memory consistency over minute-long sequences.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By combining an upgraded infinite data engine, residual modeling with re-injection of imperfect frames for self-correction, camera-aware memory retrieval and injection, and DMD-based multi-segment autoregressive distillation with quantization and pruning, Matrix-Game 3.0 achieves up to 40 FPS real-time 720p generation using a 5B model and maintains stable memory consistency over minute-long sequences.
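As a back-of-envelope check on the headline numbers (the rate and resolution come from the abstract; the constant-rate assumption is ours):

```python
# Budget implied by the paper's headline claim: 40 FPS at 720p,
# sustained over minute-long sequences (constant rate assumed).
FPS = 40
WIDTH, HEIGHT = 1280, 720  # 720p

frame_budget_ms = 1000 / FPS           # latency budget per generated frame
minute_frames = 60 * FPS               # frames a minute-long rollout must stay consistent over
pixels_per_second = WIDTH * HEIGHT * FPS

print(f"per-frame budget: {frame_budget_ms:.1f} ms")            # 25.0 ms
print(f"frames per minute-long rollout: {minute_frames}")       # 2400
print(f"pixel throughput: {pixels_per_second / 1e6:.1f} M px/s")  # 36.9 M px/s
```

The point of the arithmetic: "real time" here means the full pipeline (memory retrieval, denoising, VAE decode) must fit in a 25 ms per-frame budget, and "memory consistency over minute-long sequences" means roughly 2,400 autoregressive frames without drift.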
What carries the argument
Camera-aware memory retrieval and injection combined with residual prediction modeling that re-injects generated frames during training to enable self-correction.
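A minimal sketch of the re-injection idea, built from the error-buffer equations quoted in the Lean-theorem section below (δ = x̂ᵢ − xᵢ, x̃ᵢ = xᵢ + γδ); the function and variable names are ours, not the paper's:

```python
import numpy as np

def reinject(clean_frames: np.ndarray, generated_frames: np.ndarray, gamma: float) -> np.ndarray:
    """Blend the model's own (imperfect) outputs back into the training input.

    delta = x_hat - x is the prediction residual; the conditioning frames become
    x_tilde = x + gamma * delta, so the model trains on inputs carrying its own
    errors and must learn to correct them.
    """
    delta = generated_frames - clean_frames
    return clean_frames + gamma * delta

# Toy usage: gamma = 0 recovers clean teacher forcing, gamma = 1 conditions
# fully on the model's own outputs.
x = np.zeros((4, 8, 8, 3))               # 4 clean conditioning frames
x_hat = x + 0.2                          # model outputs with a constant error
x_tilde = reinject(x, x_hat, gamma=0.5)
print(x_tilde.max())                     # 0.1: half the residual is re-injected
```

On this reading, γ interpolates between the train-time distribution (clean frames) and the test-time distribution (self-generated frames), which is exactly the train-test gap the self-correction objective targets.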
If this is right
- Interactive applications can sustain long-form video generation at real-time speeds without resets or loss of consistency.
- Larger models trained with the same residual and memory methods show improved dynamics and generalization.
- The approach supplies a direct route to deployable industrial-scale world models for simulation and gaming.
- Real-time high-resolution output becomes practical for streaming interactive scenarios.
Where Pith is reading between the lines
- The residual re-injection method may transfer to other video diffusion models to lengthen their reliable generation horizon without extra supervision.
- Camera-aware memory retrieval indicates that explicit viewpoint conditioning is key to preventing drift in 3D-consistent world models.
- If self-correction from noisy self-generated data works reliably, training loops could increasingly rely on the model's own outputs rather than only clean ground truth.
Load-bearing premise
Re-injecting imperfect generated frames during training plus camera-aware memory retrieval will produce long-horizon spatiotemporal consistency without visible drift or compounding errors once the model leaves the training distribution.
What would settle it
Generating minute-long interactive sequences on novel out-of-distribution actions or environments and measuring whether object positions, camera trajectories, and visual details remain consistent without accumulated artifacts or drift.
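One concrete form such a test could take (a hypothetical harness, not the paper's evaluation code): track the predicted camera trajectory against a reference over a minute-long rollout and report where drift first exceeds a tolerance.

```python
import numpy as np

def cumulative_drift(pred_poses: np.ndarray, ref_poses: np.ndarray) -> np.ndarray:
    """Per-frame camera-position error for an (N, 3) rollout vs. reference."""
    return np.linalg.norm(pred_poses - ref_poses, axis=1)

def first_drift_frame(pred_poses: np.ndarray, ref_poses: np.ndarray, tol: float) -> int:
    """Index of the first frame whose error exceeds tol, or -1 if none does."""
    err = cumulative_drift(pred_poses, ref_poses)
    over = np.nonzero(err > tol)[0]
    return int(over[0]) if over.size else -1

# Toy rollout: error grows by 1e-3 per frame over 2400 frames (one minute at 40 FPS).
ref = np.zeros((2400, 3))
pred = np.stack([np.arange(2400) * 1e-3, np.zeros(2400), np.zeros(2400)], axis=1)
print(first_drift_frame(pred, ref, tol=1.0))  # 1001: first frame beyond tolerance
```

The same skeleton extends to object positions or feature embeddings; the decisive comparison is this curve on out-of-distribution actions versus in-distribution ones.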
Original abstract
With the advancement of interactive video generation, diffusion models have increasingly demonstrated their potential as world models. However, existing approaches still struggle to simultaneously achieve memory-enabled long-term temporal consistency and high-resolution real-time generation, limiting their applicability in real-world scenarios. To address this, we present Matrix-Game 3.0, a memory-augmented interactive world model designed for 720p real-time longform video generation. Building upon Matrix-Game 2.0, we introduce systematic improvements across data, model, and inference. First, we develop an upgraded industrial-scale infinite data engine that integrates Unreal Engine-based synthetic data, large-scale automated collection from AAA games, and real-world video augmentation to produce high-quality Video-Pose-Action-Prompt quadruplet data at scale. Second, we propose a training framework for long-horizon consistency: by modeling prediction residuals and re-injecting imperfect generated frames during training, the base model learns self-correction; meanwhile, camera-aware memory retrieval and injection enable the base model to achieve long horizon spatiotemporal consistency. Third, we design a multi-segment autoregressive distillation strategy based on Distribution Matching Distillation (DMD), combined with model quantization and VAE decoder pruning, to achieve efficient real-time inference. Experimental results show that Matrix-Game 3.0 achieves up to 40 FPS real-time generation at 720p resolution with a 5B model, while maintaining stable memory consistency over minute-long sequences. Scaling up to a 2x14B model further improves generation quality, dynamics, and generalization. Our approach provides a practical pathway toward industrial-scale deployable world models.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents Matrix-Game 3.0, a memory-augmented interactive world model extending Matrix-Game 2.0 for 720p real-time long-form video generation. It introduces an industrial-scale data engine generating Video-Pose-Action-Prompt quadruplets from synthetic Unreal Engine data, AAA game collection, and real-world augmentation; a training framework using residual prediction with imperfect-frame re-injection plus camera-aware memory retrieval/injection for long-horizon spatiotemporal consistency; and multi-segment DMD distillation combined with quantization and VAE pruning for efficient inference. The central claim is that the 5B model achieves up to 40 FPS real-time generation while maintaining stable memory consistency over minute-long sequences, with further gains from scaling to a 2x14B model.
Significance. If independently verified, the combination of real-time high-resolution inference with demonstrated minute-scale consistency would constitute a practical advance for deployable world models in interactive applications. The explicit engineering of self-correction via residual modeling and camera-aware retrieval, together with the large-scale quadruplet data pipeline, supplies a concrete recipe that could be adopted or extended by others working on streaming video generation.
Major comments (3)
- [Abstract and §4] Abstract and §4 (Experimental results): The headline performance figures (40 FPS at 720p with the 5B model and stable consistency over minute-long sequences) are reported without any quantitative baseline comparisons, error bars, ablation tables, or failure-case analysis. This absence makes it impossible to isolate the contribution of the residual-prediction objective, the camera-aware memory module, or the DMD distillation from the overall pipeline.
- [§3.2] §3.2 (Training framework for long-horizon consistency): The claim that re-injecting imperfect generated frames produces a robust self-correction attractor rests on the assumption that the residual objective generalizes outside the synthetic/game quadruplet distribution. No experiments or analysis are provided that test for compounding spatiotemporal drift when camera poses or scene dynamics deviate from the training data, which directly bears on the minute-scale consistency claim.
- [§3.1 and §4] §3.1 (Data engine) and §4: The Video-Pose-Action-Prompt quadruplet engine is described as the foundation for both training and evaluation, yet no quantitative metrics (e.g., diversity statistics, pose-estimation accuracy, or distribution-shift measures) are supplied to show how it differs from prior game or synthetic datasets, nor are any cross-dataset generalization results reported.
Minor comments (2)
- [Abstract] The notation “2x14B model” is ambiguous; clarify whether this denotes an ensemble, a mixture-of-experts architecture, or simply two independent 14B models.
- [Figures] Figure captions and the inference pipeline diagram should explicitly label the memory retrieval/injection points and the DMD segment boundaries to improve readability.
Simulated Author's Rebuttal
We thank the referee for the thorough and constructive review. The comments identify key areas where additional evidence would strengthen the manuscript. We address each major comment below and will incorporate the suggested revisions in the next version.
Point-by-point responses
-
Referee: [Abstract and §4] Abstract and §4 (Experimental results): The headline performance figures (40 FPS at 720p with the 5B model and stable consistency over minute-long sequences) are reported without any quantitative baseline comparisons, error bars, ablation tables, or failure-case analysis. This absence makes it impossible to isolate the contribution of the residual-prediction objective, the camera-aware memory module, or the DMD distillation from the overall pipeline.
Authors: We agree that the absence of direct quantitative baselines and component ablations limits the ability to isolate contributions. In the revised manuscript we will add a dedicated comparison table against Matrix-Game 2.0 and other published real-time video generation methods, reporting both speed and long-horizon consistency metrics. We will also include ablation tables that separately disable residual self-correction, camera-aware memory retrieval, and the multi-segment DMD stage, together with error bars computed over multiple evaluation seeds. A short failure-case analysis with representative drift examples will be added to the experimental section. revision: yes
-
Referee: [§3.2] §3.2 (Training framework for long-horizon consistency): The claim that re-injecting imperfect generated frames produces a robust self-correction attractor rests on the assumption that the residual objective generalizes outside the synthetic/game quadruplet distribution. No experiments or analysis are provided that test for compounding spatiotemporal drift when camera poses or scene dynamics deviate from the training data, which directly bears on the minute-scale consistency claim.
Authors: The residual objective is trained on imperfect frames produced by the model itself within the quadruplet distribution, which already contains substantial variation in pose and dynamics. Nevertheless, we acknowledge the lack of explicit out-of-distribution tests. In the revision we will add controlled experiments that perturb camera trajectories and introduce scene elements outside the training distribution, then measure spatiotemporal drift over minute-scale rollouts. These results will be reported alongside the existing consistency metrics to directly address generalization of the self-correction mechanism. revision: yes
-
Referee: [§3.1 and §4] §3.1 (Data engine) and §4: The Video-Pose-Action-Prompt quadruplet engine is described as the foundation for both training and evaluation, yet no quantitative metrics (e.g., diversity statistics, pose-estimation accuracy, or distribution-shift measures) are supplied to show how it differs from prior game or synthetic datasets, nor are any cross-dataset generalization results reported.
Authors: We agree that quantitative characterization of the data pipeline would help readers assess its novelty and coverage. In the revised manuscript we will report diversity statistics (scene category coverage, action entropy, camera trajectory variance), pose-estimation accuracy on a held-out validation set, and distribution-shift metrics (e.g., Fréchet video distance) relative to prior game and synthetic datasets. We will also include cross-dataset generalization results by evaluating the trained model on external real-world video sequences without additional fine-tuning. revision: yes
Circularity Check
No significant circularity in claimed results or methods
Full rationale
The manuscript describes an empirical engineering pipeline (data engine, residual modeling with imperfect-frame re-injection, camera-aware retrieval, and DMD-based distillation) and reports measured outcomes (40 FPS at 720p, minute-scale consistency) from running that pipeline on its own data and models. No mathematical derivation, equation, or theorem is presented that reduces by construction to its own inputs; no load-bearing self-citations or uniqueness theorems are invoked; performance figures are direct experimental measurements rather than independent predictions. The work is therefore self-contained as a system report.
Axiom & Free-Parameter Ledger
free parameters (2)
- 5B and 2x14B model sizes
- DMD distillation segments and quantization bits
axioms (2)
- domain assumption: Re-injecting imperfect frames during training teaches reliable self-correction
- domain assumption: Camera-aware memory retrieval preserves spatiotemporal consistency over minutes
invented entities (2)
- Video-Pose-Action-Prompt quadruplet data engine · no independent evidence
- camera-aware memory retrieval and injection module · no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · relevance unclear · "by modeling prediction residuals and re-injecting imperfect generated frames during training, the base model learns self-correction; meanwhile, camera-aware memory retrieval and injection enable the base model to achieve long horizon spatiotemporal consistency"
- IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · embed_injective · relevance unclear · "Error Buffer ... δ = x̂ᵢ − xᵢ ... x̃ᵢ = xᵢ + γδ"
Forward citations
Cited by 3 Pith papers
-
HorizonDrive: Self-Corrective Autoregressive World Model for Long-horizon Driving Simulation
HorizonDrive enables stable long-horizon autoregressive driving simulation via anti-drifting teacher training with scheduled rollout recovery and teacher rollout distillation.
-
Towards Generalist Game Players: An Investigation of Foundation Models in the Game Multiverse
The paper organizes research on generalist game AI into Dataset, Model, Harness, and Benchmark pillars and charts a five-level progression from single-game mastery to agents that create and live inside game multiverses.
-
Towards Generalist Game Players: An Investigation of Foundation Models in the Game Multiverse
This work traces four eras of generalist game players across dataset, model, harness, and benchmark pillars and charts a five-level roadmap ending in agents that create and evolve within game multiverses.
Reference graph
Works this paper leans on
-
[1]
V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning
Mido Assran, Adrien Bardes, David Fan, Quentin Garrido, Russell Howes, Matthew Muckley, Ammar Rizvi, Claire Roberts, Koustuv Sinha, Artem Zholus, et al. V-jepa 2: Self-supervised video models enable understanding, prediction and planning. arXiv preprint arXiv:2506.09985, 2025
-
[2]
Philip J. Ball, Jakob Bauer, Frank Belletti, Bethanie Brownfield, Ariel Ephrat, Shlomi Fruchter, Agrim Gupta, Kristian Holsheimer, Aleksander Holynski, Jiri Hron, Christos Kaplanis, Marjorie Limont, Matt McGill, Yanko Oliveira, Jack Parker-Holder, Frank Perbet, Guy Scully, Jeremy Shar, Stephen Spencer, Omer Tov, Ruben Villegas, Emma Wang, Jessica Yung, Ci...
2025
-
[3]
Genie: Generative interactive environments
Jake Bruce, Michael D Dennis, Ashley Edwards, Jack Parker-Holder, Yuge Shi, Edward Hughes, Matthew Lai, Aditi Mavalankar, Richie Steigerwald, Chris Apps, et al. Genie: Generative interactive environments. InInternational Conference on Machine Learning, 2024
2024
-
[4]
Mixture of contexts for long video generation
Shengqu Cai, Ceyuan Yang, Lvmin Zhang, Yuwei Guo, Junfei Xiao, Ziyan Yang, Yinghao Xu, Zhenheng Yang, Alan L. Yuille, Leonidas J. Guibas, Maneesh Agrawala, Lu Jiang, and Gordon Wetzstein. Mixture of contexts for long video generation. InInternational Conference on Learning Representations (ICLR), 2026
2026
-
[5]
Diffusion forcing: Next-token prediction meets full-sequence diffusion
Boyuan Chen, Diego Martí Monsó, Yilun Du, Max Simchowitz, Russ Tedrake, and Vincent Sitzmann. Diffusion forcing: Next-token prediction meets full-sequence diffusion.Advances in Neural Information Processing Systems, 37:24081–24125, 2024
2024
-
[6]
Deepverse: 4d autoregressive video generation as a world model
Junyi Chen, Haoyi Zhu, Xianglong He, Yifan Wang, Jianjun Zhou, Wenzheng Chang, Yang Zhou, Zizun Li, Zhoujie Fu, Jiangmiao Pang, et al. Deepverse: 4d autoregressive video generation as a world model. arXiv preprint arXiv:2506.01103, 2025
-
[7]
Lightx2v: Light video generation inference framework, 2025
LightX2V Contributors. Lightx2v: Light video generation inference framework, 2025. GitHub repository
2025
-
[8]
Self-forcing++: Towards minute-scale high-quality video generation
Justin Cui, Jie Wu, Ming Li, Tao Yang, Xiaojie Li, Rui Wang, Andrew Bai, Yuanhao Ban, and Cho-Jui Hsieh. Self-forcing++: Towards minute-scale high-quality video generation. arXiv preprint arXiv:2510.02283, 2025
-
[9]
Lol: Longer than longer, scaling video generation to hour, 2026
Justin Cui, Jie Wu, Ming Li, Tao Yang, Xiaojie Li, Rui Wang, Andrew Bai, Yuanhao Ban, and Cho-Jui Hsieh. Lol: Longer than longer, scaling video generation to hour, 2026
2026
-
[10]
Oasis: A universe in a transformer
Decart. Oasis: A universe in a transformer. 2024
2024
-
[11]
The matrix: Infinite-horizon world generation with real-time moving control
Ruili Feng, Han Zhang, Zhantao Yang, Jie Xiao, Zhilei Shu, Zhiheng Liu, Andy Zheng, Yukun Huang, Yu Liu, and Hongyang Zhang. The matrix: Infinite-horizon world generation with real-time moving control. arXiv preprint arXiv:2412.03568, 2024
-
[12]
Dreamdojo: A generalist robot world model from large-scale human videos
Shenyuan Gao, William Liang, Kaiyuan Zheng, Ayaan Malik, Seonghyeon Ye, Sihyun Yu, Wei-Cheng Tseng, Yuzhu Dong, Kaichun Mo, Chen-Hsuan Lin, et al. Dreamdojo: A generalist robot world model from large-scale human videos.arXiv preprint arXiv:2602.06949, 2026
-
[13]
LTX-2: Efficient Joint Audio-Visual Foundation Model
Yoav HaCohen, Benny Brazowski, Nisan Chiprut, Yaki Bitterman, Andrew Kvochko, Avishai Berkowitz, Daniel Shalem, Daphna Lifschitz, Dudu Moshe, Eitan Porat, et al. Ltx-2: Efficient joint audio-visual foundation model.arXiv preprint arXiv:2601.03233, 2026
-
[14]
Training agents inside of scalable world models
Danijar Hafner, Wilson Yan, and Timothy Lillicrap. Training agents inside of scalable world models.arXiv preprint arXiv:2509.24527, 2025
-
[15]
Matrix-game 2.0: An open-source real-time and streaming interactive world model
Xianglong He, Chunli Peng, Zexiang Liu, Boyang Wang, Yifan Zhang, Qi Cui, Fei Kang, Biao Jiang, Mengyin An, Yangyang Ren, et al. Matrix-game 2.0: An open-source real-time and streaming interactive world model.arXiv preprint arXiv:2508.13009, 2025
-
[16]
Relic: Interactive video world model with long-horizon memory
Yicong Hong, Yiqun Mei, Chongjian Ge, Yiran Xu, Yang Zhou, Sai Bi, Yannick Hold-Geoffroy, Mike Roberts, Matthew Fisher, Eli Shechtman, et al. Relic: Interactive video world model with long-horizon memory.arXiv preprint arXiv:2512.04040, 2025
-
[17]
AstraNav-World: World Model for Foresight Control and Consistency
Junjun Hu, Jintao Chen, Haochen Bai, Minghua Luo, Shichao Xie, Ziyi Chen, Fei Liu, Zedong Chu, Xinda Xue, Botao Ren, et al. Astranav-world: World model for foresight control and consistency.arXiv preprint arXiv:2512.21714, 2025
-
[18]
Vipe: Video pose engine for 3d geometric perception
Jiahui Huang, Qunjie Zhou, Hesam Rabeti, Aleksandr Korovko, Huan Ling, Xuanchi Ren, Tianchang Shen, Jun Gao, Dmitry Slepichev, Chen-Hsuan Lin, et al. Vipe: Video pose engine for 3d geometric perception. arXiv preprint arXiv:2508.10934, 2025
-
[19]
Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion
Xun Huang, Zhengqi Li, Guande He, Mingyuan Zhou, and Eli Shechtman. Self forcing: Bridging the train-test gap in autoregressive video diffusion. arXiv preprint arXiv:2506.08009, 2025
-
[20]
Cosmos Policy: Fine-Tuning Video Models for Visuomotor Control and Planning
Moo Jin Kim, Yihuai Gao, Tsung-Yi Lin, Yen-Chen Lin, Yunhao Ge, Grace Lam, Percy Liang, Shuran Song, Ming-Yu Liu, Chelsea Finn, et al. Cosmos policy: Fine-tuning video models for visuomotor control and planning.arXiv preprint arXiv:2601.16163, 2026
-
[21]
HunyuanVideo: A Systematic Framework For Large Video Generative Models
Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. Hunyuanvideo: A systematic framework for large video generative models.arXiv preprint arXiv:2412.03603, 2024
-
[22]
Marble
World Labs. Marble. https://www.worldlabs.ai/blog/marble-world-model, 2025. Accessed: 2026-03-27
2025
-
[23]
Vmem: Consistent interactive video scene generation with surfel-indexed view memory
Runjia Li, Philip Torr, Andrea Vedaldi, and Tomas Jakab. Vmem: Consistent interactive video scene generation with surfel-indexed view memory. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 25690–25699, 2025
2025
-
[24]
Stable video infinity: Infinite-length video generation with error recycling
Wuyang Li, Wentao Pan, Po-Chien Luan, Yang Gao, and Alexandre Alahi. Stable video infinity: Infinite-length video generation with error recycling. arXiv preprint arXiv:2510.09212, 2025
-
[25]
Dl3dv-10k: A large-scale scene dataset for deep learning-based 3d vision
Lu Ling, Yichen Sheng, Zhi Tu, Wentian Zhao, Cheng Xin, Kun Wan, Lantao Yu, Qianyu Guo, Zixun Yu, Yawen Lu, et al. Dl3dv-10k: A large-scale scene dataset for deep learning-based 3d vision. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22160–22169, 2024
2024
-
[26]
Dl3dv-10k: A large-scale scene dataset for deep learning-based 3d vision, 2023
Lu Ling, Yichen Sheng, Zhi Tu, Wentian Zhao, Cheng Xin, Kun Wan, Lantao Yu, Qianyu Guo, Zixun Yu, Yawen Lu, Xuanmao Li, Xingpeng Sun, Rohan Ashok, Aniruddha Mukherjee, Hao Kang, Xiangrui Kong, Gang Hua, Tianyi Zhang, Bedrich Benes, and Aniket Bera. Dl3dv-10k: A large-scale scene dataset for deep learning-based 3d vision, 2023
2023
-
[27]
Yume-1.5: A text-controlled interactive world generation model
Xiaofeng Mao, Zhen Li, Chuanhao Li, Xiaojie Xu, Kaining Ying, Tong He, Jiangmiao Pang, Yu Qiao, and Kaipeng Zhang. Yume-1.5: A text-controlled interactive world generation model.arXiv preprint arXiv:2512.22096, 2025
-
[28]
Video generation models in robotics: applications, research challenges, future directions
Zhiting Mei, Tenny Yin, Ola Shorinwa, Apurva Badithela, Zhonghe Zheng, Joseph Bruno, Madison Bland, Lihan Zha, Asher Hancock, Jaime Fernández Fisac, et al. Video generation models in robotics-applications, research challenges, future directions.arXiv preprint arXiv:2601.07823, 2026
-
[29]
Sora: Video generation models as world simulators
OpenAI. Sora: Video generation models as world simulators. https://openai.com/index/ video-generation-models-as-world-simulators/, 2024
2024
-
[30]
Genie 2: A large-scale foundation world model
J Parker-Holder, P Ball, J Bruce, V Dasagi, K Holsheimer, C Kaplanis, A Moufarek, G Scully, J Shar, J Shi, et al. Genie 2: A large-scale foundation world model. URL: https://deepmind.google/discover/blog/genie-2-a-large-scale-foundation-world-model, 2024
2024
-
[31]
Scalable diffusion models with transformers
William Peebles and Saining Xie. Scalable diffusion models with transformers. InInternational Conference on Computer Vision, pages 4195–4205, 2023
2023
-
[32]
Worldplay: Towards long-term geometric consistency for real-time interactive world modeling
Wenqiang Sun, Haiyu Zhang, Haoyuan Wang, Junta Wu, Zehan Wang, Zhenwei Wang, Yunhong Wang, Jun Zhang, Tengfei Wang, and Chunchao Guo. Worldplay: Towards long-term geometric consistency for real-time interactive world modeling.arXiv preprint arXiv:2512.14614, 2025
-
[34]
Hunyuan-gamecraft-2: Instruction-following interactive game world model
Junshu Tang, Jiacheng Liu, Jiaqi Li, Longhuang Wu, Haoyu Yang, Penghao Zhao, Siruis Gong, Xiang Yuan, Shuai Shao, Linfeng Zhang, et al. Hunyuan-gamecraft-2: Instruction-following interactive game world model.arXiv preprint arXiv:2511.23429, 2025
-
[35]
Hunyuanworld 1.0: Generating immersive, explorable, and interactive 3d worlds from words or pixels
HunyuanWorld Team, Zhenwei Wang, Yuhao Liu, Junta Wu, Zixiao Gu, Haoyuan Wang, Xuhui Zuo, Tianyu Huang, Wenhuan Li, Sheng Zhang, et al. Hunyuanworld 1.0: Generating immersive, explorable, and interactive 3d worlds from words or pixels. arXiv preprint arXiv:2507.21809, 2025
-
[36]
Kling-omni technical report
Kling Team, Jialu Chen, Yuanzheng Ci, Xiangyu Du, Zipeng Feng, Kun Gai, Sainan Guo, Feng Han, Jingbin He, Kang He, et al. Kling-omni technical report.arXiv preprint arXiv:2512.16776, 2025
-
[37]
Advancing open-source world models, 2026
Robbyant Team, Zelin Gao, Qiuyu Wang, Yanhong Zeng, Jiapeng Zhu, Ka Leong Cheng, Yixuan Li, Hanlin Wang, Yinghao Xu, Shuailei Ma, Yihang Chen, Jie Liu, Yansong Cheng, Yao Yao, Jiayi Zhu, Yihao Meng, Kecheng Zheng, Qingyan Bai, Jingye Chen, Zehong Shen, Yue Yu, Xing Zhu, Yujun Shen, and Hao Ouyang. Advancing open-source world models, 2026
2026
-
[38]
Advancing open-source world models
Robbyant Team, Zelin Gao, Qiuyu Wang, Yanhong Zeng, Jiapeng Zhu, Ka Leong Cheng, Yixuan Li, Hanlin Wang, Yinghao Xu, Shuailei Ma, et al. Advancing open-source world models.arXiv preprint arXiv:2601.20540, 2026
-
[39]
Deep patch visual odometry, 2023
Zachary Teed, Lahav Lipson, and Jia Deng. Deep patch visual odometry, 2023
2023
-
[40]
MAGI-1: Autoregressive Video Generation at Scale
Hansi Teng, Hongyu Jia, Lei Sun, Lingzhi Li, Maolin Li, Mingqiu Tang, Shuai Han, Tianning Zhang, WQ Zhang, Weifeng Luo, et al. Magi-1: Autoregressive video generation at scale. arXiv preprint arXiv:2505.13211, 2025
-
[41]
Wan: Open and Advanced Large-Scale Video Generative Models
Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 2025
-
[42]
Spatialvid: A large-scale video dataset with spatial annotations
Jiahao Wang, Yufeng Yuan, Rujie Zheng, Youtian Lin, Jian Gao, Lin-Zhuo Chen, Yajie Bao, Yi Zhang, Chang Zeng, Yanxi Zhou, et al. Spatialvid: A large-scale video dataset with spatial annotations.arXiv preprint arXiv:2509.09676, 2025
-
[43]
InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. Internvl3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency.arXiv preprint arXiv:2508.18265, 2025
-
[44]
Worldcompass: Reinforcement learning for long-horizon world models
Zehan Wang, Tengfei Wang, Haiyu Zhang, Xuhui Zuo, Junta Wu, Haoyuan Wang, Wenqiang Sun, Zhenwei Wang, Chenjie Cao, Hengshuang Zhao, et al. Worldcompass: Reinforcement learning for long-horizon world models. arXiv preprint arXiv:2602.09022, 2026
-
[45]
Worldmem: Long-term consistent world simulation with memory
Zeqi Xiao, Yushi Lan, Yifan Zhou, Wenqi Ouyang, Shuai Yang, Yanhong Zeng, and Xingang Pan. Worldmem: Long-term consistent world simulation with memory.arXiv preprint arXiv:2504.12369, 2025
-
[46]
Matrix-3d: Omnidirectional explorable 3d world generation
Zhongqi Yang, Wenhang Ge, Yuqi Li, Jiaqi Chen, Haoyuan Li, Mengyin An, Fei Kang, Hua Xue, Baixin Xu, Yuyang Yin, et al. Matrix-3d: Omnidirectional explorable 3d world generation.arXiv preprint arXiv:2508.08086, 2025
-
[47]
World Action Models are Zero-shot Policies
Seonghyeon Ye, Yunhao Ge, Kaiyuan Zheng, Shenyuan Gao, Sihyun Yu, George Kurian, Suneel Indupuru, You Liang Tan, Chuning Zhu, Jiannan Xiang, et al. World action models are zero-shot policies.arXiv preprint arXiv:2602.15922, 2026
-
[48]
One-step diffusion with distribution matching distillation
Tianwei Yin, Michaël Gharbi, Richard Zhang, Eli Shechtman, Fredo Durand, William T Freeman, and Taesung Park. One-step diffusion with distribution matching distillation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6613–6623, 2024
2024
-
[49]
From slow bidirectional to fast autoregressive video diffusion models
Tianwei Yin, Qiang Zhang, Richard Zhang, William T Freeman, Fredo Durand, Eli Shechtman, and Xun Huang. From slow bidirectional to fast autoregressive video diffusion models. 2025
2025
-
[50]
Context as memory: Scene-consistent interactive long video generation with memory retrieval
Jiwen Yu, Jianhong Bai, Yiran Qin, Quande Liu, Xintao Wang, Pengfei Wan, Di Zhang, and Xihui Liu. Context as memory: Scene-consistent interactive long video generation with memory retrieval. In Proceedings of the SIGGRAPH Asia 2025 Conference Papers, pages 1–11, 2025
2025
-
[51]
Context as memory: Scene-consistent interactive long video generation with memory retrieval
Jiwen Yu, Jianhong Bai, Yiran Qin, Quande Liu, Xintao Wang, Pengfei Wan, Di Zhang, and Xihui Liu. Context as memory: Scene-consistent interactive long video generation with memory retrieval. In SIGGRAPH Asia 2025 Conference Papers, pages 19:1–19:11, 2025
2025
-
[52]
Gamefactory: Creating new games with generative interactive videos
Jiwen Yu, Yiran Qin, Xintao Wang, Pengfei Wan, Di Zhang, and Xihui Liu. Gamefactory: Creating new games with generative interactive videos. InInternational Conference on Computer Vision, 2025
2025
-
[53]
Mosaicmem: Hybrid spatial memory for controllable video world models
Wei Yu, Runjia Qian, Yumeng Li, Liquan Wang, Songheng Yin, Dennis Anthony, Yang Ye, Yidi Li, Weiwei Wan, Animesh Garg, et al. Mosaicmem: Hybrid spatial memory for controllable video world models. arXiv preprint arXiv:2603.17117, 2026
-
[54]
World-in-world: World models in a closed-loop world
Jiahan Zhang, Muqing Jiang, Nanru Dai, Taiming Lu, Arda Uzunoglu, Shunchi Zhang, Yana Wei, Jiahao Wang, Vishal M Patel, Paul Pu Liang, et al. World-in-world: World models in a closed-loop world.arXiv preprint arXiv:2510.18135, 2025
-
[55]
Astrolabe: Steering forward-process reinforcement learning for distilled autoregressive video models
Songchun Zhang, Zeyue Xue, Siming Fu, Jie Huang, Xianghao Kong, Y Ma, Haoyang Huang, Nan Duan, and Anyi Rao. Astrolabe: Steering forward-process reinforcement learning for distilled autoregressive video models. arXiv preprint arXiv:2603.17051, 2026
-
[56]
Matrix-game: Interactive world foundation model
Yifan Zhang, Chunli Peng, Boyang Wang, Puyi Wang, Qingcheng Zhu, Fei Kang, Biao Jiang, Zedong Gao, Eric Li, Yang Liu, and Yahui Zhou. Matrix-game: Interactive world foundation model.arXiv preprint arXiv:2506.18701, 2025
-
[57]
Realcam-vid: High-resolution video dataset with dynamic scenes and metric-scale camera movements, 2025
Guangcong Zheng, Teng Li, Xianpan Zhou, and Xi Li. Realcam-vid: High-resolution video dataset with dynamic scenes and metric-scale camera movements, 2025
2025
-
[58]
Stereo Magnification: Learning View Synthesis using Multiplane Images
Tinghui Zhou, Richard Tucker, John Flynn, Graham Fyffe, and Noah Snavely. Stereo magnification: Learning view synthesis using multiplane images.arXiv preprint arXiv:1805.09817, 2018
-
[59]
Omniworld: A multi-domain and multi-modal dataset for 4d world modeling
Yang Zhou, Yifan Wang, Jianjun Zhou, Wenzheng Chang, Haoyu Guo, Zizun Li, Kaijing Ma, Xinyue Li, Yating Wang, Haoyi Zhu, et al. Omniworld: A multi-domain and multi-modal dataset for 4d world modeling. arXiv preprint arXiv:2509.12201, 2025
-
[60]
Causal forcing: Autoregressive diffusion distillation done right for high-quality real-time interactive video generation
Hongzhou Zhu, Min Zhao, Guande He, Hang Su, Chongxuan Li, and Jun Zhu. Causal forcing: Autoregressive diffusion distillation done right for high-quality real-time interactive video generation. arXiv preprint arXiv:2602.02214, 2026
-
[61]
Turbo-vaed: Fast and stable transfer of video-vaes to mobile devices
Ya Zou, Jingfeng Yao, Siyuan Yu, Shuai Zhang, Wenyu Liu, and Xinggang Wang. Turbo-vaed: Fast and stable transfer of video-vaes to mobile devices. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 14086–14094, 2026
2026