pith. machine review for the scientific record. sign in

arxiv: 2601.20540 · v1 · submitted 2026-01-28 · 💻 cs.CV

Recognition: 2 theorem links

· Lean Theorem

Advancing Open-source World Models

Authors on Pith no claims yet

Pith reviewed 2026-05-16 09:02 UTC · model grok-4.3

classification 💻 cs.CV
keywords world modelvideo generationopen-source simulatorlong-term memoryreal-time interactionrobot learningsimulation
0
0 comments X

The pith

LingBot-World is an open-source world simulator from video generation that claims high fidelity across environments, minute-scale temporal consistency, and sub-second latency at 16 frames per second.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents LingBot-World as a publicly released world simulator built on video generation methods. It asserts that the model sustains detailed and stable dynamics in realistic, scientific, cartoon, and other settings while maintaining context over minute-long sequences. The work also states that it supports interactive use with generation latency below one second when outputting 16 frames per second. By releasing the code and weights, the authors intend to reduce the performance gap between open and closed systems for tasks in content creation, gaming, and robot learning.

Core claim

LingBot-World is positioned as a top-tier open-source world model that maintains high fidelity and robust dynamics in a broad spectrum of environments, enables a minute-level horizon while preserving contextual consistency known as long-term memory, and supports real-time interactivity with latency under one second when producing 16 frames per second.

What carries the argument

LingBot-World, a video-generation-derived world simulator that produces interactive sequences while preserving dynamics and long-range context.

Load-bearing premise

The released model actually delivers the claimed fidelity, minute-scale consistency, and sub-second latency across the listed environment types.

What would settle it

Independent tests of the released model that show loss of contextual consistency before one minute of generated video or measured latency exceeding one second at 16 fps.

read the original abstract

We present LingBot-World, an open-sourced world simulator stemming from video generation. Positioned as a top-tier world model, LingBot-World offers the following features. (1) It maintains high fidelity and robust dynamics in a broad spectrum of environments, including realism, scientific contexts, cartoon styles, and beyond. (2) It enables a minute-level horizon while preserving contextual consistency over time, which is also known as "long-term memory". (3) It supports real-time interactivity, achieving a latency of under 1 second when producing 16 frames per second. We provide public access to the code and model in an effort to narrow the divide between open-source and closed-source technologies. We believe our release will empower the community with practical applications across areas like content creation, gaming, and robot learning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents LingBot-World, an open-sourced world simulator derived from video generation. It positions the model as top-tier by claiming (1) high fidelity and robust dynamics across realism, scientific, cartoon, and other environments, (2) minute-level temporal horizon with preserved contextual consistency ('long-term memory'), and (3) real-time interactivity with latency under 1 second at 16 fps. The work describes the architecture and announces public release of code and model weights to support applications in content creation, gaming, and robot learning.

Significance. An open release of a world model with claimed long-horizon consistency and real-time performance could meaningfully advance open-source capabilities in computer vision and embodied AI, narrowing the gap with closed-source systems. The emphasis on broad environmental coverage and minute-scale memory addresses recognized challenges in video-based simulation. However, the complete absence of any quantitative evaluation in the manuscript prevents assessment of these claims and therefore limits the work's immediate scientific value.

major comments (2)
  1. [Abstract] Abstract: The three enumerated performance claims (high fidelity across diverse environments, minute-level horizon with contextual consistency, and <1 s latency at 16 fps) are stated without any supporting quantitative results, error metrics, ablation studies, or baseline comparisons. No FVD, FID, CLIP similarity, or long-horizon coherence numbers are reported, rendering the 'top-tier' positioning unverifiable from the manuscript.
  2. [Results/Experiments (absent)] The manuscript contains no results or experiments section. No tables, figures, or text report performance metrics, latency profiling on specified hardware, or comparisons to models such as VideoPoet, Genie, or Stable Video Diffusion. This omission is load-bearing because the central contribution is the claimed superiority in fidelity, consistency, and speed.
minor comments (2)
  1. [Method] The architecture description would benefit from explicit citations to the specific video generation backbones it builds upon and from a clearer statement of any novel modifications introduced for world modeling.
  2. [Conclusion] The public code and model release is noted but the manuscript does not specify the exact repository URL, license, or hardware requirements for reproducing the claimed latency.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback, which correctly identifies the need for quantitative support of our performance claims. We address each major comment below and commit to revisions that will strengthen the manuscript's scientific value.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The three enumerated performance claims (high fidelity across diverse environments, minute-level horizon with contextual consistency, and <1 s latency at 16 fps) are stated without any supporting quantitative results, error metrics, ablation studies, or baseline comparisons. No FVD, FID, CLIP similarity, or long-horizon coherence numbers are reported, rendering the 'top-tier' positioning unverifiable from the manuscript.

    Authors: We agree that the abstract presents the three performance claims without accompanying quantitative metrics, which limits immediate verifiability. The current manuscript prioritizes architectural description and the open-source release, supported by qualitative examples of fidelity, consistency, and interactivity. In revision we will update the abstract to explicitly reference the new quantitative results (FVD, FID, long-horizon coherence scores, and latency benchmarks) that will appear in the added Experiments section, allowing direct assessment of the claims. revision: yes

  2. Referee: [Results/Experiments (absent)] The manuscript contains no results or experiments section. No tables, figures, or text report performance metrics, latency profiling on specified hardware, or comparisons to models such as VideoPoet, Genie, or Stable Video Diffusion. This omission is load-bearing because the central contribution is the claimed superiority in fidelity, consistency, and speed.

    Authors: We acknowledge that the original manuscript lacks a dedicated results or experiments section and does not report the requested metrics or baseline comparisons. The submission focused on model architecture and public code/weights release to enable community use. To address this core concern we will add a full Experiments section containing quantitative evaluations (FVD, FID, coherence over minute-scale horizons), hardware-specific latency profiling at 16 fps, ablation studies, and direct comparisons to VideoPoet, Genie, and Stable Video Diffusion. These additions will substantiate the superiority claims. revision: yes

Circularity Check

0 steps flagged

No derivation chain or equations present; claims are direct assertions

full rationale

The manuscript describes LingBot-World as stemming from video generation and lists three performance features (high fidelity, minute-scale consistency, sub-second latency) but contains no equations, fitted parameters, derivations, or load-bearing steps that reduce to inputs by construction. No self-citations, ansatzes, or uniqueness theorems are invoked in a way that creates circularity. The central claims rest on unverified assertions rather than any mathematical reduction, so the derivation chain is empty and the circularity score is 0.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical model, free parameters, axioms, or new entities are described in the abstract; the contribution is a system release rather than a derivation.

pith-pipeline@v0.9.0 · 5518 in / 1205 out tokens · 46937 ms · 2026-05-16T09:02:34.641612+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 18 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. HorizonDrive: Self-Corrective Autoregressive World Model for Long-horizon Driving Simulation

    cs.CV 2026-05 conditional novelty 7.0

    HorizonDrive enables stable long-horizon autoregressive driving simulation via anti-drifting teacher training with scheduled rollout recovery and teacher rollout distillation.

  2. Being-H0.7: A Latent World-Action Model from Egocentric Videos

    cs.RO 2026-04 unverdicted novelty 7.0

    Being-H0.7 adds future-aware latent reasoning to direct VLA policies via dual-branch alignment on latent queries, matching world-model benefits at VLA efficiency.

  3. MultiWorld: Scalable Multi-Agent Multi-View Video World Models

    cs.CV 2026-04 unverdicted novelty 7.0

    MultiWorld is a scalable framework for multi-agent multi-view video world models that improves controllability and consistency over single-agent baselines in game and robot tasks.

  4. ExoActor: Exocentric Video Generation as Generalizable Interactive Humanoid Control

    cs.RO 2026-04 unverdicted novelty 6.0

    ExoActor uses exocentric video generation to implicitly model robot-environment-object interactions and converts the resulting videos into task-conditioned humanoid control sequences.

  5. Infrastructure-Centric World Models: Bridging Temporal Depth and Spatial Breadth for Roadside Perception

    cs.CV 2026-04 unverdicted novelty 6.0

    Infrastructure-centric world models use roadside sensors' temporal depth to complement vehicle spatial breadth for better traffic simulation and prediction.

  6. Human Cognition in Machines: A Unified Perspective of World Models

    cs.RO 2026-04 unverdicted novelty 6.0

    The paper introduces a unified framework for world models that fully incorporates all cognitive functions from Cognitive Architecture Theory, highlights under-researched areas in motivation and meta-cognition, and pro...

  7. From Synchrony to Sequence: Exo-to-Ego Generation via Interpolation

    cs.CV 2026-04 unverdicted novelty 6.0

    Interpolating exo and ego videos into a single continuous sequence lets diffusion sequence models generate more coherent first-person videos than direct conditioning, even without pose interpolation.

  8. INSPATIO-WORLD: A Real-Time 4D World Simulator via Spatiotemporal Autoregressive Modeling

    cs.CV 2026-04 unverdicted novelty 6.0

    INSPATIO-WORLD is a real-time framework for high-fidelity 4D scene generation and navigation from monocular videos via STAR architecture with implicit caching, explicit geometric constraints, and distribution-matching...

  9. SpatialEdit: Benchmarking Fine-Grained Image Spatial Editing

    cs.CV 2026-04 unverdicted novelty 6.0

    SpatialEdit provides a benchmark, large synthetic dataset, and baseline model for precise object and camera spatial manipulations in images, with the model beating priors on spatial editing.

  10. UNICA: A Unified Neural Framework for Controllable 3D Avatars

    cs.CV 2026-04 unverdicted novelty 6.0

    UNICA unifies motion planning, rigging, physical simulation, and rendering into a single skeleton-free neural framework that produces next-frame 3D avatar geometry from action inputs and renders it with Gaussian splatting.

  11. Video Generation Models as World Models: Efficient Paradigms, Architectures and Algorithms

    eess.IV 2026-03 unverdicted novelty 6.0

    Video generation models can function as world simulators if efficiency gaps in spatiotemporal modeling are bridged via organized paradigms, architectures, and algorithms.

  12. SANA-WM: Efficient Minute-Scale World Modeling with Hybrid Linear Diffusion Transformer

    cs.CV 2026-05 unverdicted novelty 5.0

    SANA-WM is a 2.6B-parameter efficient world model that synthesizes minute-scale 720p videos with 6-DoF camera control, trained on 213K public clips in 15 days on 64 H100s and runnable on single GPUs at 36x higher thro...

  13. Is the Future Compatible? Diagnosing Dynamic Consistency in World Action Models

    cs.RO 2026-05 unverdicted novelty 5.0

    Action-state consistency in World Action Models distinguishes successful from failed imagined futures and supports value-free selection of better rollouts via consensus among predictions.

  14. InSpatio-WorldFM: An Open-Source Real-Time Generative Frame Model

    cs.CV 2026-03 unverdicted novelty 5.0

    InSpatio-WorldFM is a frame-independent generative model that uses explicit 3D anchors and spatial memory to deliver real-time multi-view consistent spatial intelligence via a three-stage training pipeline from pretra...

  15. Awaking Spatial Intelligence in Unified Multimodal Understanding and Generation

    cs.GR 2026-05 unverdicted novelty 4.0

    JoyAI-Image unifies visual understanding, generation, and editing in one model and claims stronger spatial intelligence through bidirectional perception-generation loops.

  16. HY-World 2.0: A Multi-Modal World Model for Reconstructing, Generating, and Simulating 3D Worlds

    cs.CV 2026-04 unverdicted novelty 4.0

    HY-World 2.0 generates and reconstructs high-fidelity navigable 3D Gaussian Splatting worlds from text, images, or videos via upgraded panorama, planning, expansion, and composition modules, with released code claimin...

  17. Matrix-Game 3.0: Real-Time and Streaming Interactive World Model with Long-Horizon Memory

    cs.CV 2026-04 unverdicted novelty 4.0

    Matrix-Game 3.0 delivers 720p real-time video generation at 40 FPS with minute-scale memory consistency by combining residual self-correction training, camera-aware memory injection, and DMD-based autoregressive disti...

  18. OpenWorldLib: A Unified Codebase and Definition of Advanced World Models

    cs.CV 2026-04 unverdicted novelty 4.0

    OpenWorldLib offers a standardized codebase and definition for world models that combine perception, interaction, and memory to understand and predict the world.

Reference graph

Works this paper leans on

92 extracted references · 92 canonical work pages · cited by 18 Pith papers · 32 internal anchors

  1. [1]

    Diffusion for world modeling: Visual details matter in atari

    Eloi Alonso, Adam Jelley, Vincent Micheli, Anssi Kanervisto, Amos J Storkey, Tim Pearce, and François Fleuret. Diffusion for world modeling: Visual details matter in atari. InAdv. Neural Inform. Process. Syst., 2024

  2. [2]

    V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

    Mido Assran, Adrien Bardes, David Fan, Quentin Garrido, Russell Howes, Mojtaba, Komeili, Matthew Muckley, Ammar Rizvi, Claire Roberts, Koustuv Sinha, Artem Zholus, Sergio Arnaud, Abha Gejji, Ada Martin, Francois Robert Hogan, Daniel Dugas, Piotr Bojanowski, Vasil Khalidov, Patrick Labatut, Francisco Massa, Marc Szafraniec, Kapil Krishnakumar, Yong Li, Xia...

  3. [3]

    Scaling instruction-based video editing with a high-quality synthetic dataset.arXiv preprint arXiv:2510.15742, 2025

    Qingyan Bai, Qiuyu Wang, Hao Ouyang, Yue Yu, Hanlin Wang, Wen Wang, Ka Leong Cheng, Shuailei Ma, Yanhong Zeng, Zichen Liu, Yinghao Xu, Yujun Shen, and Qifeng Chen. Scaling instruction-based video editing with a high-quality synthetic dataset.arXiv preprint arXiv:2510.15742, 2025

  4. [4]

    Frozen in time: A joint video and image encoder for end-to-end retrieval

    Max Bain, Arsha Nagrani, Gül Varol, and Andrew Zisserman. Frozen in time: A joint video and image encoder for end-to-end retrieval. InInt. Conf. Comput. Vis., 2021

  5. [5]

    Philip J. Ball, Jakob Bauer, Frank Belletti, Bethanie Brownfield, Ariel Ephrat, Shlomi Fruchter, Agrim Gupta, Kristian Holsheimer, Aleksander Holynski, Jiri Hron, Christos Kaplanis, Marjorie Limont, Matt McGill, Yanko Oliveira, Jack Parker- Holder, Frank Perbet, Guy Scully, Jeremy Shar, Stephen Spencer, Omer Tov, Ruben Villegas, Emma Wang, Jessica Yung, C...

  6. [6]

    Navigation world models

    Amir Bar, Gaoyue Zhou, Danny Tran, Trevor Darrell, and Yann LeCun. Navigation world models. InIEEE Conf. Comput. Vis. Pattern Recog., 2025

  7. [7]

    Lumiere: A space-time diffusion model for video generation

    Omer Bar-Tal, Hila Chefer, Omer Tov, Charles Herrmann, Roni Paiss, Shiran Zada, Ariel Ephrat, Junhwa Hur, Guanghui Liu, Amit Raj, Yuanzhen Li, Michael Rubinstein, Tomer Michaeli, Oliver Wang, Deqing Sun, Tali Dekel, and Inbar Mosseri. Lumiere: A space-time diffusion model for video generation. InSIGGRAPH Asia, 2024

  8. [8]

    Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

    Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram V oleti, Adam Letts, Varun Jampani, and Robin Rombach. Stable video diffusion: Scaling latent video diffusion models to large datasets.arXiv preprint arXiv:2311.15127, 2023

  9. [9]

    Video generation models as world simulators.OpenAI Blog, 2024

    Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, Clarence Ng, Ricky Wang, and Aditya Ramesh. Video generation models as world simulators.OpenAI Blog, 2024

  10. [10]

    Genie: Generative interactive environments

    Jake Bruce, Michael D Dennis, Ashley Edwards, Jack Parker-Holder, Yuge Shi, Edward Hughes, Matthew Lai, Aditi Mavalankar, Richie Steigerwald, Chris Apps, et al. Genie: Generative interactive environments. InInt. Conf. Mach. Learn., 2024

  11. [11]

    Pyscenedetect: An open-source video scene detection program and python library

    Brandon Castellano. Pyscenedetect: An open-source video scene detection program and python library. https://github.com/ Breakthrough/PySceneDetect, 2018

  12. [12]

    Diffusion forcing: Next-token prediction meets full-sequence diffusion

    Boyuan Chen, Diego Martí Monsó, Yilun Du, Max Simchowitz, Russ Tedrake, and Vincent Sitzmann. Diffusion forcing: Next-token prediction meets full-sequence diffusion. InAdv. Neural Inform. Process. Syst., 2024

  13. [13]

    Vl-jepa: Joint embedding predictive architecture for vision-language.arXiv preprint arXiv:2512.10942, 2025

    Delong Chen, Mustafa Shukor, Theo Moutakanni, Willy Chung, Jade Yu, Tejaswi Kasarla, Allen Bolourchi, Yann LeCun, and Pascale Fung. Vl-jepa: Joint embedding predictive architecture for vision-language.arXiv preprint arXiv:2512.10942, 2025

  14. [14]

    SkyReels-V2: Infinite-length Film Generative Model

    Guibin Chen, Dixuan Lin, Jiangping Yang, Chunze Lin, Junchen Zhu, Mingyuan Fan, Hao Zhang, Sheng Chen, Zheng Chen, Chengcheng Ma, Weiming Xiong, Wei Wang, Nuo Pang, Kang Kang, Zhiheng Xu, Yuzhe Jin, Yupeng Liang, Yubing Song, Peng Zhao, Boyuan Xu, Di Qiu, Debang Li, Zhengcong Fei, Yang Li, and Yahui Zhou. Skyreels-v2: Infinite-length film generative model...

  15. [15]

    Sharegpt4video: Improving video understanding and generation with better captions

    Lin Chen, Xilin Wei, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Bin Lin, Zhenyu Tang, Li Yuan, Yu Qiao, Dahua Lin, Feng Zhao, and Jiaqi Wang. Sharegpt4video: Improving video understanding and generation with better captions. InAdv. Neural Inform. Process. Syst., 2024

  16. [16]

    Panda-70m: Captioning 70m videos with multiple cross-modality teachers

    Tsai-Shien Chen, Aliaksandr Siarohin, Willi Menapace, Ekaterina Deyneka, Hsiang-wei Chao, Byung Eun Jeon, Yuwei Fang, Hsin-Ying Lee, Jian Ren, Ming-Hsuan Yang, et al. Panda-70m: Captioning 70m videos with multiple cross-modality teachers. InIEEE Conf. Comput. Vis. Pattern Recog., 2024

  17. [17]

    Scaling egocentric vision: The epic-kitchens dataset

    Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Sanja Fidler, Antonino Furnari, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, and Michael Wray. Scaling egocentric vision: The epic-kitchens dataset. InEur. Conf. Comput. Vis., 2018

  18. [18]

    Unreal Engine.https://www.unrealengine.com/, 2023

    Epic Games. Unreal Engine.https://www.unrealengine.com/, 2023. Accessed: 2026-01-25

  19. [19]

    Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity.JMLR, 2022

    William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity.JMLR, 2022. 27

  20. [20]

    A survey of world models for autonomous driving.arXiv preprint arXiv:2501.11260, 2025

    Tuo Feng, Wenguan Wang, and Yi Yang. A survey of world models for autonomous driving.arXiv preprint arXiv:2501.11260, 2025

  21. [21]

    Generative adversarial nets

    Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. InAdv. Neural Inform. Process. Syst., 2014

  22. [22]

    Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, Miguel Martin, Tushar Nagarajan, Ilija Radosavovic, Santhosh Kumar Ramakrishnan, Fiona Ryan, Jayant Sharma, Michael Wray, Mengmeng Xu, Eric Zhongcong Xu, Chen Zhao, Siddhant Bansal, Dhruv Batra, Vincent Carti...

  23. [23]

    Photorealistic video generation with diffusion models

    Agrim Gupta, Lijun Yu, Kihyuk Sohn, Xiuye Gu, Meera Hahn, Fei-Fei Li, Irfan Essa, Lu Jiang, and José Lezama. Photorealistic video generation with diffusion models. InEur. Conf. Comput. Vis., 2024

  24. [24]

    World Models

    David Ha and Jürgen Schmidhuber. World models.arXiv preprint arXiv:1803.10122, 2018

  25. [25]

    LTX-Video: Realtime Video Latent Diffusion

    Yoav HaCohen, Nisan Chiprut, Benny Brazowski, Daniel Shalem, Dudu Moshe, Eitan Richardson, Eran Levin, Guy Shiran, Nir Zabari, Ori Gordon, Poriya Panet, Sapir Weissbuch, Victor Kulikov, Yaki Bitterman, Zeev Melumian, and Ofir Bibi. Ltx-video: Realtime video latent diffusion.arXiv preprint arXiv:2501.00103, 2024

  26. [26]

    Mastering Diverse Domains through World Models

    Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse domains through world models.arXiv preprint arXiv:2301.04104, 2023

  27. [27]

    Matrix-game 2.0: An open-source real-time and streaming interactive world model

    Xianglong He, Chunli Peng, Zexiang Liu, Boyang Wang, Yifan Zhang, Qi Cui, Fei Kang, Biao Jiang, Mengyin An, Yangyang Ren, Baixin Xu, Hao-Xiang Guo, Kaixiong Gong, Size Wu, Wei Li, Xuchen Song, Yang Liu, Yangguang Li, and Yahui Zhou. Matrix-game 2.0: An open-source real-time and streaming interactive world model.arXiv preprint arXiv:2508.13009, 2025

  28. [28]

    Relic: Interactive video world model with long-horizon memory.arXiv preprint arXiv:2512.04040, 2025

    Yicong Hong, Yiqun Mei, Chongjian Ge, Yiran Xu, Yang Zhou, Sai Bi, Yannick Hold-Geoffroy, Mike Roberts, Matthew Fisher, Eli Shechtman, Kalyan Sunkavalli, Feng Liu, Zhengqi Li, and Hao Tan. Relic: Interactive video world model with long-horizon memory.arXiv preprint arXiv:2512.04040, 2025

  29. [29]

    VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models

    Wenlong Huang, Chen Wang, Ruohan Zhang, Yunzhu Li, Jiajun Wu, and Li Fei-Fei. V oxposer: Composable 3d value maps for robotic manipulation with language models.arXiv preprint arXiv:2307.05973, 2023

  30. [30]

    Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion

    Xun Huang, Zhengqi Li, Guande He, Mingyuan Zhou, and Eli Shechtman. Self forcing: Bridging the train-test gap in autoregressive video diffusion.arXiv preprint arXiv:2506.08009, 2025

  31. [31]

    Vbench: Comprehensive benchmark suite for video generative models

    Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, Yaohui Wang, Xinyuan Chen, Limin Wang, Dahua Lin, Yu Qiao, and Ziwei Liu. Vbench: Comprehensive benchmark suite for video generative models. InIEEE Conf. Comput. Vis. Pattern Recog., 2024

  32. [32]

    DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models

    Sam Ade Jacobs, Masahiro Tanaka, Chengming Zhang, Minjia Zhang, Shuaiwen Leon Song, Samyam Rajbhandari, and Yuxiong He. Deepspeed ulysses: System optimizations for enabling training of extreme long sequence transformer models. arXiv preprint arXiv:2309.14509, 2023

  33. [33]

    Posenet: A convolutional network for real-time 6-dof camera relocalization

    Alex Kendall, Matthew Grimes, and Roberto Cipolla. Posenet: A convolutional network for real-time 6-dof camera relocalization. InInt. Conf. Comput. Vis., 2015

  34. [34]

    3d gaussian splatting for real-time radiance field rendering.ACM Trans

    Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering.ACM Trans. Graph., 2023

  35. [35]

    A path towards autonomous machine intelligence version 0.9

    Yann LeCun. A path towards autonomous machine intelligence version 0.9. 2, 2022-06-27.Open Review, 2022

  36. [36]

    GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding

    Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. Gshard: Scaling giant models with conditional computation and automatic sharding.arXiv preprint arXiv:2006.16668, 2020. 28

  37. [37]

    MegaSaM: Accurate, Fast and Robust Structure and Motion from Casual Dynamic Videos

    Zhengqi Li, Richard Tucker, Forrester Cole, Qianqian Wang, Linyi Jin, Vickie Ye, Angjoo Kanazawa, Aleksander Holynski, and Noah Snavely. MegaSaM: Accurate, Fast and Robust Structure and Motion from Casual Dynamic Videos. InIEEE Conf. Comput. Vis. Pattern Recog., 2025

  38. [38]

    Depth Anything 3: Recovering the Visual Space from Any Views

    Haotong Lin, Sili Chen, Junhao Liew, Donny Y Chen, Zhenyu Li, Guang Shi, Jiashi Feng, and Bingyi Kang. Depth anything 3: Recovering the visual space from any views.arXiv preprint arXiv:2511.10647, 2025

  39. [39]

    Diffusion adversarial post-training for one-step video generation

    Shanchuan Lin, Xin Xia, Yuxi Ren, Ceyuan Yang, Xuefeng Xiao, and Lu Jiang. Diffusion adversarial post-training for one-step video generation. InInt. Conf. Mach. Learn., 2025

  40. [40]

    Autoregressive adversarial post-training for real-time interactive video generation

    Shanchuan Lin, Ceyuan Yang, Hao He, Jianwen Jiang, Yuxi Ren, Xin Xia, Yang Zhao, Xuefeng Xiao, and Lu Jiang. Autoregressive adversarial post-training for real-time interactive video generation.arXiv preprint arXiv:2506.09350, 2025

  41. [41]

    A survey: Learning embodied intelligence from physical simulators and world models.arXiv preprint arXiv:2507.00917, 2025

    Xiaoxiao Long, Qingrui Zhao, Kaiwen Zhang, Zihao Zhang, Dingrui Wang, Yumeng Liu, Zhengjie Shu, Yi Lu, Shouzheng Wang, Xinzhe Wei, Wei Li, Wei Yin, Yao Yao, Jia Pan, Qiu Shen, Ruigang Yang, Xun Cao, and Qionghai Dai. A survey: Learning embodied intelligence from physical simulators and world models.arXiv preprint arXiv:2507.00917, 2025

  42. [42]

    DeepSeek-VL: Towards Real-World Vision-Language Understanding

    Haoyu Lu, Wen Liu, Bo Zhang, Bingxuan Wang, Kai Dong, Bo Liu, Jingxiang Sun, Tongzheng Ren, Zhuoshu Li, Hao Yang, Yaofeng Sun, Chengqi Deng, Hanwei Xu, Zhenda Xie, and Chong Ruan. Deepseek-vl: towards real-world vision-language understanding.arXiv preprint arXiv:2403.05525, 2024

  43. [43]

    Reward forcing: Efficient streaming video generation with rewarded distribution matching distillation.arXiv preprint arXiv:2512.04678, 2025

    Yunhong Lu, Yanhong Zeng, Haobo Li, Hao Ouyang, Qiuyu Wang, Ka Leong Cheng, Jiapeng Zhu, Hengyuan Cao, Zhipeng Zhang, Xing Zhu, Yujun Shen, and Min Zhang. Reward forcing: Efficient streaming video generation with rewarded distribution matching distillation.arXiv preprint arXiv:2512.04678, 2025

  44. [44]

    Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference

    Simian Luo, Yiqin Tan, Longbo Huang, Jian Li, and Hang Zhao. Latent consistency models: Synthesizing high-resolution images with few-step inference.arXiv preprint arXiv:2310.04378, 2023

  45. [45]

    Yume-1.5: A text-controlled interactive world generation model.arXiv preprint arXiv:2512.22096, 2025

    Xiaofeng Mao, Zhen Li, Chuanhao Li, Xiaojie Xu, Kaining Ying, Tong He, Jiangmiao Pang, Yu Qiao, and Kaipeng Zhang. Yume-1.5: A text-controlled interactive world generation model.arXiv preprint arXiv:2512.22096, 2025

  46. [46]

    Holocine: Holistic generation of cinematic multi-shot long video narratives.arXiv preprint arXiv:2510.20822, 2025

    Yihao Meng, Hao Ouyang, Yue Yu, Qiuyu Wang, Wen Wang, Ka Leong Cheng, Hanlin Wang, Yixuan Li, Cheng Chen, Yanhong Zeng, Yujun Shen, and Huamin Qu. Holocine: Holistic generation of cinematic multi-shot long video narratives.arXiv preprint arXiv:2510.20822, 2025

  47. [47]

    Which training methods for GANs do actually converge? InInt

    Lars Mescheder, Andreas Geiger, and Sebastian Nowozin. Which training methods for GANs do actually converge? InInt. Conf. Mach. Learn., 2018

  48. [48]

    Gaia: a benchmark for general ai assistants

    Grégoire Mialon, Clémentine Fourrier, Thomas Wolf, Yann LeCun, and Thomas Scialom. Gaia: a benchmark for general ai assistants. InInt. Conf. Learn. Represent., 2023

  49. [49]

    Directx shader compiler

    Microsoft. Directx shader compiler. https://github.com/microsoft/DirectXShaderCompiler, 2017. Accessed: 2026-01- 25

  50. [50]

    Nerf: Representing scenes as neural radiance fields for view synthesis.Communications of the ACM, 2021

    Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis.Communications of the ACM, 2021

  51. [51]

    A comprehensive survey of mixture-of-experts: Algorithms, theory, and applications.arXiv preprint arXiv:2503.07137, 2025

    Siyuan Mu and Sen Lin. A comprehensive survey of mixture-of-experts: Algorithms, theory, and applications.arXiv preprint arXiv:2503.07137, 2025

  52. [52]

    Cosmos World Foundation Model Platform for Physical AI

    NVIDIA. Cosmos world foundation model platform for physical ai.arXiv preprint arXiv:2501.03575, 2025

  53. [53]

    World Simulation with Video Foundation Models for Physical AI

    NVIDIA. World simulation with video foundation models for physical ai.arXiv preprint arXiv:2511.00062, 2025

  54. [54]

    GPT-4 Technical Report

    OpenAI. GPT-4 technical report.arXiv preprint arXiv:2303.08774, 2023

  55. [55]

    Scalable diffusion models with transformers

    William Peebles and Saining Xie. Scalable diffusion models with transformers. InInt. Conf. Comput. Vis., 2023

  56. [56]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. InInt. Conf. Mach. Learn., 2021

  57. [57]

    Cosmos-drive-dreams: Scalable synthetic driving data generation with world foundation models.arXiv preprint arXiv:2506.09042, 2025

    Xuanchi Ren, Yifan Lu, Tianshi Cao, Ruiyuan Gao, Shengyu Huang, Amirmojtaba Sabour, Tianchang Shen, Tobias Pfaff, Jay Zhangjie Wu, Runjian Chen, Seung Wook Kim, Jun Gao, Laura Leal-Taixe, Mike Chen, Sanja Fidler, and Huan Ling. Cosmos-drive-dreams: Scalable synthetic driving data generation with world foundation models.arXiv preprint arXiv:2506.09042, 2025. 29

  58. [58]

    GAIA-2: A Controllable Multi-View Generative World Model for Autonomous Driving

    Lloyd Russell, Anthony Hu, Lorenzo Bertoni, George Fedoseev, Jamie Shotton, Elahe Arani, and Gianluca Corrado. Gaia-2: A controllable multi-view generative world model for autonomous driving.arXiv preprint arXiv:2503.20523, 2025

  59. [59]

    Progressive Distillation for Fast Sampling of Diffusion Models

    Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models.arXiv preprint arXiv:2202.00512, 2022

  60. [60]

    MAGI-1: Autoregressive Video Generation at Scale

    Sand.ai. Magi-1: Autoregressive video generation at scale.arXiv preprint arXiv:2505.13211, 2025

  61. [61]

    Superglue: Learning feature matching with graph neural networks

    Paul-Edouard Sarlin, Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. Superglue: Learning feature matching with graph neural networks. InIEEE Conf. Comput. Vis. Pattern Recog., 2020

  62. [62]

    Structure-from-motion revisited

    Johannes L Schonberger and Jan-Michael Frahm. Structure-from-motion revisited. InIEEE Conf. Comput. Vis. Pattern Recog., 2016

  63. [63]

    Seedance 1.0: Exploring the Boundaries of Video Generation Models

    ByteDance Seed. Seedance 1.0: Exploring the boundaries of video generation models.arXiv preprint arXiv:2506.09113, 2025

  64. [64]

    Make-A-Video: Text-to-Video Generation without Text-Video Data

    Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, Devi Parikh, Sonal Gupta, and Yaniv Taigman. Make-a-video: Text-to-video generation without text-video data.arXiv preprint arXiv:2209.14792, 2022

  65. [65]

    Consistency Models

    Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever. Consistency models.arXiv preprint arXiv:2303.01469, 2023

  66. [66]

    UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild

    Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. Ucf101: A dataset of 101 human actions classes from videos in the wild.arXiv preprint arXiv:1212.0402, 2012

  67. [67]

    Transnet v2: An effective deep network architecture for fast shot transition detection

    Tomás Soucek and Jakub Lokoc. Transnet v2: An effective deep network architecture for fast shot transition detection. InACM Int. Conf. Multimedia, 2024

  68. [68]

    WorldPlay: Towards Long-Term Geometric Consistency for Real-Time Interactive World Modeling

    Wenqiang Sun, Haiyu Zhang, Haoyuan Wang, Junta Wu, Zehan Wang, Zhenwei Wang, Yunhong Wang, Jun Zhang, Tengfei Wang, and Chunchao Guo. Worldplay: Towards long-term geometric consistency for real-time interactive world modeling. arXiv preprint arXiv:2512.14614, 2025

  69. [69]

    Hunyuan-gamecraft-2: Instruction-following interactive game world model.arXiv preprint arXiv:2511.23429, 2025

    Junshu Tang, Jiacheng Liu, Jiaqi Li, Longhuang Wu, Haoyu Yang, Penghao Zhao, Siruis Gong, Xiang Yuan, Shuai Shao, and Qinglin Lu. Hunyuan-gamecraft-2: Instruction-following interactive game world model.arXiv preprint arXiv:2511.23429, 2025

  70. [70]

    Gemini: A Family of Highly Capable Multimodal Models

    Google Gemini Team. Gemini: a family of highly capable multimodal models.arXiv preprint arXiv:2312.11805, 2023

  71. [71]

    HunyuanVideo: A Systematic Framework For Large Video Generative Models

    Hunyuan Foundation Model Team. Hunyuanvideo: A systematic framework for large video generative models.arXiv preprint arXiv:2412.03603, 2024

  72. [72]

    Longcat-video technical report.arXiv preprint arXiv:2510.22200, 2025

    Meituan LongCat Team. Longcat-video technical report.arXiv preprint arXiv:2510.22200, 2025

  73. [73]

    Mirage 2.https://www.mirage2.org/

    Mirage Team. Mirage 2.https://www.mirage2.org/. Accessed: 2026-01-26

  74. [74]

    Pan: A world model for general, interactable, and long-horizon world simulation.arXiv preprint arXiv:2511.09057, 2025

    PAN Team. Pan: A world model for general, interactable, and long-horizon world simulation.arXiv preprint arXiv:2511.09057, 2025

  75. [75]

    Qwen3-VL Technical Report

    Qwen Team. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025

  76. [76]

    Seedance 1.5 pro: A Native Audio-Visual Joint Generation Foundation Model

    Seedance Team. Seedance 1.5 pro: A native audio-visual joint generation foundation model.arXiv preprint arXiv:2512.13507, 2025

  77. [77]

    Wan: Open and Advanced Large-Scale Video Generative Models

    Wan Team. Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 2025

  78. [78]

    Diffusion models are real-time game engines

    Dani Valevski, Yaniv Leviathan, Moab Arar, and Shlomi Fruchter. Diffusion models are real-time game engines.arXiv preprint arXiv:2408.14837, 2024

  79. [79]

    The world is your canvas: Painting promptable events with reference images, trajectories, and text.arXiv preprint arXiv:2512.16924, 2025

    Hanlin Wang, Hao Ouyang, Qiuyu Wang, Yue Yu, Yihao Meng, Wen Wang, Ka Leong Cheng, Shuailei Ma, Qingyan Bai, Yixuan Li, Cheng Chen, Yanhong Zeng, Xing Zhu, Yujun Shen, and Qifeng Chen. The world is your canvas: Painting promptable events with reference images, trajectories, and text.arXiv preprint arXiv:2512.16924, 2025

  80. [80]

    Spatialvid: A large-scale video dataset with spatial annotations.arXiv preprint arXiv:2509.09676, 2025

    Jiahao Wang, Yufeng Yuan, Rujie Zheng, Youtian Lin, Jian Gao, Lin-Zhuo Chen, Yajie Bao, Yi Zhang, Chang Zeng, Yanxi Zhou, Xiao-Xiao Long, Hao Zhu, Zhaoxiang Zhang, Xun Cao, and Yao Yao. Spatialvid: A large-scale video dataset with spatial annotations.arXiv preprint arXiv:2509.09676, 2025

Showing first 80 references.