pith. sign in

arxiv: 2606.01164 · v1 · pith:AWVDMIX6new · submitted 2026-05-31 · 💻 cs.CV

Towards Interactive Video World Modeling: Frontiers, Challenges, Benchmarks, and Future Trends

Pith reviewed 2026-06-28 17:27 UTC · model grok-4.3

classification 💻 cs.CV
keywords interactive world modelingaction-conditioned generationvideo generation3D generationworld modelsbenchmarkstechnical challengessimulation controllability
0
0 comments X

The pith

User actions condition video and 3D generation to enable controllable interactive world models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This review examines how world modeling becomes interactive when user actions are explicitly incorporated into the transitions of world states during video or 3D generation. The work organizes recent efforts by application scenarios, methods of state evolution, and scene modalities, then focuses on three technical challenges: action-conditioned controllability, long-horizon interactions with memory, and action-following responsiveness for real-time use. It compares existing benchmarks and metrics across open-world exploration, game engines, autonomous driving, and robotics, and identifies promising future directions for the field.

Core claim

Recent literature shows that embedding user actions directly into world state transitions creates interactive world modeling through an action-conditioned video or 3D generation paradigm. This approach increases controllability over how worlds evolve and lets users freely traverse, manipulate, navigate, and personalize those evolutions. The review maps trends in applications and modalities, details the three core challenges, surveys benchmarks in four domains, and suggests paths toward next-generation systems.

What carries the argument

Action-conditioned video or 3D generation paradigm that incorporates user actions into world state transitions to produce interactivity.

If this is right

  • Controllability over world evolutions increases in domains such as game engines, embodied AI, and autonomous driving.
  • Users can traverse, manipulate, and personalize simulated environments more freely.
  • Standardized benchmarks in four fields enable clearer comparison of progress on controllability and responsiveness.
  • Solving long-horizon memory and real-time action following will support more practical interactive applications.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Improved memory mechanisms could allow simulations to support extended user sessions with consistent state across many actions.
  • The paradigm may connect to embodied systems by letting planners test action sequences in controllable virtual worlds before physical execution.
  • Natural language interfaces could combine with action conditioning to let users issue high-level instructions that shape scene evolution.

Load-bearing premise

The selected literature on application scenarios, technical challenges, and benchmarks represents the current state of interactive world modeling without major omissions or biases in coverage.

What would settle it

Publication of a new survey or set of papers that reveals substantial omitted literature, different dominant challenges, or contradictory benchmark results would show the review's coverage is incomplete.

Figures

Figures reproduced from arXiv: 2606.01164 by Ayush Tewari, Chaojun Ni, Chensheng Peng, Fangjinhua Wang, Jiuming Liu, Marc Pollefeys, Masayoshi Tomizuka, Mengmeng Liu, Per Ola Kristensson, Sitian Shen.

Figure 1
Figure 1. Figure 1: Overview of our survey structure. The survey begins by thoroughly reviewing recent research trends, including application scenarios, world states, interactive modalities. Then, three key technique bottlenecks are systematically extended, ranging from user controllability from actions, long￾horizon interactions, and real-time responsiveness. Finally, we list existing evaluation benchmarks and compare metric… view at source ↗
Figure 2
Figure 2. Figure 2: Comparison of action controllability between video diffusion models and interactive world models. Previous video diffusion mod￾els only exert one-shot instructions [41], while interactive world models leverage frame-level multi-round instructions [42]. transition function T can describe internal world evolutions as p(st+1|st, at). A policy can learn to choose actions that lead to high rewards R extracted f… view at source ↗
Figure 3
Figure 3. Figure 3: Transformation from video diffusion models to interactive world models. Unlike prior general video diffusion models which si￾multaneously output all video frames with bi-directional temporal cues, transforming them into interactive world models requires both causality establishment and action condition [51]. conduct interactive instructions. For example, IWS [50] uses teleoperation as an interface on which… view at source ↗
Figure 5
Figure 5. Figure 5: Comparison of various injection manners for camera actions. We classify main injection manners into four categories: (a) Concatenation with visual tokens [4]; (b) Scaling and Shifting [26]; (c) Camera controlled rendering or simulation [44]; and (d) Matrix transformation [122]. 4.3.1 Injection of Camera Pose or Trajectory Camera orientations and movements are particularly crucial in creating interactive op… view at source ↗
Figure 6
Figure 6. Figure 6: A timeline of works in achieving long-horizon consistency. We comprehensively review existing methods enabling long-term interactions with Memory Construction methods, Noise or Forcing-based methods, and Explicit 3D Reconstruction methods. Notably, existing approaches mostly use history frames autoregressively, which are not included here. the Compounding Errors. To achieve long-duration consis￾tency, vari… view at source ↗
Figure 7
Figure 7. Figure 7: Comparison of different memory constructions. Explicit mem￾ory storing 3DGS [88] or point clouds [44] has better geometry aware￾ness but degraded update feasibility, while implicit memory storing prior video frames has enhanced dynamic handling ability but poor geometric consistency. MosaicMem [98] designs a hybrid memory with mixed advantages from both. view indices, where K most frequently-indexing frame… view at source ↗
Figure 8
Figure 8. Figure 8: Comparison among various diffusion distillation paradigms. Recent methods adopt various distillation approaches in DiT to strengthen long-term consistency, mitigate exposure bias, and enable real-time rollout. Teacher Forcing [130] uses ground truth context during training, causing train-inference mismatch. Diffusion Forcing [126] leverages levels of noise but struggles with real rollout errors. Self-Forci… view at source ↗
read the original abstract

With rapid development of large language models and diffusion-based content generation, world modeling has attracted increasing research attention, benefiting various downstream domains such as game engines, embodied AI, autonomous driving, etc. Through explicitly incorporating user actions into world state transition, recent literature empowers world modeling with interactivity in an action-conditioned video or 3D generation paradigm, further enhancing controllability over world evolutions and facilitating users to freely traverse, manipulate, navigate, and personalize the state evolution. In this paper, we aim to systematically review recent research trends, technical developments, evaluation benchmarks, and also propose future potential directions in interactive world modeling. Specifically, we first summarize recent efforts and trends in terms of application scenarios, world state evolution, and scene modality. Afterwards, we delve into three crucial technical challenges, including action-conditioned controllability, long-horizon interactions and memory, and action-following responsiveness for real-time interactivity. Furthermore, we also thoroughly compare existing benchmarks and metrics in four specific application fields: open-world exploration, game engine, autonomous driving, and robotics. Finally, we discuss several promising future directions in achieving next-generation interactive world modeling. The corresponding repository is publicly available at: https://github.com/liujiuming123/Awesome-Interactive-World-Model.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper is a survey reviewing recent trends in interactive world modeling. It summarizes application scenarios, world state evolution, and scene modalities; examines three technical challenges (action-conditioned controllability, long-horizon interactions and memory, and action-following responsiveness); compares benchmarks and metrics across open-world exploration, game engines, autonomous driving, and robotics; and outlines future directions. The central thesis is that incorporating user actions into state transitions via action-conditioned video/3D generation enables greater interactivity and controllability.

Significance. If the literature selection proves representative, the survey would organize an emerging intersection of diffusion models, video generation, and interactive systems, providing a useful reference for researchers in embodied AI, autonomous driving, and game engines. The public GitHub repository strengthens its role as a living resource.

major comments (2)
  1. [Introduction / § on literature trends] The survey's claim to 'systematically review' the field (abstract and §1) is load-bearing on representativeness of the cited literature. No explicit literature search methodology, inclusion/exclusion criteria, or database sources are provided in the introduction or methods overview, making it impossible to assess potential omissions or selection bias in the coverage of scenarios, challenges, and the four benchmark domains.
  2. [Benchmarks comparison section] § on benchmarks: The 'thorough' comparison across four fields is presented without a summary table enumerating how many papers per field were reviewed or explicit criteria for benchmark inclusion. This weakens the ability to verify completeness of the cross-field analysis.
minor comments (2)
  1. [Abstract] Abstract: 'also propose future potential directions' is redundant; 'future directions' already implies potential.
  2. [Application scenarios / state evolution section] The three challenges are clearly listed, but the transition between 'world state evolution' and 'scene modality' subsections could benefit from an explicit diagram or table showing how modalities map to interactivity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We address each major point below and commit to revisions that improve transparency without altering the core contributions of the survey.

read point-by-point responses
  1. Referee: [Introduction / § on literature trends] The survey's claim to 'systematically review' the field (abstract and §1) is load-bearing on representativeness of the cited literature. No explicit literature search methodology, inclusion/exclusion criteria, or database sources are provided in the introduction or methods overview, making it impossible to assess potential omissions or selection bias in the coverage of scenarios, challenges, and the four benchmark domains.

    Authors: We acknowledge the validity of this observation. The manuscript does not include an explicit methods section detailing the literature search. While the survey is a narrative review drawing on prominent recent works in the rapidly evolving intersection of diffusion models and interactive systems, we agree that greater transparency is warranted. In the revised version, we will insert a dedicated paragraph in Section 1 that describes the literature collection approach: primary sources (arXiv, Google Scholar, CVPR/ICCV/ECCV proceedings 2022–2024), search terms (combinations of “action-conditioned video generation,” “interactive world model,” “action-controllable 3D generation”), and inclusion criteria (papers that explicitly model action-conditioned state transitions in video or 3D). This addition will allow readers to evaluate coverage and potential bias. revision: yes

  2. Referee: [Benchmarks comparison section] § on benchmarks: The 'thorough' comparison across four fields is presented without a summary table enumerating how many papers per field were reviewed or explicit criteria for benchmark inclusion. This weakens the ability to verify completeness of the cross-field analysis.

    Authors: We agree that an explicit accounting would strengthen the section. The current text compares representative benchmarks but does not tabulate paper counts or selection criteria. In the revision we will add a compact table (new Table X) in the benchmarks section that reports, for each of the four domains, (i) the number of papers and benchmarks reviewed, (ii) the inclusion criteria (relevance to action-conditioned generation, public availability of datasets and metrics, coverage of long-horizon or real-time evaluation), and (iii) any notable omissions with brief justification. This change directly addresses the concern about verifiability. revision: yes

Circularity Check

0 steps flagged

No circularity: survey paper with purely descriptive content

full rationale

This is a survey paper that reviews external literature on interactive world modeling, application scenarios, challenges, and benchmarks. It contains no mathematical derivations, equations, fitted parameters, predictions, or ansatzes that could reduce to self-referential inputs. All claims are summaries of cited works; the representativeness of coverage is an external validity concern, not a circularity issue under the defined patterns. No self-citation load-bearing steps or self-definitional reductions exist.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper is a literature survey and does not introduce or rely on new free parameters, axioms, or invented entities; any such elements would come from the reviewed papers rather than this work.

pith-pipeline@v0.9.1-grok · 5795 in / 1076 out tokens · 25936 ms · 2026-06-28T17:27:23.470108+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

225 extracted references · 47 linked inside Pith

  1. [1]

    K. J. W. Craik,The nature of explanation. CUP Archive, 1967, vol. 445

  2. [2]

    World models,

    D. Ha and J. Schmidhuber, “World models,”arXiv preprint arXiv:1803.10122, vol. 2, no. 3, 2018

  3. [3]

    A path towards autonomous machine intelligence version 0.9. 2, 2022-06-27,

    Y. LeCun, “A path towards autonomous machine intelligence version 0.9. 2, 2022-06-27,”Open Review, vol. 62, no. 1, pp. 1–62, 2022

  4. [4]

    ivideogpt: Interactive videogpts are scalable world models,

    J. Wu, S. Yin, N. Feng, X. He, D. Li, J. Hao, and M. Long, “ivideogpt: Interactive videogpts are scalable world models,” NeurIPS, vol. 37, pp. 68 082–68 119, 2024

  5. [6]

    Drivedreamer: Towards real-world-drive world models for au- tonomous driving,

    X. Wang, Z. Zhu, G. Huang, X. Chen, J. Zhu, and J. Lu, “Drivedreamer: Towards real-world-drive world models for au- tonomous driving,” inECCV, 2024, pp. 55–72

  6. [7]

    A compre- hensive survey on world models for embodied ai,

    X. Li, X. He, L. Zhang, M. Wu, X. Li, and Y. Liu, “A compre- hensive survey on world models for embodied ai,”arXiv preprint arXiv:2510.16732, 2025

  7. [8]

    Unrealzoo: Enriching photo-realistic virtual worlds for embod- ied ai,

    F. Zhong, K. Wu, C. Wang, H. Chen, H. Ci, Z. Li, and Y. Wang, “Unrealzoo: Enriching photo-realistic virtual worlds for embod- ied ai,” inICCV, 2025, pp. 5769–5779

  8. [9]

    Navigation world models,

    A. Bar, G. Zhou, D. Tran, T. Darrell, and Y. LeCun, “Navigation world models,” inCVPR, 2025, pp. 15 791–15 801

  9. [10]

    Viva: A video-generative value model for robot reinforcement learning,

    J. Lv, H. Li, J. Li, Y. Nie, F. Kong, Y. Wang, X. Wang, Z. Zhu, C. Ni, Q. Denget al., “Viva: A video-generative value model for robot reinforcement learning,”arXiv preprint arXiv:2604.08168, 2026

  10. [11]

    Matrix-game: Interactive world foundation model,

    Y. Zhang, C. Peng, B. Wang, P . Wang, Q. Zhu, F. Kang, B. Jiang, Z. Gao, E. Li, Y. Liuet al., “Matrix-game: Interactive world foundation model,”arXiv preprint arXiv:2506.18701, 2025

  11. [12]

    Matrix-game 2.0: An open-source real-time and streaming interactive world model,

    X. He, C. Peng, Z. Liu, B. Wang, Y. Zhang, Q. Cui, F. Kang, B. Jiang, M. An, Y. Renet al., “Matrix-game 2.0: An open-source real-time and streaming interactive world model,”arXiv preprint arXiv:2508.13009, 2025

  12. [13]

    Gamefac- tory: Creating new games with generative interactive videos,

    J. Yu, Y. Qin, X. Wang, P . Wan, D. Zhang, and X. Liu, “Gamefac- tory: Creating new games with generative interactive videos,” in ICCV, 2025, pp. 11 590–11 599

  13. [14]

    Building machines that learn and think like people,

    B. M. Lake, T. D. Ullman, J. B. Tenenbaum, and S. J. Gershman, “Building machines that learn and think like people,”Behavioral and brain sciences, vol. 40, p. e253, 2017

  14. [15]

    Genie: Generative interactive environments,

    J. Bruce, M. D. Dennis, A. Edwards, J. Parker-Holder, Y. Shi, E. Hughes, M. Lai, A. Mavalankar, R. Steigerwald, C. Apps, Y. Aytar, S. M. E. Bechtle, F. Behbahani, S. C. Chan, N. Heess, L. Gonzalez, S. Osindero, S. Ozair, S. Reed, J. Zhang, K. Zolna, J. Clune, N. de Freitas, S. Singh, and T. Rockt ¨aschel, “Genie: Generative interactive environments,” inICML, 2024

  15. [16]

    Genie 2: A large-scale foundation world model,

    “Genie 2: A large-scale foundation world model,” 2024. [Online]. Available: https://deepmind.google/discover/blog/ genie-2-a-large-scale-foundation-world-model/

  16. [17]

    Genie 3: A new frontier for world models,

    “Genie 3: A new frontier for world models,” 2025. [Online]. Available: https://deepmind.google/models/genie/

  17. [18]

    Worldgen: From text to traversable and interactive 3d worlds,

    D. Wang, H. Jung, T. Monnier, K. Sohn, C. Zou, X. Xiang, Y.- Y. Yeh, D. Liu, Z. Huang, T. Nguyen-Phuocet al., “Worldgen: From text to traversable and interactive 3d worlds,”arXiv preprint arXiv:2511.16825, 2025

  18. [19]

    Lyra 2.0: Explorable generative 3d worlds,

    T. Shen, S. Bahmani, K. He, S. G. Srinivasan, T. Cao, J. Ren, R. Li, Z. Wang, N. Sharp, Z. Gojcic, S. Fidler, J. Huang, H. Ling, J. Gao, and X. Ren, “Lyra 2.0: Explorable generative 3d worlds,”arXiv preprint arXiv:2604.13036, 2026

  19. [20]

    Pixelverse-r1

    “Pixelverse-r1.” [Online]. Available: https://pixverse.ai/en/ blog/pixverse-r1-next-generation-real-time-world-model

  20. [21]

    Happy oyster

    “Happy oyster.” [Online]. Available: https://www.happyoyster. cn/

  21. [22]

    Available: https://runwayml.com/ research/introducing-runway-gwm-1

    “Gwm-1.” [Online]. Available: https://runwayml.com/ research/introducing-runway-gwm-1

  22. [23]

    Wonderjourney: Going from anywhere to everywhere,

    H.-X. Yu, H. Duan, J. Hur, K. Sargent, M. Rubinstein, W. T. Free- man, F. Cole, D. Sun, N. Snavely, J. Wuet al., “Wonderjourney: Going from anywhere to everywhere,” inCVPR, 2024, pp. 6658– 6667

  23. [24]

    Wonderworld: Interactive 3d scene generation from a single image,

    H.-X. Yu, H. Duan, C. Herrmann, W. T. Freeman, and J. Wu, “Wonderworld: Interactive 3d scene generation from a single image,” inCVPR, 2025, pp. 5916–5926

  24. [25]

    Wonderzoom: Multi-scale 3d world generation,

    J. Cao, H.-X. Yu, and J. Wu, “Wonderzoom: Multi-scale 3d world generation,”arXiv preprint arXiv:2512.09164, 2025

  25. [26]

    Advancing open-source world models,

    R. Team, Z. Gao, Q. Wang, Y. Zeng, J. Zhu, K. L. Cheng, Y. Li, H. Wang, Y. Xu, S. Maet al., “Advancing open-source world models,”arXiv preprint arXiv:2601.20540, 2026

  26. [27]

    Matrix-game 3.0 real-time and streaming interactive world model with long-horizon memory,

    S. AI, “Matrix-game 3.0 real-time and streaming interactive world model with long-horizon memory,”arXiv preprint arXiv:, 2026

  27. [28]

    Yume: An interactive world generation model,

    X. Mao, S. Lin, Z. Li, C. Li, W. Peng, T. He, J. Pang, M. Chi, Y. Qiao, and K. Zhang, “Yume: An interactive world generation model,”arXiv preprint arXiv:2507.17744, 2025

  28. [29]

    Yume-1.5: A text-controlled interactive world generation model,

    X. Mao, Z. Li, C. Li, X. Xu, K. Ying, T. He, J. Pang, Y. Qiao, and K. Zhang, “Yume-1.5: A text-controlled interactive world generation model,”arXiv preprint arXiv:2512.22096, 2025

  29. [30]

    Worldplay: Towards long-term geometric consistency for real-time interactive world modeling,

    W. Sun, H. Zhang, H. Wang, J. Wu, Z. Wang, Z. Wang, Y. Wang, J. Zhang, T. Wang, and C. Guo, “Worldplay: Towards long-term geometric consistency for real-time interactive world modeling,” arXiv preprint arXiv:2512.14614, 2025

  30. [31]

    Vmem: Consistent inter- active video scene generation with surfel-indexed view memory,

    R. Li, P . Torr, A. Vedaldi, and T. Jakab, “Vmem: Consistent inter- active video scene generation with surfel-indexed view memory,” inICCV, 2025, pp. 25 690–25 699

  31. [32]

    Solaris: Building a multiplayer video world model in minecraft,

    G. Savva, O. Michel, D. Lu, S. Waiwitlikhit, T. Meehan, D. Mishra, S. Poddar, J. Lu, and S. Xie, “Solaris: Building a multiplayer video world model in minecraft,”arXiv preprint arXiv:2602.22208, 2026

  32. [33]

    Realwon- der: Real-time physical action-conditioned video generation,

    W. Liu, Z. Chen, Z. Li, Y. Wang, H.-X. Yu, and J. Wu, “Realwon- der: Real-time physical action-conditioned video generation,” arXiv preprint arXiv:2603.05449, 2026

  33. [34]

    Available: https:// danwilliamsphilosophy.com/2018/09/07/ chapter-3-kenneth-craiks-hypothesis-on-the-nature-of-thought/

    [Online]. Available: https:// danwilliamsphilosophy.com/2018/09/07/ chapter-3-kenneth-craiks-hypothesis-on-the-nature-of-thought/

  34. [35]

    From word models to world models: Translating from natural language to the prob- abilistic language of thought,

    L. Wong, G. Grand, A. K. Lew, N. D. Goodman, V . K. Mans- inghka, J. Andreas, and J. B. Tenenbaum, “From word models to world models: Translating from natural language to the prob- abilistic language of thought,”arXiv preprint arXiv:2306.12672, 2023

  35. [36]

    Neoverse: Enhancing 4d world model with in-the-wild monocular videos,

    Y. Yang, L. Fan, Z. Shi, J. Peng, F. Wang, and Z. Zhang, “Neoverse: Enhancing 4d world model with in-the-wild monocular videos,” arXiv preprint arXiv:2601.00393, 2026

  36. [37]

    Wan: Open and advanced large-scale video generative models,

    T. Wan, “Wan: Open and advanced large-scale video generative models,”arXiv preprint arXiv:2503.20314, 2025

  37. [38]

    Available: https://marble.worldlabs.ai/

    “Marble.” [Online]. Available: https://marble.worldlabs.ai/

  38. [39]

    Physgen3d: Crafting a miniature interactive world from a single image,

    B. Chen, H. Jiang, S. Liu, S. Gupta, Y. Li, H. Zhao, and S. Wang, “Physgen3d: Crafting a miniature interactive world from a single image,” inCVPR, 2025, pp. 6178–6189

  39. [40]

    Vdaworld: World modelling via vlm-directed abstraction and simulation,

    F. O’Mahony, R. Cipolla, and A. Tewari, “Vdaworld: World modelling via vlm-directed abstraction and simulation,”arXiv preprint arXiv:2512.11061, 2025

  40. [41]

    Make-a-video: Text-to-video generation without text-video data,

    U. Singer, A. Polyak, T. Hayes, X. Yin, J. An, S. Zhang, Q. Hu, H. Yang, O. Ashual, O. Gafniet al., “Make-a-video: Text-to-video generation without text-video data,”arXiv preprint arXiv:2209.14792, 2022

  41. [42]

    Learning interactive real-world simulators,

    S. Yang, Y. Du, S. K. S. Ghasemipour, J. Tompson, L. P . Kaelbling, D. Schuurmans, and P . Abbeel, “Learning interactive real-world simulators,” inICLR, 2024

  42. [43]

    Adding conditional control to text-to-image diffusion models,

    L. Zhang, A. Rao, and M. Agrawala, “Adding conditional control to text-to-image diffusion models,” inICCV, 2023, pp. 3836–3847

  43. [44]

    Magicworld: Interactive geometry-driven video world exploration,

    G. Li, S. Zheng, S. Xu, J. Chen, B. Li, X. Hu, L. Zhao, and P .- T. Jiang, “Magicworld: Interactive geometry-driven video world exploration,”arXiv preprint arXiv:2511.18886, 2025

  44. [45]

    Worldcompass: Reinforce- ment learning for long-horizon world models,

    Z. Wang, T. Wang, H. Zhang, X. Zuo, J. Wu, H. Wang, W. Sun, Z. Wang, C. Cao, H. Zhaoet al., “Worldcompass: Reinforce- ment learning for long-horizon world models,”arXiv preprint arXiv:2602.09022, 2026

  45. [46]

    Genie envisioner: A unified world foundation platform for robotic manipulation,

    Y. Liao, P . Zhou, S. Huang, D. Yang, S. Chen, Y. Jiang, Y. Hu, J. Cai, S. Liu, J. Luoet al., “Genie envisioner: A unified world foundation platform for robotic manipulation,”arXiv preprint arXiv:2508.05635, 2025

  46. [47]

    The world is your canvas: JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, MAY 2026 17 Painting promptable events with reference images, trajectories, and text,

    H. Wang, H. Ouyang, Q. Wang, Y. Yu, Y. Meng, W. Wang, K. L. Cheng, S. Ma, Q. Bai, Y. Liet al., “The world is your canvas: JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, MAY 2026 17 Painting promptable events with reference images, trajectories, and text,”arXiv preprint arXiv:2512.16924, 2025

  47. [48]

    Exploring the limits of transfer learning with a unified text-to-text transformer,

    C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P . J. Liu, “Exploring the limits of transfer learning with a unified text-to-text transformer,”JMLR, vol. 21, no. 140, pp. 1–67, 2020

  48. [49]

    Perpetualwonder: Long- horizon action-conditioned 4d scene generation,

    J. Zhan, Z. Li, H.-X. Yu, and J. Wu, “Perpetualwonder: Long- horizon action-conditioned 4d scene generation,”arXiv preprint arXiv:2602.04876, 2026

  49. [50]

    Interactive world simulator for robot policy training and evaluation,

    Y. Wang, R. Syed, F. Wu, M. Zhang, A. Onol, J. Barreiros, H. Nayyeri, T. Dear, H. Zhang, and Y. Li, “Interactive world simulator for robot policy training and evaluation,”arXiv preprint arXiv:2603.08546, 2026

  50. [51]

    Vid2world: Crafting video diffusion models to interactive world models,

    S. Huang, J. Wu, Q. Zhou, S. Miao, and M. Long, “Vid2world: Crafting video diffusion models to interactive world models,” in ICLR, 2026

  51. [52]

    Auto-encoding variational bayes,

    D. P . Kingma and M. Welling, “Auto-encoding variational bayes,” arXiv preprint arXiv:1312.6114, 2013

  52. [53]

    Generative adver- sarial nets,

    I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde- Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adver- sarial nets,”NeurIPS, vol. 27, 2014

  53. [54]

    Learn- ing to simulate dynamic environments with gamegan,

    S. W. Kim, Y. Zhou, J. Philion, A. Torralba, and S. Fidler, “Learn- ing to simulate dynamic environments with gamegan,” inCVPR, 2020, pp. 1231–1240

  54. [55]

    Deep unsupervised learning using nonequilibrium thermody- namics,

    J. Sohl-Dickstein, E. Weiss, N. Maheswaranathan, and S. Ganguli, “Deep unsupervised learning using nonequilibrium thermody- namics,” inICML, 2015, pp. 2256–2265

  55. [56]

    Denoising diffusion probabilistic models,

    J. Ho, A. Jain, and P . Abbeel, “Denoising diffusion probabilistic models,”NeurIPS, vol. 33, pp. 6840–6851, 2020

  56. [57]

    Gamegen-x: Interactive open-world game video generation,

    H. Che, X. He, Q. Liu, C. Jin, and H. Chen, “Gamegen-x: Interactive open-world game video generation,”arXiv preprint arXiv:2411.00769, 2024

  57. [58]

    Worldmem: Long-term consistent world simulation with memory,

    Z. Xiao, L. Yushi, Y. Zhou, W. Ouyang, S. Yang, Y. Zeng, and X. Pan, “Worldmem: Long-term consistent world simulation with memory,” inNeurIPS, 2025

  58. [59]

    Fantasyworld: Geometry-consistent world modeling via unified video and 3d prediction,

    Y. Dai, F. Jiang, C. Wang, M. Xu, and Y. Qi, “Fantasyworld: Geometry-consistent world modeling via unified video and 3d prediction,” inICLR, 2026

  59. [60]

    Live: Long-horizon interactive video world modeling,

    J. Huang, Z. Ye, X. Hu, T. He, G. Zhang, S. Shi, J. Bian, and L. Jiang, “Live: Long-horizon interactive video world modeling,” arXiv preprint arXiv:2602.03747, 2026

  60. [61]

    Olaf-world: Orienting latent actions for video world modeling,

    Y. Jiang, Y. Gu, I. W. Tsang, and M. Z. Shou, “Olaf-world: Orienting latent actions for video world modeling,”arXiv preprint arXiv:2602.10104, 2026

  61. [62]

    Worldcam: Interactive autoregres- sive 3d gaming worlds with camera pose as a unifying geometric representation,

    J. Nam, Y. Hong, C.-H. P . Huang, F. Liu, J. Lee, J. Kim, S. Jin, Y. Lee, J. Jung, S. Choiet al., “Worldcam: Interactive autoregres- sive 3d gaming worlds with camera pose as a unifying geometric representation,”arXiv preprint arXiv:2603.16871, 2026

  62. [63]

    Scalable diffusion models with transform- ers,

    W. Peebles and S. Xie, “Scalable diffusion models with transform- ers,” inICCV, 2023, pp. 4195–4205

  63. [64]

    Mineworld: a real-time and open-source interactive world model on minecraft,

    J. Guo, Y. Ye, T. He, H. Wu, Y. Jiang, T. Pearce, and J. Bian, “Mineworld: a real-time and open-source interactive world model on minecraft,”arXiv preprint arXiv:2504.08388, 2025

  64. [65]

    Video2game: Real- time interactive realistic and browser-compatible environment from a single video,

    H. Xia, Z.-H. Lin, W.-C. Ma, and S. Wang, “Video2game: Real- time interactive realistic and browser-compatible environment from a single video,” inCVPR, 2024, pp. 4578–4588

  65. [66]

    Diffusion for world modeling: Visual details matter in atari,

    E. Alonso, A. Jelley, V . Micheli, A. Kanervisto, A. Storkey, T. Pearce, and F. Fleuret, “Diffusion for world modeling: Visual details matter in atari,”NeurIPS, vol. 37, pp. 58 757–58 791, 2024

  66. [67]

    Oasis: A universe in a transformer,

    J. Q. Decart, Q. McIntyre, S. Campbell, X. Chen, and R. Wachen, “Oasis: A universe in a transformer,” 2024

  67. [68]

    Diffusion models are real-time game engines,

    D. Valevski, Y. Leviathan, M. Arar, and S. Fruchter, “Diffusion models are real-time game engines,” inICLR, 2025

  68. [69]

    Adaworld: Learning adaptable world models with latent actions,

    S. Gao, S. Zhou, Y. Du, J. Zhang, and C. Gan, “Adaworld: Learning adaptable world models with latent actions,” inICML, 2025

  69. [70]

    Wonderturbo: Generating interactive 3d world in 0.72 seconds,

    C. Ni, X. Wang, Z. Zhu, W. Wang, H. Li, G. Zhao, J. Li, W. Qin, G. Huang, and W. Mei, “Wonderturbo: Generating interactive 3d world in 0.72 seconds,” inICCV, 2025, pp. 27 423–27 434

  70. [71]

    Long-context state-space video world models,

    R. Po, Y. Nitzan, R. Zhang, B. Chen, T. Dao, E. Shechtman, G. Wetzstein, and X. Huang, “Long-context state-space video world models,” inICCV, 2025, pp. 8733–8744

  71. [72]

    Aether: Geometric-aware unified world modeling,

    H. Zhu, Y. Wang, J. Zhou, W. Chang, Y. Zhou, Z. Li, J. Chen, C. Shen, J. Pang, and T. He, “Aether: Geometric-aware unified world modeling,” inICCV, 2025, pp. 8535–8546

  72. [73]

    Video world models with long-term spatial memory,

    T. Wu, S. Yang, R. Po, Y. Xu, Z. Liu, D. Lin, and G. Wet- zstein, “Video world models with long-term spatial memory,” inNeurIPS, 2025

  73. [74]

    Image as a world: Gener- ating interactive world from single image via panoramic video generation,

    D. Gui, X. Guo, W. Zhou, and Y. Lu, “Image as a world: Gener- ating interactive world from single image via panoramic video generation,” inNeurIPS, 2025

  74. [75]

    The matrix: Infinite-horizon world generation with real-time moving control,

    R. Feng, H. Zhang, Z. Shu, Z. Yang, L. Tang, Z. Wang, A. Zheng, J. Xiao, Z. Liu, R. Chu, Y. Huang, Y. Liu, and H. Zhang, “The matrix: Infinite-horizon world generation with real-time moving control,” inNeurIPS, 2025

  75. [76]

    Deepverse: 4d autoregressive video generation as a world model,

    J. Chen, H. Zhu, X. He, Y. Wang, J. Zhou, W. Chang, Y. Zhou, Z. Li, Z. Fu, J. Panget al., “Deepverse: 4d autoregressive video generation as a world model,”arXiv preprint arXiv:2506.01103, 2025

  76. [77]

    Embodiedgen: Towards a generative 3d world engine for embodied intelligence,

    X. Wang, L. Liu, Y. Cao, R. Wu, W. Qin, D. Wang, W. Sui, and Z. Su, “Embodiedgen: Towards a generative 3d world engine for embodied intelligence,”arXiv preprint arXiv:2506.10600, 2025

  77. [78]

    Hunyuan-gamecraft: High-dynamic interactive game video generation with hybrid history condition,

    J. Li, J. Tang, Z. Xu, L. Wu, Y. Zhou, S. Shao, T. Yu, Z. Cao, and Q. Lu, “Hunyuan-gamecraft: High-dynamic interactive game video generation with hybrid history condition,” 2025

  78. [79]

    Hunyuanworld 1.0: Generating immersive, explorable, and interactive 3d worlds from words or pixels,

    H. Team, Z. Wang, Y. Liu, J. Wu, Z. Gu, H. Wang, X. Zuo, T. Huang, W. Li, S. Zhanget al., “Hunyuanworld 1.0: Generating immersive, explorable, and interactive 3d worlds from words or pixels,”arXiv preprint arXiv:2507.21809, 2025

  79. [80]

    Matrix-3d: Omnidirectional explorable 3d world generation,

    Z. Yang, W. Ge, Y. Li, J. Chen, H. Li, M. An, F. Kang, H. Xue, B. Xu, Y. Yinet al., “Matrix-3d: Omnidirectional explorable 3d world generation,”arXiv preprint arXiv:2508.08086, 2025

  80. [81]

    Training agents inside of scalable world models,

    D. Hafner, W. Yan, and T. Lillicrap, “Training agents inside of scalable world models,”arXiv preprint arXiv:2509.24527, 2025

Showing first 80 references.