pith. machine review for the scientific record.

arxiv: 2604.18564 · v2 · submitted 2026-04-20 · 💻 cs.CV

Recognition: unknown

MultiWorld: Scalable Multi-Agent Multi-View Video World Models

Haoyu Wu, Jiwen Yu, Xihui Liu, Yingtian Zou

Pith reviewed 2026-05-10 05:02 UTC · model grok-4.3

classification 💻 cs.CV
keywords multi-agent world models · multi-view video generation · action-conditioned generation · video world models · multi-robot manipulation · scalable simulation · multi-player environments

The pith

MultiWorld is a unified framework that adds dedicated modules to video world models so they can control multiple agents while keeping observations consistent across different views.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents MultiWorld as a single architecture for generating future video frames in environments that contain several agents and are observed from several viewpoints at once. Prior video world models handled only one agent at a time and could not reliably track interactions or produce matching images from separate cameras. The new design inserts a Multi-Agent Condition Module that routes each agent’s actions into the generation process and a Global State Encoder that shares information across views to prevent contradictions. If these additions work, the resulting videos follow the intended actions more closely and look coherent no matter which camera angle is shown. The approach is tested on multi-player games and teams of robots, where it improves fidelity, action accuracy, and cross-view agreement while allowing the number of agents and views to grow without redesigning the core model.

Core claim

MultiWorld is a unified framework for multi-agent multi-view world modeling that enables accurate control of multiple agents while maintaining multi-view consistency. It introduces the Multi-Agent Condition Module to achieve precise multi-agent controllability and the Global State Encoder to ensure coherent observations across different views. The framework supports flexible scaling of agent and view counts and synthesizes different views in parallel for high efficiency, outperforming baselines in video fidelity, action-following ability, and multi-view consistency on multi-player game environments and multi-robot manipulation tasks.

What carries the argument

The Multi-Agent Condition Module, which injects separate action signals for each agent into the video generator, together with the Global State Encoder, which extracts and broadcasts shared environmental information to keep every view aligned.
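To make that division of labor concrete, here is a minimal PyTorch sketch of the two paths as described here and in the Figure 2 caption (Agent Identity Embedding plus Adaptive Action Weighting; a state pooled across views and broadcast back). The module names, shapes, gating scheme, and the stand-in for the frozen VGGT backbone are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of the two conditioning paths, not the authors' code.
import torch
import torch.nn as nn

class MultiAgentConditionModule(nn.Module):
    """Tag each agent's action with an identity embedding, then pool the
    agent tokens with learned weights into one conditioning vector
    (a guess at 'Adaptive Action Weighting' as a per-agent softmax gate)."""
    def __init__(self, num_agents: int, action_dim: int, cond_dim: int):
        super().__init__()
        self.agent_id = nn.Embedding(num_agents, cond_dim)   # Agent Identity Embedding
        self.action_proj = nn.Linear(action_dim, cond_dim)
        self.gate = nn.Linear(cond_dim, 1)                    # adaptive per-agent weight

    def forward(self, actions: torch.Tensor) -> torch.Tensor:
        # actions: (batch, num_agents, action_dim)
        tokens = self.action_proj(actions) + self.agent_id.weight.unsqueeze(0)
        weights = torch.softmax(self.gate(tokens), dim=1)     # (batch, num_agents, 1)
        return (weights * tokens).sum(dim=1)                  # (batch, cond_dim)

class GlobalStateEncoder(nn.Module):
    """Pool per-view features into one shared state that is broadcast back
    to every view's generation stream (the paper extracts per-view features
    with a frozen VGGT backbone; a mean-pool over views stands in here)."""
    def __init__(self, feat_dim: int):
        super().__init__()
        self.mix = nn.Linear(feat_dim, feat_dim)

    def forward(self, view_feats: torch.Tensor) -> torch.Tensor:
        # view_feats: (batch, num_views, feat_dim) -> (batch, feat_dim)
        return self.mix(view_feats.mean(dim=1))
```

Note the obvious tension in this toy version: a fixed-size identity embedding would not scale to new agent counts, so whatever the actual module does to support the paper's flexible-scaling claim must differ here.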

If this is right

  • Precise following of each agent’s actions in the predicted video frames
  • No visible contradictions between images generated for different camera angles
  • Ability to add or remove agents and views at inference time without retraining
  • Higher measured video quality and action adherence than single-agent baselines on the same tasks
  • Parallel synthesis of all views reduces the time needed to produce a full multi-view prediction
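On the last point, a minimal sketch of what parallel view synthesis could look like: fold the view axis into the batch axis so a single generator call denoises every view at once, with the same shared global state fed to each. The `generator` callable, `global_state`, and all shapes are hypothetical, not taken from the paper.

```python
# Illustrative batching scheme for parallel multi-view synthesis.
import torch

def synthesize_views(generator, latents: torch.Tensor, global_state: torch.Tensor):
    """latents: (batch, views, frames, channels, h, w); global_state: (batch, dim).
    Returns generated frames with the view axis restored."""
    b, v = latents.shape[:2]
    flat = latents.flatten(0, 1)                      # (batch*views, frames, c, h, w)
    cond = global_state.repeat_interleave(v, dim=0)   # same shared state for every view
    out = generator(flat, cond)                       # one parallel forward pass
    return out.view(b, v, *out.shape[1:])             # restore the view axis
```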

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same conditioning and encoding pattern could be tested on real driving footage containing several vehicles observed by multiple roadside cameras.
  • Combining the model with multi-agent reinforcement learning would let planners use the generated videos as fast forward simulations of team behavior.
  • The parallel view generation might support online video prediction for robotic teams that must react to teammates they cannot see directly.
  • Extending the global encoder to handle partial or occluded views could improve robustness when some cameras are blocked.

Load-bearing premise

That inserting the Multi-Agent Condition Module and Global State Encoder into standard video generation networks will produce better controllability and consistency without hidden losses in training stability or ability to generalize to new scenes.

What would settle it

Generate videos for two agents that take conflicting actions while viewed from two cameras; the claim is refuted if the output frames show mismatched object positions or states across the two views or if the agents ignore their individual action inputs.
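One hypothetical way to score the cross-view half of that test, assuming known camera matrices and an external detector that localizes the same object in each generated view: triangulate the two detections and flag a contradiction when the reprojection error is large. Nothing here comes from the paper; the DLT formulation and the pixel tolerance are illustrative.

```python
# Illustrative cross-view consistency check via linear triangulation (DLT).
import numpy as np

def cross_view_mismatch(p1, p2, P1, P2, tol_px: float = 5.0) -> bool:
    """p1, p2: (2,) pixel positions of the same object in views 1 and 2.
    P1, P2: (3, 4) camera projection matrices. Returns True on contradiction."""
    # Stack the DLT constraints u*(P row 3) - (P row 1) = 0, etc.
    A = np.stack([
        p1[0] * P1[2] - P1[0],
        p1[1] * P1[2] - P1[1],
        p2[0] * P2[2] - P2[0],
        p2[1] * P2[2] - P2[1],
    ])
    X = np.linalg.svd(A)[2][-1]            # homogeneous 3D point, (4,)

    def reproject(P):
        x = P @ X
        return x[:2] / x[2]

    err = max(np.linalg.norm(reproject(P1) - p1),
              np.linalg.norm(reproject(P2) - p2))
    return err > tol_px                     # large residual = views disagree
```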

Figures

Figures reproduced from arXiv: 2604.18564 by Haoyu Wu, Jiwen Yu, Xihui Liu, Yingtian Zou.

Figure 1. MultiWorld generates multi-agent multi-view videos.

Figure 2. Pipeline of MultiWorld. We propose MultiWorld, a unified framework for scalable multi-agent multi-view video world modeling. In the Multi-Agent Condition Module (Sec. 3.2), Agent Identity Embedding and Adaptive Action Weighting are employed to achieve multi-agent controllability. In the Global State Encoder (Sec. 3.3), we use a frozen VGGT backbone to extract implicit 3D global environmental information …

Figure 3. Qualitative comparison of multi-agent multi-view video generation.

Figure 4. Multi-Robot Failure Trajectory Simulation.

Figure 5. Long-horizon video generation on multi-robot manipulation task.

Figure 6. MultiWorld Action Controllability. MultiWorld faithfully generates static videos when given static actions. In contrast, action-conditioned video world models often suffer from action bias, where agents tend to move even without explicit action commands.

Figure 7. MultiWorld Multi-Agent Environment Interactions.

Figure 8. MultiWorld Physical Consistency. MultiWorld generates multi-view videos with consistent physical effects across viewpoints. In the upper case, shadows cast by objects remain consistent in two opposite views. In the lower case, MultiWorld simulates footprints as two agents walk on snow, with the deformation persisting correctly across both camera perspectives.

Figure 9. MultiWorld Failure Case Analysis. Agents often appear ambiguous when occupying small regions of the view, due to limited spatial resolution for distant objects.

Figure 10. Multi-Robot Success Trajectory Simulation.
original abstract

Video world models have achieved remarkable success in simulating environmental dynamics in response to actions by users or agents. They are modeled as action-conditioned video generation models that take historical frames and current actions as input to predict future frames. Yet, most existing approaches are limited to single-agent scenarios and fail to capture the complex interactions inherent in real-world multi-agent systems. We present MultiWorld, a unified framework for multi-agent multi-view world modeling that enables accurate control of multiple agents while maintaining multi-view consistency. We introduce the Multi-Agent Condition Module to achieve precise multi-agent controllability, and the Global State Encoder to ensure coherent observations across different views. MultiWorld supports flexible scaling of agent and view counts, and synthesizes different views in parallel for high efficiency. Experiments on multi-player game environments and multi-robot manipulation tasks demonstrate that MultiWorld outperforms baselines in video fidelity, action-following ability, and multi-view consistency. Project page: https://multi-world.github.io/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper introduces MultiWorld, a unified framework for multi-agent multi-view video world modeling. It extends action-conditioned video generation by adding a Multi-Agent Condition Module for precise control of multiple agents and a Global State Encoder to maintain coherent multi-view observations. The framework supports flexible scaling of agent and view counts with parallel view synthesis for efficiency. Experiments on multi-player game environments and multi-robot manipulation tasks are described, with claims of outperformance over baselines in video fidelity, action-following ability, and multi-view consistency.

Significance. If the empirical results hold, the work would extend video world models to handle multi-agent interactions and multi-view consistency, filling a gap in single-agent focused approaches. The design emphasis on scalability and parallel synthesis offers practical advantages for complex environments like robotics and gaming simulations.

major comments (1)
  1. Abstract: The abstract states that MultiWorld 'outperforms baselines in video fidelity, action-following ability, and multi-view consistency' but supplies no quantitative metrics, baseline descriptions, ablation studies, or error analysis. This prevents verification of the central empirical claims from the available text.
minor comments (1)
  1. Abstract: The high-level description of the Multi-Agent Condition Module and Global State Encoder would benefit from more detail on their integration into existing video generation architectures to clarify the technical novelty.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their valuable comments and the opportunity to improve the manuscript. We address the major comment below.

point-by-point responses
  1. Referee: Abstract: The abstract states that MultiWorld 'outperforms baselines in video fidelity, action-following ability, and multi-view consistency' but supplies no quantitative metrics, baseline descriptions, ablation studies, or error analysis. This prevents verification of the central empirical claims from the available text.

    Authors: We agree that the abstract would benefit from including key quantitative results to substantiate the performance claims. The full manuscript provides these details in the Experiments section, including specific metrics for video fidelity, action-following accuracy, and multi-view consistency, along with baseline comparisons and ablation studies. We will revise the abstract to incorporate representative quantitative improvements and baseline references so that the central claims can be verified directly from the abstract.

    revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper introduces an architectural framework for multi-agent multi-view video generation by describing two new modules (Multi-Agent Condition Module and Global State Encoder) plus parallel synthesis for scaling. No equations, derivations, fitted parameters, or predictions are presented that reduce by construction to the inputs or to self-citations. The central claims rest on experimental comparisons in game and robotics environments rather than any self-referential reduction of outputs to fitted quantities or prior author results. This is a standard model-design paper whose evaluation is externally falsifiable.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 2 invented entities

The central claim depends on the effectiveness of two newly proposed architectural components whose performance is asserted via experiments rather than derived from prior principles or external benchmarks.

invented entities (2)
  • Multi-Agent Condition Module (no independent evidence)
    purpose: Achieve precise multi-agent controllability
    Newly introduced component whose independent validation is not provided outside the paper's claims.
  • Global State Encoder (no independent evidence)
    purpose: Ensure coherent observations across different views
    Newly introduced component whose independent validation is not provided outside the paper's claims.

pith-pipeline@v0.9.0 · 5471 in / 1149 out tokens · 62742 ms · 2026-05-10T05:02:15.408382+00:00 · methodology

discussion (0)

