pith. machine review for the scientific record.

arxiv: 2604.18564 · v2 · submitted 2026-04-20 · 💻 cs.CV

Recognition: unknown

MultiWorld: Scalable Multi-Agent Multi-View Video World Models

Haoyu Wu, Jiwen Yu, Xihui Liu, Yingtian Zou

Pith reviewed 2026-05-10 05:02 UTC · model grok-4.3

classification 💻 cs.CV
keywords multi-agent world models · multi-view video generation · action-conditioned generation · video world models · multi-robot manipulation · scalable simulation · multi-player environments

The pith

MultiWorld is a unified framework that adds dedicated modules to video world models so they can control multiple agents while keeping observations consistent across different views.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents MultiWorld as a single architecture for generating future video frames in environments that contain several agents and are observed from several viewpoints at once. Prior video world models handled only one agent at a time and could not reliably track interactions or produce matching images from separate cameras. The new design inserts a Multi-Agent Condition Module that routes each agent’s actions into the generation process and a Global State Encoder that shares information across views to prevent contradictions. If these additions work, the resulting videos follow the intended actions more closely and look coherent no matter which camera angle is shown. The approach is tested on multi-player games and teams of robots, where it improves fidelity, action accuracy, and cross-view agreement while allowing the number of agents and views to grow without redesigning the core model.

Core claim

MultiWorld is a unified framework for multi-agent multi-view world modeling that enables accurate control of multiple agents while maintaining multi-view consistency. It introduces the Multi-Agent Condition Module to achieve precise multi-agent controllability and the Global State Encoder to ensure coherent observations across different views. The framework supports flexible scaling of agent and view counts and synthesizes different views in parallel for high efficiency, outperforming baselines in video fidelity, action-following ability, and multi-view consistency on multi-player game environments and multi-robot manipulation tasks.

What carries the argument

The Multi-Agent Condition Module, which injects separate action signals for each agent into the video generator, together with the Global State Encoder, which extracts and broadcasts shared environmental information to keep every view aligned.
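To make that division of labor concrete, here is a minimal PyTorch sketch of the two paths as described here and in the Figure 2 caption (Agent Identity Embedding plus Adaptive Action Weighting; a state pooled across views and broadcast back). The module names, shapes, gating scheme, and the stand-in for the frozen VGGT backbone are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of the two conditioning paths, not the authors' code.
import torch
import torch.nn as nn

class MultiAgentConditionModule(nn.Module):
    """Tag each agent's action with an identity embedding, then pool the
    agent tokens with learned weights into one conditioning vector
    (a guess at 'Adaptive Action Weighting' as a per-agent softmax gate)."""
    def __init__(self, num_agents: int, action_dim: int, cond_dim: int):
        super().__init__()
        self.agent_id = nn.Embedding(num_agents, cond_dim)   # Agent Identity Embedding
        self.action_proj = nn.Linear(action_dim, cond_dim)
        self.gate = nn.Linear(cond_dim, 1)                    # adaptive per-agent weight

    def forward(self, actions: torch.Tensor) -> torch.Tensor:
        # actions: (batch, num_agents, action_dim)
        tokens = self.action_proj(actions) + self.agent_id.weight.unsqueeze(0)
        weights = torch.softmax(self.gate(tokens), dim=1)     # (batch, num_agents, 1)
        return (weights * tokens).sum(dim=1)                  # (batch, cond_dim)

class GlobalStateEncoder(nn.Module):
    """Pool per-view features into one shared state that is broadcast back
    to every view's generation stream (the paper extracts per-view features
    with a frozen VGGT backbone; a mean-pool over views stands in here)."""
    def __init__(self, feat_dim: int):
        super().__init__()
        self.mix = nn.Linear(feat_dim, feat_dim)

    def forward(self, view_feats: torch.Tensor) -> torch.Tensor:
        # view_feats: (batch, num_views, feat_dim) -> (batch, feat_dim)
        return self.mix(view_feats.mean(dim=1))
```

Note the obvious tension in this toy version: a fixed-size identity embedding would not scale to new agent counts, so whatever the actual module does to support the paper's flexible-scaling claim must differ here.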

If this is right

  • Precise following of each agent’s actions in the predicted video frames
  • No visible contradictions between images generated for different camera angles
  • Ability to add or remove agents and views at inference time without retraining
  • Higher measured video quality and action adherence than single-agent baselines on the same tasks
  • Parallel synthesis of all views reduces the time needed to produce a full multi-view prediction
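On the last point, a minimal sketch of what parallel view synthesis could look like: fold the view axis into the batch axis so a single generator call denoises every view at once, with the same shared global state fed to each. The `generator` callable, `global_state`, and all shapes are hypothetical, not taken from the paper.

```python
# Illustrative batching scheme for parallel multi-view synthesis.
import torch

def synthesize_views(generator, latents: torch.Tensor, global_state: torch.Tensor):
    """latents: (batch, views, frames, channels, h, w); global_state: (batch, dim).
    Returns generated frames with the view axis restored."""
    b, v = latents.shape[:2]
    flat = latents.flatten(0, 1)                      # (batch*views, frames, c, h, w)
    cond = global_state.repeat_interleave(v, dim=0)   # same shared state for every view
    out = generator(flat, cond)                       # one parallel forward pass
    return out.view(b, v, *out.shape[1:])             # restore the view axis
```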

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same conditioning and encoding pattern could be tested on real driving footage containing several vehicles observed by multiple roadside cameras.
  • Combining the model with multi-agent reinforcement learning would let planners use the generated videos as fast forward simulations of team behavior.
  • The parallel view generation might support online video prediction for robotic teams that must react to teammates they cannot see directly.
  • Extending the global encoder to handle partial or occluded views could improve robustness when some cameras are blocked.

Load-bearing premise

That inserting the Multi-Agent Condition Module and Global State Encoder into standard video generation networks will produce better controllability and consistency without hidden losses in training stability or ability to generalize to new scenes.

What would settle it

Generate videos for two agents that take conflicting actions while viewed from two cameras; the claim is refuted if the output frames show mismatched object positions or states across the two views or if the agents ignore their individual action inputs.
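One hypothetical way to score the cross-view half of that test, assuming known camera matrices and an external detector that localizes the same object in each generated view: triangulate the two detections and flag a contradiction when the reprojection error is large. Nothing here comes from the paper; the DLT formulation and the pixel tolerance are illustrative.

```python
# Illustrative cross-view consistency check via linear triangulation (DLT).
import numpy as np

def cross_view_mismatch(p1, p2, P1, P2, tol_px: float = 5.0) -> bool:
    """p1, p2: (2,) pixel positions of the same object in views 1 and 2.
    P1, P2: (3, 4) camera projection matrices. Returns True on contradiction."""
    # Stack the DLT constraints u*(P row 3) - (P row 1) = 0, etc.
    A = np.stack([
        p1[0] * P1[2] - P1[0],
        p1[1] * P1[2] - P1[1],
        p2[0] * P2[2] - P2[0],
        p2[1] * P2[2] - P2[1],
    ])
    X = np.linalg.svd(A)[2][-1]            # homogeneous 3D point, (4,)

    def reproject(P):
        x = P @ X
        return x[:2] / x[2]

    err = max(np.linalg.norm(reproject(P1) - p1),
              np.linalg.norm(reproject(P2) - p2))
    return err > tol_px                     # large residual = views disagree
```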

Figures

Figures reproduced from arXiv: 2604.18564 by Haoyu Wu, Jiwen Yu, Xihui Liu, Yingtian Zou.

Figure 1. MultiWorld generates multi-agent multi-view videos.

Figure 2. Pipeline of MultiWorld. We propose MultiWorld, a unified framework for scalable multi-agent multi-view video world modeling. In the Multi-Agent Condition Module (Sec. 3.2), Agent Identity Embedding and Adaptive Action Weighting are employed to achieve multi-agent controllability. In the Global State Encoder (Sec. 3.3), we use a frozen VGGT backbone to extract implicit 3D global environmental information …

Figure 3. Qualitative comparison of multi-agent multi-view video generation.

Figure 4. Multi-Robot Failure Trajectory Simulation.

Figure 5. Long-horizon video generation on multi-robot manipulation task.

Figure 6. MultiWorld Action Controllability. MultiWorld faithfully generates static videos when given static actions. In contrast, action-conditioned video world models often suffer from action bias, where agents tend to move even without explicit action commands.

Figure 7. MultiWorld Multi-Agent Environment Interactions.

Figure 8. MultiWorld Physical Consistency. MultiWorld generates multi-view videos with consistent physical effects across viewpoints. In the upper case, shadows cast by objects remain consistent in two opposite views. In the lower case, MultiWorld simulates footprints as two agents walk on snow, with the deformation persisting correctly across both camera perspectives.

Figure 9. MultiWorld Failure Case Analysis. Agents often appear ambiguous when occupying small regions of the view, due to limited spatial resolution for distant objects.

Figure 10. Multi-Robot Success Trajectory Simulation.
original abstract

Video world models have achieved remarkable success in simulating environmental dynamics in response to actions by users or agents. They are modeled as action-conditioned video generation models that take historical frames and current actions as input to predict future frames. Yet, most existing approaches are limited to single-agent scenarios and fail to capture the complex interactions inherent in real-world multi-agent systems. We present MultiWorld, a unified framework for multi-agent multi-view world modeling that enables accurate control of multiple agents while maintaining multi-view consistency. We introduce the Multi-Agent Condition Module to achieve precise multi-agent controllability, and the Global State Encoder to ensure coherent observations across different views. MultiWorld supports flexible scaling of agent and view counts, and synthesizes different views in parallel for high efficiency. Experiments on multi-player game environments and multi-robot manipulation tasks demonstrate that MultiWorld outperforms baselines in video fidelity, action-following ability, and multi-view consistency. Project page: https://multi-world.github.io/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper introduces MultiWorld, a unified framework for multi-agent multi-view video world modeling. It extends action-conditioned video generation by adding a Multi-Agent Condition Module for precise control of multiple agents and a Global State Encoder to maintain coherent multi-view observations. The framework supports flexible scaling of agent and view counts with parallel view synthesis for efficiency. Experiments on multi-player game environments and multi-robot manipulation tasks are described, with claims of outperformance over baselines in video fidelity, action-following ability, and multi-view consistency.

Significance. If the empirical results hold, the work would extend video world models to handle multi-agent interactions and multi-view consistency, filling a gap in single-agent focused approaches. The design emphasis on scalability and parallel synthesis offers practical advantages for complex environments like robotics and gaming simulations.

major comments (1)
  1. Abstract: The abstract states that MultiWorld 'outperforms baselines in video fidelity, action-following ability, and multi-view consistency' but supplies no quantitative metrics, baseline descriptions, ablation studies, or error analysis. This prevents verification of the central empirical claims from the available text.
minor comments (1)
  1. Abstract: The high-level description of the Multi-Agent Condition Module and Global State Encoder would benefit from more detail on their integration into existing video generation architectures to clarify the technical novelty.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their valuable comments and the opportunity to improve the manuscript. We address the major comment below.

point-by-point responses
  1. Referee: Abstract: The abstract states that MultiWorld 'outperforms baselines in video fidelity, action-following ability, and multi-view consistency' but supplies no quantitative metrics, baseline descriptions, ablation studies, or error analysis. This prevents verification of the central empirical claims from the available text.

    Authors: We agree that the abstract would benefit from including key quantitative results to substantiate the performance claims. The full manuscript provides these details in the Experiments section, including specific metrics for video fidelity, action-following accuracy, and multi-view consistency, along with baseline comparisons and ablation studies. We will revise the abstract to incorporate representative quantitative improvements and baseline references so that the central claims can be verified directly from the abstract.

    revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper introduces an architectural framework for multi-agent multi-view video generation by describing two new modules (Multi-Agent Condition Module and Global State Encoder) plus parallel synthesis for scaling. No equations, derivations, fitted parameters, or predictions are presented that reduce by construction to the inputs or to self-citations. The central claims rest on experimental comparisons in game and robotics environments rather than any self-referential reduction of outputs to fitted quantities or prior author results. This is a standard model-design paper whose evaluation is externally falsifiable.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 2 invented entities

The central claim depends on the effectiveness of two newly proposed architectural components whose performance is asserted via experiments rather than derived from prior principles or external benchmarks.

invented entities (2)
  • Multi-Agent Condition Module (no independent evidence)
    purpose: Achieve precise multi-agent controllability
    Newly introduced component whose independent validation is not provided outside the paper's claims.
  • Global State Encoder (no independent evidence)
    purpose: Ensure coherent observations across different views
    Newly introduced component whose independent validation is not provided outside the paper's claims.

pith-pipeline@v0.9.0 · 5471 in / 1149 out tokens · 62742 ms · 2026-05-10T05:02:15.408382+00:00 · methodology

discussion (0)

