pith. machine review for the scientific record.

arxiv: 2605.01896 · v1 · submitted 2026-05-03 · 💻 cs.CV

Recognition: unknown

Divide and Conquer: Decoupled Representation Alignment for Multimodal World Models

Junyuan Xiao , Dingkang Liang , Xin Zhou , Yixuan Ye , Tongtong Su , Guangmo Yi , Bin Xia , Qiang Lyu , Shurui Shi , Jun Huang , Jianlou Si , Wenming Yang

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 15:47 UTC · model grok-4.3

classification 💻 cs.CV
keywords multi-modal world models · representation alignment · diffusion models · video generation · foundation models · decoupled features · multimodal generation · representation learning

The pith

Decoupling modality features in a diffusion model allows each to align with a separate foundation model expert for improved multi-modal video generation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces M²-REPA, a representation alignment technique for multi-modal world models that generate videos across modalities such as RGB, depth, and masks. It claims that foundation models trained on single modalities hold distinct priors that act as complementary experts, and that these can be fully exploited by first separating modality-specific features from the diffusion model's intermediate layers and then aligning each feature set to its matching expert. Two objectives support this: an alignment loss that matches features to experts and a decoupling regularization that preserves complementarity across modalities. The result is joint optimization that avoids conflicts and yields better visual quality and longer-term consistency than prior approaches.
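To make the mechanism concrete, the following is a minimal PyTorch-style sketch of the decoupled alignment idea described above. The module shapes, projector design, loss weighting, and the plain cross-branch cosine penalty standing in for the paper's CKA-driven regularizer are all illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only: shapes, projector design, and the simple cosine
# redundancy penalty (standing in for the paper's CKA-based regularizer) are
# assumptions, not the released M²-REPA code.
import torch
import torch.nn as nn
import torch.nn.functional as F

MODALITIES = ["rgb", "depth", "mask"]  # experts: e.g. DINOv2, DepthAnythingV2, SAM2


class DecoupledAlignment(nn.Module):
    def __init__(self, diff_dim: int = 1024, expert_dim: int = 768):
        super().__init__()
        # One lightweight MLP projector per modality: this is the "decoupling"
        # step that splits shared diffusion features into modality-specific branches.
        # (Real experts have different widths; a single expert_dim keeps the sketch simple.)
        self.projectors = nn.ModuleDict({
            m: nn.Sequential(nn.Linear(diff_dim, diff_dim), nn.SiLU(),
                             nn.Linear(diff_dim, expert_dim))
            for m in MODALITIES
        })

    def forward(self, diff_feats, expert_feats):
        """diff_feats: (B, N, diff_dim) intermediate diffusion features.
        expert_feats: {modality: (B, N, expert_dim)} frozen expert features."""
        branches = {m: self.projectors[m](diff_feats) for m in MODALITIES}

        # (1) Alignment loss: maximize cosine similarity between each decoupled
        # branch and its matching expert's features.
        align = -torch.stack([
            F.cosine_similarity(branches[m], expert_feats[m], dim=-1).mean()
            for m in MODALITIES
        ]).mean()

        # (2) Decoupling regularizer: penalize redundancy between branches so they
        # stay complementary (the paper drives this with CKA similarity instead).
        pair_sims = []
        for i, mi in enumerate(MODALITIES):
            for mj in MODALITIES[i + 1:]:
                pair_sims.append(F.cosine_similarity(
                    branches[mi].flatten(1), branches[mj].flatten(1), dim=-1).abs().mean())
        decouple = torch.stack(pair_sims).mean()
        return align, decouple


# Usage sketch: the two auxiliary terms would be added to the diffusion objective,
# e.g. total = loss_diffusion + lam_align * align + lam_dec * decouple.
if __name__ == "__main__":
    head = DecoupledAlignment()
    feats = torch.randn(2, 16, 1024)
    experts = {m: torch.randn(2, 16, 768) for m in MODALITIES}
    align, decouple = head(feats, experts)
    print(float(align), float(decouple))
```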

Core claim

The central claim is that decoupling modality-specific features from the diffusion model's intermediate representations, and aligning each decoupled feature to its corresponding expert foundation model through a multi-modal representation alignment loss together with a modality-specific decoupling regularization, enables joint optimization that fully exploits the distinct domain-specific priors of multiple foundation models.

What carries the argument

The M²-REPA method, which decouples modality-specific features from diffusion representations and aligns them individually to expert foundation models using alignment and regularization losses.

If this is right

  • Multi-modal videos achieve higher visual quality by drawing on specialized priors from individual foundation models.
  • Long-term consistency improves because complementary expert knowledge is preserved across frames.
  • Joint training of the diffusion model becomes feasible while retaining the benefits of modality-specific pre-training.
  • The approach outperforms standard baselines across metrics for quality and temporal coherence in experiments.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same separation-and-align strategy could apply to other multi-modal generation tasks such as audio-video or text-image synthesis.
  • It offers a route to combine many existing single-modality experts without retraining a single giant model from scratch.
  • Adding further modalities might scale the quality gains without proportional increases in training cost.

Load-bearing premise

That foundation models for different modalities each carry distinct complementary information and that a decoupling step can isolate their features without erasing useful shared structure between modalities.

What would settle it

An ablation experiment that removes the decoupling regularization and measures whether performance on visual quality and long-term consistency drops compared to the full method, or whether aligning all modalities to a single mismatched expert still matches the reported gains.
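Read concretely, the settling experiment amounts to a small variant grid. The sketch below is a hypothetical way to organize it: train_and_eval and the configuration keys are stand-ins invented here, while FVD-8 and FVD-200 are the short- and long-horizon metrics named in the paper's Figure 5.

```python
# Hypothetical ablation grid for the settling experiment described above;
# train_and_eval is a stand-in for whatever training/evaluation pipeline is used.
VARIANTS = {
    "full_m2repa":     dict(align=True,  decouple_reg=True,  experts="matched"),
    "no_decouple_reg": dict(align=True,  decouple_reg=False, experts="matched"),
    "single_expert":   dict(align=True,  decouple_reg=True,  experts="rgb_only"),
    "baseline":        dict(align=False, decouple_reg=False, experts=None),
}


def run_ablation(train_and_eval):
    """train_and_eval(config) -> metrics dict, e.g. {'FVD-8': ..., 'FVD-200': ...}."""
    results = {name: train_and_eval(cfg) for name, cfg in VARIANTS.items()}
    # The core claim is supported if 'full_m2repa' beats 'no_decouple_reg' and
    # 'single_expert' on both short- and long-horizon metrics; a tie would suggest
    # the decoupling term is not doing the work attributed to it.
    return results
```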

Figures

Figures reproduced from arXiv: 2605.01896 by Bin Xia, Dingkang Liang, Guangmo Yi, Jianlou Si, Jun Huang, Junyuan Xiao, Qiang Lyu, Shurui Shi, Tongtong Su, Wenming Yang, Xin Zhou, Yixuan Ye.

Figure 1
Figure 1. Visualization of M²-REPA. The features extracted from the backbones of DINOv2, DepthAnythingV2, and SAM2 exhibit pronounced differences, validating their distinct modality-specific characteristics. After applying our M²-REPA, the extracted diffusion features align more closely with the rich semantic information from the foundation models. … view at source ↗
Figure 2
Figure 2. Overview of M²-REPA. To mitigate feature conflicts, M²-REPA decouples multi-modal features into modality-specific representations and aligns them with corresponding expert foundation models. The framework is optimized by two synergistic objectives: (1) a cosine similarity-based multi-modal alignment loss for joint representation alignment, and (2) a CKA [23] similarity-driven modality-specific decoupling… view at source ↗
Figure 3
Figure 3. Qualitative comparison of camera view-conditioned video generation under full-circle rotation. Videos are generated from an input frame and corresponding per-frame camera poses simulating a full 360° rotation. … view at source ↗
Figure 4
Figure 4. Qualitative comparison in ablation study. Under the 200-frame long video setting, we compare our method (M²-REPA) against baseline and REPA methods. … view at source ↗
Figure 5
Figure 5. Ablation on alignment layer depth. FVD-200 and FVD-8 scores across different alignment layers of the diffusion backbone. … view at source ↗
Figure 6
Figure 6. Illustration of M²-REPA based on the DiT backbone. We retain the original RGB projector and design new output projectors for depth and mask. … view at source ↗
Figure 7
Figure 7. Qualitative comparison of long-horizon action-conditioned video generation (150 frames) in the Minecraft environment. Each video is generated from an initial frame observation and the corresponding per-frame action sequence. Our M²-REPA maintains stable scene coherence and temporal consistency throughout the generation horizon. †D+S+DA denotes the naive combination of DINOv2, SAM2, and DepthAnythingV2 [… view at source ↗
read the original abstract

Emerging multi-modal world models attempt to jointly generate videos across diverse modalities (e.g., RGB, depth, and mask), yet they fail to fully exploit the rich priors of existing foundation models. We propose $M^2$-REPA, the first representation alignment method tailored for multi-modal video generation. Our key insight is that foundation models trained on different modality spaces naturally capture distinct domain-specific priors, acting as complementary "experts." Specifically, we first decouple modality-specific features from the diffusion model's intermediate representations, then align each with its corresponding expert foundation model. To this end, we design two synergistic objectives: a multi-modal representation alignment loss that enforces feature-to-expert matching, and a modality-specific decoupling regularization that encourages complementarity across different modalities. This design enables joint optimization, fully exploiting priors from multiple foundation models. Extensive experiments demonstrate that our method significantly outperforms baselines in visual quality and long-term consistency.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes M²-REPA, a decoupled representation alignment method for multimodal world models in video generation. It decouples modality-specific features from the diffusion model's intermediate representations, then aligns each with its corresponding expert foundation model via a multi-modal representation alignment loss and a modality-specific decoupling regularization. The design is claimed to enable joint optimization that fully exploits complementary priors from multiple foundation models, with experiments showing significant gains in visual quality and long-term consistency over baselines.

Significance. If the central claims hold, this work could meaningfully advance multimodal video generation by offering a practical way to integrate diverse foundation-model priors without optimization conflicts. The decoupling regularization addresses a recognized challenge in joint multi-expert training and, if shown to preserve complementarity, would provide a reusable template for other multimodal synthesis tasks.

major comments (2)
  1. [Method] Method section (description of the two synergistic objectives): the modality-specific decoupling regularization is presented as encouraging complementarity across modalities, yet no analysis, bound, or ablation demonstrates that the term separates features rather than functioning as an orthogonality penalty that could discard cross-modal information or induce gradient conflicts during simultaneous expert optimization. This directly underpins the claim that joint optimization is enabled without new conflicts.
  2. [Experiments] Experiments section: performance improvements in visual quality and long-term consistency are reported, but the manuscript provides no controls comparing the full M²-REPA pipeline against independent per-modality alignments (i.e., the same alignment losses without the decoupling term). Without such isolation, it remains unclear whether the reported gains require the proposed decoupling or could be obtained by simpler joint training.
minor comments (2)
  1. [Abstract] The abstract states that foundation models 'naturally capture distinct domain-specific priors, acting as complementary experts,' but this assumption is not accompanied by a reference or preliminary measurement of prior complementarity on the target modalities.
  2. [Method] Notation for the two loss terms and the decoupling regularizer should be introduced with explicit equations early in the method section to allow readers to verify the claimed synergy (one plausible form is sketched below).
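For illustration, one plausible form of those equations, inferred from the Figure 2 caption (a cosine-similarity alignment loss and a CKA-driven decoupling term) rather than taken from the manuscript, is:

```latex
% Hedged notation: z_m = decoupled diffusion features for modality m, e_m = the
% matching frozen expert features, M = the modality set, N = tokens per sample.
\mathcal{L}_{\mathrm{align}}
  = -\frac{1}{|M|}\sum_{m\in M}\frac{1}{N}\sum_{n=1}^{N}
      \cos\bigl(z_m^{(n)},\, e_m^{(n)}\bigr),
\qquad
\mathcal{L}_{\mathrm{dec}}
  = \frac{2}{|M|(|M|-1)}\sum_{m<m'} \mathrm{CKA}\bigl(z_m,\, z_{m'}\bigr),
\qquad
\mathcal{L}
  = \mathcal{L}_{\mathrm{diff}}
  + \lambda_{\mathrm{align}}\,\mathcal{L}_{\mathrm{align}}
  + \lambda_{\mathrm{dec}}\,\mathcal{L}_{\mathrm{dec}}.
```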

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and outline the revisions we will make to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Method] Method section (description of the two synergistic objectives): the modality-specific decoupling regularization is presented as encouraging complementarity across modalities, yet no analysis, bound, or ablation demonstrates that the term separates features rather than functioning as an orthogonality penalty that could discard cross-modal information or induce gradient conflicts during simultaneous expert optimization. This directly underpins the claim that joint optimization is enabled without new conflicts.

    Authors: We acknowledge that the original manuscript does not include a formal analysis, bound, or dedicated ablation isolating the decoupling regularizer's effect on feature separation versus potential information loss or gradient issues. To address this, we will revise the method section to add a short theoretical motivation: the regularizer is applied after explicit modality-specific feature extraction (via the decoupling process), encouraging orthogonality only within those subspaces while the alignment loss preserves any necessary cross-modal complementarity. We will also add an ablation study reporting inter-modality feature correlations and per-expert gradient statistics with and without the term, demonstrating that it reduces conflicts without discarding useful shared information (a minimal sketch of these diagnostics follows after these responses). These changes will directly support the joint-optimization claim. revision: yes

  2. Referee: [Experiments] Experiments section: performance improvements in visual quality and long-term consistency are reported, but the manuscript provides no controls comparing the full M²-REPA pipeline against independent per-modality alignments (i.e., the same alignment losses without the decoupling term). Without such isolation, it remains unclear whether the reported gains require the proposed decoupling or could be obtained by simpler joint training.

    Authors: We agree that the current experiments do not isolate the decoupling term's contribution via a direct control against independent per-modality alignments. The reported baselines omit alignment entirely, leaving open whether simpler joint training suffices. In the revision we will add new results training separate per-modality models using only the alignment losses (no decoupling regularizer) and compare them quantitatively to the full joint M²-REPA model on the same metrics. These controls will show that independent alignments improve over non-aligned baselines but still underperform the joint model in long-term consistency and visual quality, thereby confirming that the decoupling term is necessary for effective multi-expert integration. revision: yes
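A minimal sketch of the two diagnostics proposed in the first response above: linear CKA (Kornblith et al. [23]) for the inter-modality correlation check, and a pairwise gradient-cosine statistic for conflict detection. Both are illustrative stand-ins; the revision's actual measurements are not specified in the excerpt.

```python
# Illustrative diagnostics only; the revised paper's actual measurements may differ.
import torch


def linear_cka(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Linear CKA between feature matrices x: (n, p) and y: (n, q) over the same n tokens."""
    x = x - x.mean(dim=0, keepdim=True)      # center each feature dimension
    y = y - y.mean(dim=0, keepdim=True)
    hsic = (y.t() @ x).pow(2).sum()          # ||Y^T X||_F^2
    return hsic / (torch.linalg.norm(x.t() @ x) * torch.linalg.norm(y.t() @ y))


def grad_conflict(grads_a, grads_b):
    """Cosine between two flattened gradient lists (e.g. from two per-expert
    alignment losses); strongly negative values indicate conflicting updates."""
    va = torch.cat([g.flatten() for g in grads_a])
    vb = torch.cat([g.flatten() for g in grads_b])
    return torch.dot(va, vb) / (va.norm() * vb.norm() + 1e-12)


# Example: redundancy between two decoupled branches (should drop when the
# decoupling regularizer is enabled), using hypothetical feature shapes.
z_depth, z_mask = torch.randn(512, 768), torch.randn(512, 256)
print(float(linear_cka(z_depth, z_mask)))
```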

Circularity Check

0 steps flagged

No circularity: method design is independent of its claimed outputs

full rationale

The provided manuscript text introduces M²-REPA via a new decoupling step on diffusion features followed by two explicitly designed losses (alignment + modality-specific regularization). No equations appear that equate any reported gain or complementarity to a fitted hyperparameter or prior result by construction. No self-citations are invoked as load-bearing uniqueness theorems, and the central premise (joint optimization exploiting complementary priors) is not reduced to a renaming or ansatz smuggled from the authors' own prior work. The derivation chain therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the unproven premise that separate foundation models act as complementary experts and that the new regularization term will enforce useful complementarity; no free parameters or invented entities are quantified in the abstract.

axioms (1)
  • domain assumption Foundation models trained on different modality spaces naturally capture distinct domain-specific priors that are complementary.
    Stated in the abstract as the key insight enabling the method.
invented entities (1)
  • modality-specific decoupling regularization · no independent evidence
    purpose: Encourages complementarity across modalities during joint optimization.
    New term introduced to prevent interference between aligned features.

pith-pipeline@v0.9.0 · 5490 in / 1302 out tokens · 22976 ms · 2026-05-10T15:47:58.524895+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

66 extracted references · 41 canonical work pages · 18 internal anchors

  1. [1]

    Proceedings of the 3rd International Workshop on Rich Media With Generative AI, ACM (2025) 1

    Agarwal, N., Ali, A., Bala, M., Balaji, Y., Barker, E., Cai, T., Chattopadhyay, P., Chen, Y., Cui, Y., Ding, Y., et al.: Cosmos world foundation model platform for physical ai. Proceedings of the 3rd International Workshop on Rich Media With Generative AI, ACM (2025) 1

  2. [2]

    V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

    Assran, M., Bardes, A., Fan, D., Garrido, Q., Howes, R., Muckley, M., Rizvi, A., Roberts, C., Sinha, K., Zholus, A., et al.: V-jepa 2: Self-supervised video models enable understanding, prediction and planning. arXiv preprint arXiv:2506.09985 (2025) 4

  3. [3]

    Advances in Neural Information Processing Systems37, 24081–24125 (2024) 4, 5

    Chen, B., Martí Monsó, D., Du, Y., Simchowitz, M., Tedrake, R., Sitzmann, V.: Diffusion forcing: Next-token prediction meets full-sequence diffusion. Advances in Neural Information Processing Systems37, 24081–24125 (2024) 4, 5

  4. [4]

    arXiv preprint arXiv:2506.01103 (2025) 4

    Chen, J., Zhu, H., He, X., Wang, Y., Zhou, J., Chang, W., Zhou, Y., Li, Z., Fu, Z., Pang, J., et al.: Deepverse: 4d autoregressive video generation as a world model. arXiv preprint arXiv:2506.01103 (2025) 4

  5. [5]

    In: Proceedings of the Computer Vision and Pattern Recognition Conference

    Chen, S., Guo, H., Zhu, S., Zhang, F., Huang, Z., Feng, J., Kang, B.: Video depth anything: Consistent depth estimation for super-long videos. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 22831–22840 (2025) 15

  6. [6]

    4dnex: Feed-forward 4d generative modeling made easy.arXiv preprint arXiv:2508.13154, 2025

    Chen, Z., Liu, T., Zhuo, L., Ren, J., Tao, Z., Zhu, H., Hong, F., Pan, L., Liu, Z.: 4dnex: Feed-forward 4d generative modeling made easy. arXiv preprint arXiv:2508.13154 (2025) 2, 4

  7. [7]

    Cherian, A., Corcodel, R., Jain, S., Romeres, D.: Llmphy: Complex physical reasoning using large language models and world models. arXiv preprint arXiv:2411.08027 (2024) 1

  8. [8]

    URL: https://oasis-model.github.io

    Decart, E., McIntyre, Q., Campbell, S., Chen, X., Wachen, R.: Oasis: A universe in a transformer. URL: https://oasis-model.github.io (2024) 4

  9. [9]

    Everingham, M., Van Gool, L., Williams, C.K., Winn, J., Zisserman, A.: The pascal visual object classes (voc) challenge. International journal of computer vision 88(2), 303–338 (2010) 11

  10. [10]

    Advances in Neural Information Processing Systems37, 91560–91596 (2024) 1

    Gao, S., Yang, J., Chen, L., Chitta, K., Qiu, Y., Geiger, A., Zhang, J., Li, H.: Vista: A generalizable driving world model with high fidelity and versatile controllability. Advances in Neural Information Processing Systems37, 91560–91596 (2024) 1

  11. [11]

    World Models

    Ha, D., Schmidhuber, J.: World models. arXiv preprint arXiv:1803.10122 2(3) (2018) 1

  12. [12]

    Matrix-game 2.0: An open-source real-time and streaming interactive world model

    He, X., Peng, C., Liu, Z., Wang, B., Zhang, Y., Cui, Q., Kang, F., Jiang, B., An, M., Ren, Y., et al.: Matrix-game 2.0: An open-source real-time and streaming interactive world model. arXiv preprint arXiv:2508.13009 (2025) 1, 4

  13. [13]

    Advances in neural information processing systems33, 6840–6851 (2020) 1

    Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. Advances in neural information processing systems33, 6840–6851 (2020) 1

  14. [14]

    In: International Conference on Machine Learning

    Hoogeboom, E., Heek, J., Salimans, T.: simple diffusion: End-to-end diffusion for high resolution images. In: International Conference on Machine Learning. pp. 13213–13232. PMLR (2023) 8, 11, 15

  15. [15]

    GAIA-1: A Generative World Model for Autonomous Driving

    Hu, A., Russell, L., Yeo, H., Murez, Z., Fedoseev, G., Kendall, A., Shotton, J., Corrado, G.: Gaia-1: A generative world model for autonomous driving. arXiv preprint arXiv:2309.17080 (2023) 1

  16. [16]

    In: Proceedings of the Computer Vision and Pattern Recognition Conference

    Hu, W., Gao, X., Li, X., Zhao, S., Cun, X., Zhang, Y., Quan, L., Shan, Y.: Depthcrafter: Generating consistent long depth sequences for open-world videos. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 2005–2015 (2025) 11

  17. [17]

    ACM Transactions on Graphics (TOG)44(6), 1–15 (2025) 2, 4

    Huang, T., Zheng, W., Wang, T., Liu, Y., Wang, Z., Wu, J., Jiang, J., Li, H., Lau, R., Zuo, W., et al.: Voyager: Long-range and world-consistent video diffusion for explorable 3d scene generation. ACM Transactions on Graphics (TOG)44(6), 1–15 (2025) 2, 4

  18. [18]

    Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion

    Huang, X., Li, Z., He, G., Zhou, M., Shechtman, E.: Self forcing: Bridging the train-test gap in autoregressive video diffusion. arXiv preprint arXiv:2506.08009 (2025) 4

  19. [19]

    arXiv preprint arXiv:2506.09229 (2025) 2, 4, 7, 8

    Hwang, S., Jang, H., Kim, K., Park, M., Choo, J.: Cross-frame representation alignment for fine-tuning video diffusion models. arXiv preprint arXiv:2506.09229 (2025) 2, 4, 7, 8

  20. [20]

    arXiv preprint arXiv:2412.11673 (2024) 4

    Karypidis, E., Kakogeorgiou, I., Gidaris, S., Komodakis, N.: Dino-foresight: Look- ing into the future with dino. arXiv preprint arXiv:2412.11673 (2024) 4

  21. [21]

    Advances in Neural Information Processing Systems37, 89834–89868 (2024) 4

    Kim, J., Kang, J., Choi, J., Han, B.: Fifo-diffusion: Generating infinite videos from text without training. Advances in Neural Information Processing Systems37, 89834–89868 (2024) 4

  22. [22]

    HunyuanVideo: A Systematic Framework For Large Video Generative Models

    Kong, W., Tian, Q., Zhang, Z., Min, R., Dai, Z., Zhou, J., Xiong, J., Li, X., Wu, B., Zhang, J., et al.: Hunyuanvideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603 (2024) 1

  23. [23]

    In: International conference on machine learning

    Kornblith, S., Norouzi, M., Lee, H., Hinton, G.: Similarity of neural network representations revisited. In: International conference on machine learning. pp. 3519–. PMLR (2019) 3, 6, 9, 14, 25

  25. [25]

    Flow Matching for Generative Modeling

    Lipman, Y., Chen, R.T., Ben-Hamu, H., Nickel, M., Le, M.: Flow matching for generative modeling. arXiv preprint arXiv:2210.02747 (2022) 5

  26. [26]

    Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow

    Liu, X., Gong, C., Liu, Q.: Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003 (2022) 5

  27. [27]

    arXiv preprint arXiv:2508.15720 (2025) 2, 4

    Liu, Z., Deng, X., Chen, S., Wang, A., Guo, Q., Han, M., Xue, Z., Chen, M., Luo, P., Yang, L.: Worldweaver: Generating long-horizon video worlds via rich perception. arXiv preprint arXiv:2508.15720 (2025) 2, 4

  28. [28]

    arXiv preprint arXiv:2510.03104 (2025) 2, 4

    Mei, Z., Shorinwa, O., Majumdar, A.: Geometry meets vision: Revisiting pretrained semantics in distilled fields. arXiv preprint arXiv:2510.03104 (2025) 2, 4

  29. [29]

    DINOv2: Learning Robust Visual Features without Supervision

    Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., et al.: Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193 (2023) 2, 3, 4, 6, 7, 9, 10, 11, 12, 15

  30. [30]

    URL: https://deepmind.google/blog/genie-3-a-new-frontier-for-world-models/ (2025) 1, 4

    Parker-Holder, J., Fruchter, S., et al.: Genie 3: A new frontier for world models. URL: https://deepmind.google/blog/genie-3-a-new-frontier-for-world-models/ (2025) 1, 4

  31. [31]

    Peebles, W., Xie, S.: Scalable diffusion models with transformers. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 4195–4205 (2023) 11, 13, 21

  32. [32]

    Wristworld: Generating wrist-views via 4d world models for robotic manipulation.CoRR, abs/2510.07313, 2025

    Qian, Z., Chi, X., Li, Y., Wang, S., Qin, Z., Ju, X., Han, S., Zhang, S.: Wristworld: Generating wrist-views via 4d world models for robotic manipulation. arXiv preprint arXiv:2510.07313 (2025) 2, 4

  33. [33]

    Worldsimbench: Towards video generation models as world simulators, 2024

    Qin, Y., Shi, Z., Yu, J., Wang, X., Zhou, E., Li, L., Yin, Z., Liu, X., Sheng, L., Shao, J., et al.: Worldsimbench: Towards video generation models as world simulators. arXiv preprint arXiv:2410.18072 (2024) 1

  34. [34]

    In: International conference on machine learning

    Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International conference on machine learning. pp. 8748–8763. PmLR (2021) 2, 4, 6

  35. [35]

    SAM 2: Segment Anything in Images and Videos

    Ravi, N., Gabeur, V., Hu, Y.T., Hu, R., Ryali, C., Ma, T., Khedr, H., Rädle, R., Rolland, C., Gustafson, L., et al.: Sam 2: Segment anything in images and videos. arXiv preprint arXiv:2408.00714 (2024) 2, 4, 7, 9, 10, 11, 12

  36. [36]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 10684–10695 (2022) 11, 21

  37. [37]

    Gaia-2: A controllable multi-view generative world model for autonomous driving.arXiv preprint arXiv:2503.20523, 2025

    Russell, L., Hu, A., Bertoni, L., Fedoseev, G., Shotton, J., Arani, E., Corrado, G.: Gaia-2: A controllable multi-view generative world model for autonomous driving. arXiv preprint arXiv:2503.20523 (2025) 1

  38. [38]

    DINOv3

    Siméoni, O., Vo, H.V., Seitzer, M., Baldassarre, F., Oquab, M., Jose, C., Khalidov, V., Szafraniec, M., Yi, S., Ramamonjisoa, M., et al.: Dinov3. arXiv preprint arXiv:2508.10104 (2025) 4

  39. [39]

    arXiv preprint arXiv:2502.06764 (2025)

    Song, K., Chen, B., Simchowitz, M., Du, Y., Tedrake, R., Sitzmann, V.: History-guided video diffusion. arXiv preprint arXiv:2502.06764 (2025) 4, 7, 10, 12, 21

  40. [40]

    U-repa: Aligning diffusion u-nets to vits. arXiv preprint arXiv:2503.18414, 2025

    Tian, Y., Chen, H., Zheng, M., Liang, Y., Xu, C., Wang, Y.: U-repa: Aligning diffusion u-nets to vits. arXiv preprint arXiv:2503.18414 (2025) 2, 7, 15

  41. [41]

    Towards Accurate Generative Models of Video: A New Metric & Challenges

    Unterthiner, T., Van Steenkiste, S., Kurach, K., Marinier, R., Michalski, M., Gelly, S.: Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:1812.01717 (2018) 11

  42. [42]

    Diffusion models are real- time game engines.arXiv preprint arXiv:2408.14837, 2024

    Valevski, D., Leviathan, Y., Arar, M., Fruchter, S.: Diffusion models are real-time game engines. arXiv preprint arXiv:2408.14837 (2024) 1, 4

  43. [43]

    Wan: Open and Advanced Large-Scale Video Generative Models

    Wan, T., Wang, A., Ai, B., Wen, B., Mao, C., Xie, C.W., Chen, D., Yu, F., Zhao, H., Yang, J., et al.: Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314 (2025) 1

  44. [44]

    Spatialvid: A large-scale video dataset with spatial annotations.arXiv preprint arXiv:2509.09676,

    Wang, J., Yuan, Y., Zheng, R., Lin, Y., Gao, J., Chen, L.Z., Bao, Y., Zhang, Y., Zeng, C., Zhou, Y., et al.: Spatialvid: A large-scale video dataset with spatial annotations. arXiv preprint arXiv:2509.09676 (2025) 4

  45. [45]

    In: Proceedings of the Computer Vision and Pattern Recognition Conference

    Wang, J., Chen, M., Karaev, N., Vedaldi, A., Rupprecht, C., Novotny, D.: Vggt: Visual geometry grounded transformer. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 5294–5306 (2025) 2, 4, 6

  46. [46]

    IEEE transactions on image processing 13(4), 600–612 (2004) 11

    Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P.: Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing 13(4), 600–612 (2004) 11

  47. [47]

    Geometry Forcing: Marrying Video Diffusion and 3D Representation for Consistent World Modeling

    Wu, H., Wu, D., He, T., Guo, J., Ye, Y., Duan, Y., Bian, J.: Geometry forcing: Marrying video diffusion and 3d representation for consistent world modeling. arXiv preprint arXiv:2507.07982 (2025) 2, 4, 6, 7, 10, 12

  48. [48]

    arXiv preprint arXiv:2504.12369 (2025)

    Xiao, Z., Lan, Y., Zhou, Y., Ouyang, W., Yang, S., Zeng, Y., Pan, X.: Worldmem: Long-term consistent world simulation with memory. arXiv preprint arXiv:2504.12369 (2025) 4

  49. [49]

    Pixel-perfect depth with semantics-prompted diffusion transformers.arXiv preprint arXiv:2510.07316, 2025a

    Xu, G., Lin, H., Luo, H., Wang, X., Yao, J., Zhu, L., Pu, Y., Chi, C., Sun, H., Wang, B., et al.: Pixel-perfect depth with semantics-prompted diffusion transformers. arXiv preprint arXiv:2510.07316 (2025) 4, 6, 7

  50. [50]

    In: International Conference on Machine Learning

    Yan, W., Hafner, D., James, S., Abbeel, P.: Temporally consistent transformers for video generation. In: International Conference on Machine Learning. pp. 39062– 39098. PMLR (2023) 11, 13, 21

  51. [51]

    Advances in Neural Information Processing Systems37, 21875–21911 (2024) 2, 3, 4, 7, 9, 10, 11, 12, 15

    Yang, L., Kang, B., Huang, Z., Zhao, Z., Xu, X., Feng, J., Zhao, H.: Depth anything v2. Advances in Neural Information Processing Systems37, 21875–21911 (2024) 2, 3, 4, 7, 9, 10, 11, 12, 15

  52. [52]

    Learning interactive real-world simulators.arXiv preprint arXiv:2310.06114, 2023

    Yang, M., Du, Y., Ghasemipour, K., Tompson, J., Schuurmans, D., Abbeel, P.: Learning interactive real-world simulators. arXiv preprint arXiv:2310.06114 1(2), 6 (2023) 1

  53. [53]

    Video as the new language for real-world decision making.arXiv preprint arXiv:2402.17139, 2024

    Yang, S., Walker, J., Parker-Holder, J., Du, Y., Bruce, J., Barreto, A., Abbeel, P., Schuurmans, D.: Video as the new language for real-world decision making. arXiv preprint arXiv:2402.17139 (2024) 1

  54. [54]

    CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

    Yang, Z., Teng, J., Zheng, W., Ding, M., Huang, S., Xu, J., Yang, Y., Hong, W., Zhang, X., Feng, G., et al.: Cogvideox: Text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072 (2024) 1

  55. [55]

    In: RSS 2025 Workshop: Mobile Manipulation: Emerging Opportunities & Contemporary Challenges (2025) 4

    Yin, T., Mei, Z., Sun, T., Zha, L., Zhou, E., Bao, J., Yamane, M., Sho, O., Majumdar, A.: Womap: World models for embodied open-vocabulary object localization. In: RSS 2025 Workshop: Mobile Manipulation: Emerging Opportunities & Contemporary Challenges (2025) 4

  56. [56]

    In: CVPR (2025) 4

    Yin, T., Zhang, Q., Zhang, R., Freeman, W.T., Durand, F., Shechtman, E., Huang, X.: From slow bidirectional to fast autoregressive video diffusion models. In: CVPR (2025) 4

  57. [57]

    arXiv preprint arXiv:2509.07979 (2025) 4

    Yoon, H., Jung, J., Kim, J., Choi, H., Shin, H., Lim, S., An, H., Kim, C., Han, J., Kim, D., et al.: Visual representation alignment for multimodal large language models. arXiv preprint arXiv:2509.07979 (2025) 2, 4

  58. [58]

    Gamefactory: Creating new games with generative interactive videos. arXiv preprint arXiv:2501.08325,

    Yu, J., Qin, Y., Wang, X., Wan, P., Zhang, D., Liu, X.: Gamefactory: Creating new games with generative interactive videos. arXiv preprint arXiv:2501.08325 (2025) 1

  59. [59]

    Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think

    Yu, S., Kwak, S., Jang, H., Jeong, J., Huang, J., Shin, J., Xie, S.: Representation alignment for generation: Training diffusion transformers is easier than you think. arXiv preprint arXiv:2410.06940 (2024) 2, 4, 6, 7, 15, 21, 23

  60. [60]

    In: Proceedings of the IEEE conference on computer vision and pattern recognition

    Zhang, R., Isola, P., Efros, A.A., Shechtman, E., Wang, O.: The unreasonable effectiveness of deep features as a perceptual metric. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 586–595 (2018) 11

  61. [61]

    arXiv preprint arXiv:2505.23656 (2025) 2, 4, 6, 7, 8

    Zhang, X., Liao, J., Zhang, S., Meng, F., Wan, X., Yan, J., Cheng, Y.: Videorepa: Learning physics for video generation through relational alignment with foundation models. arXiv preprint arXiv:2505.23656 (2025) 2, 4, 6, 7, 8

  62. [62]

    Tesseract: Learning 4d embodied world models, 2025

    Zhen, H., Sun, Q., Zhang, H., Li, J., Zhou, S., Du, Y., Gan, C.: Tesseract: learning 4d embodied world models. arXiv preprint arXiv:2504.20995 (2025) 1, 2, 4, 21

  63. [63]

    Zhou, T., Tucker, R., Flynn, J., Fyffe, G., Snavely, N.: Stereo magnification: Learning view synthesis using multiplane images. arXiv preprint arXiv:1805.09817 (2018) 4, 10, 11, 12

  64. [64]

    Omniworld: A multi-domain and multi-modal dataset for 4d world modeling.arXiv preprint arXiv:2509.12201, 2025

    Zhou, Y., Wang, Y., Zhou, J., Chang, W., Guo, H., Li, Z., Ma, K., Li, X., Wang, Y., Zhu, H., et al.: Omniworld: A multi-domain and multi-modal dataset for 4d world modeling. arXiv preprint arXiv:2509.12201 (2025) 4

  65. [65]

    In: Proceedings of the IEEE/CVF International Conference on Computer Vision

    Zhu, H., Wang, Y., Zhou, J., Chang, W., Zhou, Y., Li, Z., Chen, J., Shen, C., Pang, J., He, T.: Aether: Geometric-aware unified world modeling. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 8535–8546 (2025) 2, 4

  66. [66]

    Zhu, Z., Wang, X., Zhao, W., Min, C., Deng, N., Dou, M., Wang, Y., Shi, B., Wang, K., Zhang, C., et al.: Is sora a world simulator? a comprehensive survey on general world models and beyond. arXiv preprint arXiv:2405.03520 (2024) 1