pith. machine review for the scientific record.

arxiv: 2605.07390 · v1 · submitted 2026-05-08 · 💻 cs.CV

Recognition: 2 theorem links


ST-Gen4D: Embedding 4D Spatiotemporal Cognition into World Model for 4D Generation

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 01:52 UTC · model grok-4.3

classification 💻 cs.CV
keywords 4D generation · spatiotemporal cognition · world model · graph fusion · 4D Gaussians · latent diffusion · physical consistency · topological consistency

The pith

A world model guided by fused 4D cognition graphs produces generations with structural and topological consistency.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to show that 4D generation achieves better physical consistency when models explicitly build and use 4D spatiotemporal cognition instead of adding only global constraints. It constructs this cognition by turning multimodal features into a global appearance graph and a local dynamic graph, then fuses the two with semantic bridging before feeding the result into a world model that predicts future states to condition diffusion. A reader would care because current video and 3D generators often produce motions that look right yet violate local physical rules such as object continuity or interaction topology. The framework follows four steps: multimodal encoding, graph-based cognition, world-model reasoning, and conditioned 4D Gaussian generation. If correct, the approach supplies a concrete mechanism for embedding intrinsic spatiotemporal regularities directly into the generative process.
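
To make the four-step pipeline concrete, the sketch below wires the stages together in the order the paper describes: multimodal encoding, graph-based cognition, world-model reasoning, and conditioned generation. The specific module choices (attention layers standing in for graph construction and fusion, a GRU as the world model, a linear head producing the diffusion condition) and all tensor dimensions are illustrative assumptions, not details taken from the paper.

    # Hypothetical sketch of an ST-Gen4D-style pipeline. Stage names follow the
    # paper; every internal choice (shapes, attention-based fusion, a GRU world
    # model) is an assumption made for illustration only.
    import torch
    import torch.nn as nn

    class STGen4DSketch(nn.Module):
        def __init__(self, dim=256):
            super().__init__()
            # 1) Spatiotemporal representation: project each modality into a shared token space.
            self.encode_rgb  = nn.Linear(512, dim)   # e.g. frozen image features (assumed size)
            self.encode_geom = nn.Linear(384, dim)   # e.g. depth / point features (assumed size)
            self.encode_text = nn.Linear(768, dim)   # e.g. text-encoder features (assumed size)
            # 2) Spatiotemporal cognition: global appearance and local dynamic relations, then fusion.
            self.global_graph = nn.MultiheadAttention(dim, 4, batch_first=True)
            self.local_graph  = nn.MultiheadAttention(dim, 4, batch_first=True)
            self.fuse         = nn.MultiheadAttention(dim, 4, batch_first=True)
            # 3) Spatiotemporal reasoning: autoregressive world model over cognition tokens.
            self.world_model  = nn.GRU(dim, dim, batch_first=True)
            # 4) Spatiotemporal generation: the derived state becomes the diffusion condition.
            self.to_condition = nn.Linear(dim, dim)

        def forward(self, rgb_tok, geom_tok, text_tok):
            # Stage 1: multimodal encoding into a common feature basis.
            tokens = torch.cat([self.encode_rgb(rgb_tok),
                                self.encode_geom(geom_tok),
                                self.encode_text(text_tok)], dim=1)
            # Stage 2: build graph-like relations with self-attention, then fuse the
            # global and local views (the paper bridges them with semantic tokens).
            g, _ = self.global_graph(tokens, tokens, tokens)
            l, _ = self.local_graph(tokens, tokens, tokens)
            cognition, _ = self.fuse(g, l, l)
            # Stage 3: roll the cognition tokens forward to predict future states.
            future, _ = self.world_model(cognition)
            # Stage 4: map the predicted state to a conditioning signal for latent diffusion.
            return self.to_condition(future)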

Core claim

ST-Gen4D establishes that 4D spatiotemporal cognition, formed by sculpting multimodal representations into a global appearance graph and a local dynamic graph and fusing them via semantic-bridged spatiotemporal fusion, can be integrated into a world model to derive future states; these states condition latent diffusion to generate 4D Gaussians while guaranteeing structural rationality and topological consistency.

What carries the argument

The 4D cognition graph, created by fusing the global appearance graph and the local dynamic graph through semantic-bridged spatiotemporal fusion. This graph supplies the condition for world-model reasoning and guides the latent diffusion step.
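
One plausible reading of "semantic-bridged spatiotemporal fusion" is that shared semantic tokens query both graphs and the two views are then merged into the cognition graph's node features. The sketch below is that reading only; the attention-based bridging and the final linear merge are assumptions, not the paper's stated formulation.

    import torch
    import torch.nn as nn

    class SemanticBridgedFusion(nn.Module):
        """Hypothetical fusion: semantic tokens attend into each graph, results are merged."""
        def __init__(self, dim=256, heads=4):
            super().__init__()
            self.from_global = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.from_local  = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.merge       = nn.Linear(2 * dim, dim)

        def forward(self, semantic, global_nodes, local_nodes):
            # Semantic tokens pull appearance structure from the global graph...
            g, _ = self.from_global(semantic, global_nodes, global_nodes)
            # ...and dynamic topology from the local graph.
            l, _ = self.from_local(semantic, local_nodes, local_nodes)
            # The merged tokens serve as node features of the fused 4D cognition graph.
            return self.merge(torch.cat([g, l], dim=-1))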

Load-bearing premise

Sculpting multimodal features into global appearance and local dynamic graphs and fusing them produces a cognition representation that improves physical consistency beyond what standard generative priors already supply.

What would settle it

An ablation that removes the graph construction and fusion steps yet yields equal or higher scores on metrics of local dynamic topology and physical plausibility on the ST-4D dataset would falsify the central claim.
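
A minimal harness for that falsification test might look like the following. Every callable it receives (build_model with a use_cognition_graph flag, topology_score, plausibility_score) is a hypothetical stand-in for whatever the released code and the ST-4D evaluation protocol actually provide.

    # Hypothetical ablation harness; the callables passed in are placeholders,
    # not APIs from the paper or its codebase.
    def falsification_test(dataset, build_model, topology_score, plausibility_score):
        """Return True if removing graph construction and fusion matches or beats
        the full model on both metrics, which would falsify the central claim."""
        full    = build_model(use_cognition_graph=True)
        ablated = build_model(use_cognition_graph=False)
        full_scores    = (topology_score(full, dataset), plausibility_score(full, dataset))
        ablated_scores = (topology_score(ablated, dataset), plausibility_score(ablated, dataset))
        return all(a >= f for a, f in zip(ablated_scores, full_scores))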

Figures

Figures reproduced from arXiv: 2605.07390 by Hanyu Zhou, Haonan Wang, Luxin Yan, Tao Gu.

Figure 1. Paradigm shift of 4D generation. 2D-priors-guided generation mainly relies on 2D video diffusion, leading to collapsed appearance. Extrinsic-constraints-guided generation introduces macro-scale constraints but suffers from motion artifacts. Our intrinsic 4D cognition-guided paradigm leverages spatiotemporal cognition to ensure plausible appearance and coherent dynamics.
Figure 2. Overview of ST-Gen4D. The model first obtains multiple cognition tokens, which are then encoded into a structured spatiotemporal cognition graph. An autoregressive world model then predicts state transitions based on the cognition and action tokens. Finally, the deduced state serves as the condition of latent diffusion to generate 4D Gaussians.
Figure 3. Cross-view and cross-time consistency analysis.
Figure 4. Feature robustness and discriminability analysis.
Figure 5. Architectural details of the spatiotemporal structured cognition. (a, b) Construction pipelines for the semantic, global, and local graphs from foundational representations. (c) Details of the fusion process between the global and local graphs, guided by the common semantic graph.
Figure 6. Composition and construction pipeline of ST-4D.
Figure 7. Visualization of text-to-3D generation on the T3Bench dataset.
Figure 8. Visualization of image-to-3D generation on the 3D-Front dataset.
Figure 9. Visualization of text-to-4D generation on the ST-4D dataset.
Figure 10. Visualization of image-to-4D generation on the DAVIS dataset.
Figure 11. Ablation of separate global and local graphs. The visualization reveals that ablating either the global or the local graph leads to obvious appearance degradation or motion artifacts, and the performance metrics decline significantly when either branch is removed.
Original abstract

Generative models have achieved success in producing apparently coherent 2D videos, but remain challenging in the physical world due to lack of 4D spatiotemporal scale. Typically, existing 4D generative models directly embed macro scale constraints to enhance overall spatiotemporal consistency. However, these methods only ensure global appearance coherence and fail to reveal the local dynamics of the physical world. Our insight is that global appearance structure and local dynamic topology empower 4D spatiotemporal cognition, thereby enabling 4D generation with spatiotemporal regularities. In this work, we propose ST-Gen4D, a 4D generation framework with 4D spatiotemporal cognition-based world model. Our model is guided by four key designs: 1) Spatiotemporal representation. We encode various modalities into multiple representations as a feature basis. 2) Spatiotemporal cognition. We sculpture these representations into global appearance graph and local dynamic graph, and fuse them via semantic-bridged spatiotemporal fusion to obtain a 4D cognition graph. 3) Spatiotemporal reasoning. We utilize a world model to derive future state based on the 4D cognition. 4) Spatiotemporal generation. We leverage the derived cognition as condition to guide latent diffusion for 4D Gaussian generation. By deeply integrating 4D intrinsic cognition with generative priors, our model guarantees the structural rationality and topological consistency of 4D generation. Moreover, we propose ST-4D datasets by aggregating public 4D datasets and self-built subset. Extensive experiments demonstrate the superiority of our ST-Gen4D across 3D and 4D generation tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes ST-Gen4D, a 4D generation framework that embeds spatiotemporal cognition into a world model. It encodes multimodal inputs into representations, sculpts them into a global appearance graph and local dynamic graph, fuses them via semantic-bridged spatiotemporal fusion to form a 4D cognition graph, uses a world model to derive future states from this graph, and conditions latent diffusion on the derived cognition to generate 4D Gaussians. The authors introduce the ST-4D dataset by aggregating public sources and self-collected data, and claim that this integration guarantees structural rationality and topological consistency while demonstrating superiority on 3D and 4D generation tasks.

Significance. If the quantitative results and ablations hold, the work could advance 4D generative modeling by explicitly structuring multimodal features into cognition graphs that improve local dynamic consistency beyond standard generative priors. The ST-4D dataset aggregation is a concrete community contribution. However, the significance is limited by the absence of explicit mechanisms (e.g., topological regularizers or physics-informed losses) that would differentiate the claimed guarantee from learned-prior behavior.
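
To illustrate the kind of explicit mechanism the report is asking for, the sketch below computes a topological-consistency score over per-frame nearest-neighbor graphs built on Gaussian centers. Everything here (the k-NN graph, the overlap score, its use as a metric or, after a soft relaxation, as a regularizer) is the reviewer's hypothetical, not a component of ST-Gen4D.

    import torch

    def knn_edges(centers, k=8):
        # Hypothetical helper: boolean k-nearest-neighbor adjacency over (N, 3) Gaussian centers.
        dist = torch.cdist(centers, centers)          # pairwise distances
        dist.fill_diagonal_(float("inf"))             # exclude self-edges
        idx = dist.topk(k, largest=False).indices     # (N, k) neighbor indices
        adj = torch.zeros_like(dist)
        adj.scatter_(1, idx, 1.0)
        return (adj + adj.T) > 0                      # symmetrized adjacency

    def topology_consistency(centers_t, centers_t1, k=8):
        # Overlap of neighbor graphs between consecutive frames: 1.0 means the local
        # topology is unchanged. Hard neighbor selection makes this a metric rather
        # than a differentiable loss; a soft, distance-weighted relaxation would be
        # needed to use it as a training regularizer.
        a, b = knn_edges(centers_t, k), knn_edges(centers_t1, k)
        intersection = (a & b).sum().float()
        union = (a | b).sum().float().clamp(min=1.0)
        return intersection / union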

major comments (3)
  1. [Abstract] The central claim that 'deeply integrating 4D intrinsic cognition with generative priors... guarantees the structural rationality and topological consistency' is load-bearing yet unsupported by any described enforcement mechanism. The semantic-bridged spatiotemporal fusion and world-model derivation steps are described only at a high level, without a regularizer, constraint, or loss term that would enforce topology or rationality beyond what the underlying diffusion prior already provides.
  2. [Method] In the spatiotemporal cognition and reasoning sections, no explicit formulation is given for how the 4D cognition graph is used to derive future states in the world model or to condition the latent diffusion; without an equation or pseudocode for the conditioning operator, or any topology-preserving term, it is unclear whether the graph fusion produces measurable gains over standard priors.
  3. [Experiments] The abstract asserts that 'extensive experiments demonstrate the superiority' across 3D and 4D tasks, yet supplies no quantitative metrics, baseline comparisons, or ablation isolating the contribution of the global/local graph fusion versus the world model alone. This absence prevents verification that the cognition component is the source of any reported improvement in physical consistency.
minor comments (2)
  1. [Method] Notation for the global appearance graph and local dynamic graph is introduced without a clear diagram or formal definition of nodes/edges, making it difficult to reproduce the sculpting and fusion steps.
  2. [Datasets] The ST-4D dataset construction is mentioned but lacks details on train/test splits, annotation protocols, or licensing, which should be added for reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important areas for improving clarity around our claims, formulations, and experimental evidence. We address each major comment below and indicate planned revisions.

Point-by-point responses
  1. Referee: [Abstract] The central claim that 'deeply integrating 4D intrinsic cognition with generative priors... guarantees the structural rationality and topological consistency' is load-bearing yet unsupported by any described enforcement mechanism. The semantic-bridged spatiotemporal fusion and world-model derivation steps are described only at a high level, without a regularizer, constraint, or loss term that would enforce topology or rationality beyond what the underlying diffusion prior already provides.

    Authors: We acknowledge that the abstract's phrasing of 'guarantees' is strong and could be interpreted as implying an explicit enforcement mechanism such as a dedicated regularizer. The manuscript relies on the structured 4D cognition graph (formed via semantic-bridged fusion of global appearance and local dynamic graphs) and the world model's state derivation to promote consistency through representation design rather than an additional loss term. We agree this distinction should be clearer. In the revision we will reword the abstract to state that the integration 'enhances' structural rationality and topological consistency, and we will add a brief reference to the design choices in Sections 3.2 and 3.3. revision: yes

  2. Referee: [Method] In the spatiotemporal cognition and reasoning sections, no explicit formulation is given for how the 4D cognition graph is used to derive future states in the world model or to condition the latent diffusion; without an equation or pseudocode for the conditioning operator, or any topology-preserving term, it is unclear whether the graph fusion produces measurable gains over standard priors.

    Authors: The full manuscript describes the world model in Section 3.3 as using the 4D cognition graph to derive future states via graph-based propagation, and the generation step in Section 3.4 conditions the latent diffusion through feature injection. However, we agree that the absence of explicit equations or pseudocode makes the conditioning operator and any implicit topology preservation difficult to verify. We will add the mathematical formulation for state derivation (e.g., the GNN update rule on the cognition graph) and the cross-attention-based conditioning operator in the revised method section, along with pseudocode for the overall pipeline. revision: yes

  3. Referee: [Experiments] The abstract asserts that 'extensive experiments demonstrate the superiority' across 3D and 4D tasks, yet supplies no quantitative metrics, baseline comparisons, or ablation isolating the contribution of the global/local graph fusion versus the world model alone. This absence prevents verification that the cognition component is the source of any reported improvement in physical consistency.

    Authors: The manuscript reports quantitative results on the ST-4D dataset in Section 4, including comparisons against baselines such as 4DGen and others using metrics like FID, FVD, and spatiotemporal consistency scores (Tables 1–2), with ablations on graph fusion and world model components in Section 4.3. We apologize if these were not sufficiently prominent or if the referee could not locate the isolating ablations. To address the concern directly, we will expand the main experiments section to foreground the key metrics, baseline tables, and component ablations, ensuring the contribution of the cognition graph is explicitly quantified. revision: partial
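
As a sketch of what the second response's promised formulation could look like, the module below pairs a simple message-passing update on the cognition graph with cross-attention injection of the derived state into the diffusion latents. Both pieces are plausible realizations assumed for illustration; they are not equations taken from the paper.

    import torch
    import torch.nn as nn

    class CognitionConditioning(nn.Module):
        """Hypothetical forms of (a) the state-derivation update on the cognition graph
        and (b) the cross-attention conditioning operator the rebuttal promises to add."""
        def __init__(self, dim=256, heads=4):
            super().__init__()
            self.message = nn.Linear(2 * dim, dim)   # edge message from (node, neighbor mean)
            self.update  = nn.GRUCell(dim, dim)      # recurrent node-state update
            self.inject  = nn.MultiheadAttention(dim, heads, batch_first=True)

        def derive_state(self, nodes, adj):
            # nodes: (N, dim) cognition-graph features; adj: (N, N) boolean adjacency.
            deg = adj.float().sum(dim=1, keepdim=True).clamp(min=1.0)
            neigh = adj.float() @ nodes / deg                     # mean-aggregate neighbors
            msg = self.message(torch.cat([nodes, neigh], dim=-1))
            return self.update(msg, nodes)                        # next-step node states

        def condition(self, latent_tokens, state):
            # latent_tokens: (B, T, dim) diffusion latents; state: (B, N, dim) derived states.
            out, _ = self.inject(latent_tokens, state, state)     # cross-attention conditioning
            return latent_tokens + out                            # residual feature injection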

Circularity Check

0 steps flagged

No circularity: framework description does not reduce claims to self-defined inputs or fitted predictions.

Full rationale

The provided abstract and design outline present ST-Gen4D as a modular pipeline that encodes multimodal features into graphs, fuses them into a cognition representation, reasons via a world model, and conditions a latent diffusion process. No equations, parameter-fitting steps, or self-citations are described that would make any output (such as the claimed guarantee of structural rationality) equivalent to the inputs by construction. The central claim is framed as an empirical outcome of integration with existing generative priors rather than a definitional or fitted tautology, leaving the derivation self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The abstract introduces several new conceptual entities without providing independent evidence or derivations for them.

axioms (1)
  • domain assumption Global appearance structure and local dynamic topology together empower 4D spatiotemporal cognition.
    Stated as the core insight guiding the four key designs.
invented entities (2)
  • 4D cognition graph no independent evidence
    purpose: Fused representation that encodes both global appearance and local dynamics for world-model reasoning.
    Introduced as the output of the spatiotemporal cognition stage.
  • ST-4D datasets no independent evidence
    purpose: Aggregated training and evaluation data for 4D generation.
    Mentioned as a contribution but no access or construction details given.

pith-pipeline@v0.9.0 · 5605 in / 1388 out tokens · 38936 ms · 2026-05-11T01:52:54.083860+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.


Reference graph

Works this paper leans on

62 extracted references · 62 canonical work pages · 12 internal anchors

  1. [1]

    Imagen Video: High Definition Video Generation with Diffusion Models

    Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P Kingma, Ben Poole, Mohammad Norouzi, David J Fleet, et al. Imagen video: High definition video generation with diffusion models.arXiv preprint arXiv:2210.02303, 2022

  2. [2]

    Make-A-Video: Text-to-Video Generation without Text-Video Data

    Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, et al. Make-a-video: Text-to-video generation without text-video data.arXiv preprint arXiv:2209.14792, 2022

  3. [3]

    Align your latents: High-resolution video synthesis with latent diffusion models

    Andreas Blattmann, Robin Rombach, Huan Ling, Tim Dockhorn, Seung Wook Kim, Sanja Fidler, and Karsten Kreis. Align your latents: High-resolution video synthesis with latent diffusion models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 22563–22575, 2023

  4. [4]

    CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers

    Wenyi Hong, Ming Ding, Wendi Zheng, Xinghan Liu, and Jie Tang. Cogvideo: Large-scale pretraining for text-to-video generation via transformers.arXiv preprint arXiv:2205.15868, 2022

  5. [5]

    Phenaki: Variable length video generation from open domain textual description

    Ruben Villegas, Mohammad Babaeizadeh, Pieter-Jan Kindermans, Hernan Moraldo, Han Zhang, Mohammad Taghi Saffar, Santiago Castro, Julius Kunze, and Dumitru Erhan. Phenaki: Variable length video generation from open domain textual description.arXiv preprint arXiv:2210.02399, 2022

  6. [6]

    Video generation models as world simulators

    Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Leo Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, et al. Video generation models as world simulators. OpenAI Blog, 1(8):1, 2024

  7. [7]

    Consistent4d: Consistent 360° dynamic object generation from monocular video

    Yanqin Jiang, Li Zhang, Jin Gao, Weimin Hu, and Yao Yao. Consistent4d: Consistent 360° dynamic object generation from monocular video. arXiv preprint arXiv:2311.02848, 2023

  8. [8]

    Animate124: Animating one image to 4d dynamic scene

    Yuyang Zhao, Zhiwen Yan, Enze Xie, Lanqing Hong, Zhenguo Li, and Gim Hee Lee. Animate124: Animating one image to 4d dynamic scene. arXiv preprint arXiv:2311.14603, 2023

  9. [9]

    Worldsplat: Gaussian-centric feed-forward 4d scene generation for autonomous driving

    Ziyue Zhu, Zhanqian Wu, Zhenxin Zhu, Lijun Zhou, Haiyang Sun, Bing Wan, Kun Ma, Guang Chen, Hangjun Ye, Jin Xie, et al. Worldsplat: Gaussian-centric feed-forward 4d scene generation for autonomous driving.arXiv preprint arXiv:2509.23402, 2025

  10. [10]

    Align your gaussians: Text-to-4d with dynamic 3d gaussians and composed diffusion models

    Huan Ling, Seung Wook Kim, Antonio Torralba, Sanja Fidler, and Karsten Kreis. Align your gaussians: Text-to-4d with dynamic 3d gaussians and composed diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8576–8588, 2024

  11. [11]

    4d-fy: Text-to-4d generation using hybrid score distillation sampling

    Sherwin Bahmani, Ivan Skorokhodov, Victor Rong, Gordon Wetzstein, Leonidas Guibas, Peter Wonka, Sergey Tulyakov, Jeong Joon Park, Andrea Tagliasacchi, and David B Lindell. 4d-fy: Text-to-4d generation using hybrid score distillation sampling. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7996–8006, 2024

  12. [12]

    Choreographing a world of dynamic objects

    Yanzhe Lyu, Chen Geng, Karthik Dharmarajan, Yunzhi Zhang, Hadi Alzayer, Shangzhe Wu, and Jiajun Wu. Choreographing a world of dynamic objects.arXiv preprint arXiv:2601.04194, 2026

  13. [13]

    4d-vggt: A general foundation model with spatiotemporal awareness for dynamic scene geometry estimation

    Haonan Wang, Hanyu Zhou, Haoyue Liu, and Luxin Yan. 4d-vggt: A general foundation model with spatiotemporal awareness for dynamic scene geometry estimation. arXiv preprint arXiv:2511.18416, 2025

  14. [14]

    Lyra: Generative 3d scene reconstruction via video diffusion model self-distillation

    Sherwin Bahmani, Tianchang Shen, Jiawei Ren, Jiahui Huang, Yifeng Jiang, Haithem Turki, Andrea Tagliasacchi, David B Lindell, Zan Gojcic, Sanja Fidler, et al. Lyra: Generative 3d scene reconstruction via video diffusion model self-distillation.arXiv preprint arXiv:2509.19296, 2025

  15. [15]

    A unified approach for text-and image-guided 4d scene generation

    Yufeng Zheng, Xueting Li, Koki Nagano, Sifei Liu, Otmar Hilliges, and Shalini De Mello. A unified approach for text-and image-guided 4d scene generation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7300–7309, 2024

  16. [16]

    Free4d: Tuning-free 4d scene generation with spatial-temporal consistency

    Tianqi Liu, Zihao Huang, Zhaoxi Chen, Guangcong Wang, Shoukang Hu, Liao Shen, Huiqiang Sun, Zhiguo Cao, Wei Li, and Ziwei Liu. Free4d: Tuning-free 4d scene generation with spatial-temporal consistency. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 25571–25582, 2025

  17. [17]

    DreamFusion: Text-to-3D using 2D Diffusion

    Ben Poole, Ajay Jain, Jonathan T Barron, and Ben Mildenhall. Dreamfusion: Text-to-3d using 2d diffusion.arXiv preprint arXiv:2209.14988, 2022

  18. [18]

    A path towards autonomous machine intelligence version 0.9

    Yann LeCun et al. A path towards autonomous machine intelligence version 0.9. 2, 2022-06-27. Open Review, 62(1):1–62, 2022

  19. [19]

    Revisiting Feature Prediction for Learning Visual Representations from Video

    Adrien Bardes, Quentin Garrido, Jean Ponce, Xinlei Chen, Michael Rabbat, Yann LeCun, Mahmoud Assran, and Nicolas Ballas. Revisiting feature prediction for learning visual representations from video. arXiv preprint arXiv:2404.08471, 2024

  20. [20]

    LeWorldModel: Stable End-to-End Joint-Embedding Predictive Architecture from Pixels

    Lucas Maes, Quentin Le Lidec, Damien Scieur, Yann LeCun, and Randall Balestriero. Leworldmodel: Stable end-to-end joint-embedding predictive architecture from pixels. arXiv preprint arXiv:2603.19312, 2026

  21. [21]

    Gwm: Towards scalable gaussian world models for robotic manipulation

    Guanxing Lu, Baoxiong Jia, Puhao Li, Yixin Chen, Ziwei Wang, Yansong Tang, and Siyuan Huang. Gwm: Towards scalable gaussian world models for robotic manipulation. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 9263–9274, 2025

  22. [22]

    Teleworld: Towards dynamic multimodal synthesis with a 4d world model

    Yabo Chen, Yuanzhi Liang, Jiepeng Wang, Tingxi Chen, Junfei Cheng, Zixiao Gu, Yuyang Huang, Zicheng Jiang, Wei Li, Tian Li, et al. Teleworld: Towards dynamic multimodal synthesis with a 4d world model.arXiv preprint arXiv:2601.00051, 2025

  23. [23]

    Neoverse: Enhancing 4d world model with in-the-wild monocular videos

    Yuxue Yang, Lue Fan, Ziqi Shi, Junran Peng, Feng Wang, and Zhaoxiang Zhang. Neoverse: Enhancing 4d world model with in-the-wild monocular videos.arXiv preprint arXiv:2601.00393, 2026

  24. [24]

    InSpatio-WorldFM: An Open-Source Real-Time Generative Frame Model

    InSpatio Team, Xiaoyu Zhang, Weihong Pan, Zhichao Ye, Jialin Liu, Yipeng Chen, Nan Wang, Xiaojun Xiang, Weijian Xie, Yifu Wang, et al. Inspatio-worldfm: An open-source real-time generative frame model.arXiv preprint arXiv:2603.11911, 2026

  25. [25]

    Wan: Open and Advanced Large-Scale Video Generative Models

    Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 2025

  26. [26]

    Deep residual learning for image recognition

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016

  27. [27]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. InInternational conference on machine learning, pages 8748–8763. PmLR, 2021

  28. [28]

    Internvideo2: Scaling foundation models for multimodal video understanding

    Yi Wang, Kunchang Li, Xinhao Li, Jiashuo Yu, Yinan He, Guo Chen, Baoqi Pei, Rongkun Zheng, Zun Wang, Yansong Shi, et al. Internvideo2: Scaling foundation models for multimodal video understanding. In European Conference on Computer Vision, pages 396–416. Springer, 2024

  29. [29]

    Genie: Generative interactive environments

    Jake Bruce, Michael D Dennis, Ashley Edwards, Jack Parker-Holder, Yuge Shi, Edward Hughes, Matthew Lai, Aditi Mavalankar, Richie Steigerwald, Chris Apps, et al. Genie: Generative interactive environments. InForty-first International Conference on Machine Learning, 2024

  30. [30]

    ARKitScenes: A Diverse Real-World Dataset For 3D Indoor Scene Understanding Using Mobile RGB-D Data

    Gilad Baruch, Zhuoyuan Chen, Afshin Dehghan, Tal Dimry, Yuri Feigin, Peter Fu, Thomas Gebauer, Brandon Joffe, Daniel Kurz, Arik Schwartz, et al. Arkitscenes: A diverse real-world dataset for 3d indoor scene understanding using mobile rgb-d data. arXiv preprint arXiv:2111.08897, 2021

  31. [31]

    Scannet: Richly-annotated 3d reconstructions of indoor scenes

    Angela Dai, Angel X Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 5828–5839, 2017

  32. [32]

    Scannet++: A high-fidelity dataset of 3d indoor scenes

    Chandan Yeshwanth, Yueh-Cheng Liu, Matthias Nießner, and Angela Dai. Scannet++: A high-fidelity dataset of 3d indoor scenes. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 12–22, 2023

  33. [33]

    Objaverse-xl: A universe of 10m+ 3d objects

    Matt Deitke, Ruoshi Liu, Matthew Wallingford, Huong Ngo, Oscar Michel, Aditya Kusupati, Alan Fan, Christian Laforte, Vikram Voleti, Samir Yitzhak Gadre, et al. Objaverse-xl: A universe of 10m+ 3d objects. Advances in Neural Information Processing Systems, 36:35799–35813, 2023

  34. [34]

    Stereo Magnification: Learning View Synthesis using Multiplane Images

    Tinghui Zhou, Richard Tucker, John Flynn, Graham Fyffe, and Noah Snavely. Stereo magnification: Learning view synthesis using multiplane images. arXiv preprint arXiv:1805.09817, 2018

  35. [35]

    Megadepth: Learning single-view depth prediction from internet photos

    Zhengqi Li and Noah Snavely. Megadepth: Learning single-view depth prediction from internet photos. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 2041–2050, 2018

  36. [36]

    Dl3dv-10k: A large-scale scene dataset for deep learning-based 3d vision

    Lu Ling, Yichen Sheng, Zhi Tu, Wentian Zhao, Cheng Xin, Kun Wan, Lantao Yu, Qianyu Guo, Zixun Yu, Yawen Lu, et al. Dl3dv-10k: A large-scale scene dataset for deep learning-based 3d vision. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22160–22169, 2024

  37. [37]

    3d-front: 3d furnished rooms with layouts and semantics

    Huan Fu, Bowen Cai, Lin Gao, Ling-Xiao Zhang, Jiaming Wang, Cao Li, Qixun Zeng, Chengyue Sun, Rongfei Jia, Binqiang Zhao, et al. 3d-front: 3d furnished rooms with layouts and semantics. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 10933– 10942, 2021

  38. [38]

    Hoi4d: A 4d egocentric dataset for category-level human-object interaction

    Yunze Liu, Yun Liu, Che Jiang, Kangbo Lyu, Weikang Wan, Hao Shen, Boqiang Liang, Zhoujie Fu, He Wang, and Li Yi. Hoi4d: A 4d egocentric dataset for category-level human-object interaction. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21013–21022, 2022

  39. [39]

    Dynamicstereo: Consistent dynamic depth from stereo videos

    Nikita Karaev, Ignacio Rocco, Benjamin Graham, Natalia Neverova, Andrea Vedaldi, and Christian Rupprecht. Dynamicstereo: Consistent dynamic depth from stereo videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13229–13239, 2023

  40. [40]

    Scalability in perception for autonomous driving: Waymo open dataset

    Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou, Yuning Chai, Benjamin Caine, et al. Scalability in perception for autonomous driving: Waymo open dataset. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2446–2454, 2020

  41. [41]

    Virtual KITTI 2

    Yohann Cabon, Naila Murray, and Martin Humenberger. Virtual kitti 2.arXiv preprint arXiv:2001.10773, 2020

  42. [42]

    Tartanair: A dataset to push the limits of visual slam

    Wenshan Wang, Delong Zhu, Xiangwei Wang, Yaoyu Hu, Yuheng Qiu, Chen Wang, Yafei Hu, Ashish Kapoor, and Sebastian Scherer. Tartanair: A dataset to push the limits of visual slam. In 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 4909–4916. IEEE, 2020

  43. [43]

    Bedlam: A synthetic dataset of bodies exhibiting detailed lifelike animated motion

    Michael J Black, Priyanka Patel, Joachim Tesch, and Jinlong Yang. Bedlam: A synthetic dataset of bodies exhibiting detailed lifelike animated motion. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8726–8737, 2023

  44. [44]

    Pointodyssey: A large-scale synthetic dataset for long-term point tracking

    Yang Zheng, Adam W Harley, Bokui Shen, Gordon Wetzstein, and Leonidas J Guibas. Pointodyssey: A large-scale synthetic dataset for long-term point tracking. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 19855–19865, 2023

  45. [45]

    Spatialvid: A large-scale video dataset with spatial annotations

    Jiahao Wang, Yufeng Yuan, Rujie Zheng, Youtian Lin, Jian Gao, Lin-Zhuo Chen, Yajie Bao, Yi Zhang, Chang Zeng, Yanxi Zhou, et al. Spatialvid: A large-scale video dataset with spatial annotations.arXiv preprint arXiv:2509.09676, 2025

  46. [46]

    T3Bench: Benchmarking current progress in text-to-3d generation

    Yuze He, Yushi Bai, Matthieu Lin, Wang Zhao, Yubin Hu, Jenny Sheng, Ran Yi, Juanzi Li, and Yong-Jin Liu. T3 bench: Benchmarking current progress in text-to-3d generation.arXiv preprint arXiv:2310.02977, 2023

  47. [47]

    Magic3d: High-resolution text-to-3d content creation

    Chen-Hsuan Lin, Jun Gao, Luming Tang, Towaki Takikawa, Xiaohui Zeng, Xun Huang, Karsten Kreis, Sanja Fidler, Ming-Yu Liu, and Tsung-Yi Lin. Magic3d: High-resolution text-to-3d content creation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 300–309, 2023

  48. [48]

    Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation

    Haochen Wang, Xiaodan Du, Jiahao Li, Raymond A Yeh, and Greg Shakhnarovich. Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12619–12629, 2023

  49. [49]

    Prolificdreamer: High-fidelity and diverse text-to-3d generation with variational score distillation

    Zhengyi Wang, Cheng Lu, Yikai Wang, Fan Bao, Chongxuan Li, Hang Su, and Jun Zhu. Prolificdreamer: High-fidelity and diverse text-to-3d generation with variational score distillation. Advances in neural information processing systems, 36:8406–8441, 2023

  50. [50]

    Gaussiandreamer: Fast generation from text to 3d gaussians by bridging 2d and 3d diffusion models

    Taoran Yi, Jiemin Fang, Junjie Wang, Guanjun Wu, Lingxi Xie, Xiaopeng Zhang, Wenyu Liu, Qi Tian, and Xinggang Wang. Gaussiandreamer: Fast generation from text to 3d gaussians by bridging 2d and 3d diffusion models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6796–6807, 2024

  51. [51]

    Cog2gen3d: Sculpturing 3d semantic-geometric cognition for 3d generation

    Haonan Wang, Hanyu Zhou, Haoyue Liu, Tao Gu, and Luxin Yan. Cog2gen3d: Sculpturing 3d semantic-geometric cognition for 3d generation.arXiv preprint arXiv:2603.05845, 2026

  52. [52]

    LucidDreamer: Domain-free generation of 3D gaussian splatting scenes

    Jaeyoung Chung, Suyoung Lee, Hyeongjin Nam, Jaerin Lee, and Kyoung Mu Lee. Luciddreamer: Domain-free generation of 3d gaussian splatting scenes. arXiv preprint arXiv:2311.13384, 2023

  53. [53]

    Gen3dsr: Generalizable 3d scene reconstruction via divide and conquer from a single view

    Andreea Ardelean, Mert Özer, and Bernhard Egger. Gen3dsr: Generalizable 3d scene reconstruction via divide and conquer from a single view. In 2025 International Conference on 3D Vision (3DV), pages 616–626. IEEE, 2025

  54. [54]

    Echoscene: Indoor scene generation via information echo over scene graph diffusion

    Guangyao Zhai, Evin Pınar Örnek, Dave Zhenyu Chen, Ruotong Liao, Yan Di, Nassir Navab, Federico Tombari, and Benjamin Busam. Echoscene: Indoor scene generation via information echo over scene graph diffusion. InEuropean Conference on Computer Vision, pages 167–184. Springer, 2024

  55. [55]

    Layout2scene: 3d semantic layout guided scene generation via geometry and appearance diffusion priors

    Minglin Chen, Longguang Wang, Sheng Ao, Ye Zhang, Kai Xu, and Yulan Guo. Layout2scene: 3d semantic layout guided scene generation via geometry and appearance diffusion priors.arXiv preprint arXiv:2501.02519, 2025

  56. [56]

    Vbench: Comprehensive benchmark suite for video generative models

    Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, et al. Vbench: Comprehensive benchmark suite for video generative models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21807–21818, 2024

  57. [57]

    4real: Towards photorealistic 4d scene generation via video diffusion models

    Heng Yu, Chaoyang Wang, Peiye Zhuang, Willi Menapace, Aliaksandr Siarohin, Junli Cao, Laszlo A Jeni, Sergey Tulyakov, and Hsin-Ying Lee. 4real: Towards photorealistic 4d scene generation via video diffusion models. Advances in Neural Information Processing Systems, 37:45256–45280, 2024

  58. [58]

    Genxd: Generating any 3d and 4d scenes

    Yuyang Zhao, Chung-Ching Lin, Kevin Lin, Zhiwen Yan, Linjie Li, Zhengyuan Yang, Jianfeng Wang, Gim Hee Lee, and Lijuan Wang. Genxd: Generating any 3d and 4d scenes.arXiv preprint arXiv:2411.02319, 2024

  59. [59]

    Dimensionx: Create any 3d and 4d scenes from a single image with controllable video diffusion

    Wenqiang Sun, Shuo Chen, Fangfu Liu, Zilong Chen, Yueqi Duan, Jun Zhang, and Yikai Wang. Dimensionx: Create any 3d and 4d scenes from a single image with controllable video diffusion. arXiv preprint arXiv:2411.04928, 2024

  60. [60]

    4dnex: Feed-forward 4d generative modeling made easy

    Zhaoxi Chen, Tianqi Liu, Long Zhuo, Jiawei Ren, Zeng Tao, He Zhu, Fangzhou Hong, Liang Pan, and Ziwei Liu. 4dnex: Feed-forward 4d generative modeling made easy.arXiv preprint arXiv:2508.13154, 2025

  61. [61]

    One4d: Unified 4d generation and reconstruction via decoupled lora control

    Zhenxing Mi, Yuxin Wang, and Dan Xu. One4d: Unified 4d generation and reconstruction via decoupled lora control.arXiv preprint arXiv:2511.18922, 2025

  62. [62]

    The 2017 DAVIS Challenge on Video Object Segmentation

    Jordi Pont-Tuset, Federico Perazzi, Sergi Caelles, Pablo Arbeláez, Alex Sorkine-Hornung, and Luc Van Gool. The 2017 davis challenge on video object segmentation. arXiv preprint arXiv:1704.00675, 2017