pith. machine review for the scientific record.

arxiv: 2605.07390 · v1 · submitted 2026-05-08 · 💻 cs.CV

Recognition: 2 theorem links


ST-Gen4D: Embedding 4D Spatiotemporal Cognition into World Model for 4D Generation

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 01:52 UTC · model grok-4.3

classification 💻 cs.CV
keywords 4D generation · spatiotemporal cognition · world model · graph fusion · 4D Gaussians · latent diffusion · physical consistency · topological consistency

The pith

A world model guided by fused 4D cognition graphs produces generations with structural and topological consistency.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to show that 4D generation achieves better physical consistency when models explicitly build and use 4D spatiotemporal cognition instead of adding only global constraints. It constructs this cognition by turning multimodal features into a global appearance graph and a local dynamic graph, then fuses the two with semantic bridging before feeding the result into a world model that predicts future states to condition diffusion. A reader would care because current video and 3D generators often produce motions that look right yet violate local physical rules such as object continuity or interaction topology. The framework follows four steps: multimodal encoding, graph-based cognition, world-model reasoning, and conditioned 4D Gaussian generation. If correct, the approach supplies a concrete mechanism for embedding intrinsic spatiotemporal regularities directly into the generative process.
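
To make the four-step pipeline concrete, the sketch below wires the stages together in the order the paper describes: multimodal encoding, graph-based cognition, world-model reasoning, and conditioned generation. The specific module choices (attention layers standing in for graph construction and fusion, a GRU as the world model, a linear head producing the diffusion condition) and all tensor dimensions are illustrative assumptions, not details taken from the paper.

    # Hypothetical sketch of an ST-Gen4D-style pipeline. Stage names follow the
    # paper; every internal choice (shapes, attention-based fusion, a GRU world
    # model) is an assumption made for illustration only.
    import torch
    import torch.nn as nn

    class STGen4DSketch(nn.Module):
        def __init__(self, dim=256):
            super().__init__()
            # 1) Spatiotemporal representation: project each modality into a shared token space.
            self.encode_rgb  = nn.Linear(512, dim)   # e.g. frozen image features (assumed size)
            self.encode_geom = nn.Linear(384, dim)   # e.g. depth / point features (assumed size)
            self.encode_text = nn.Linear(768, dim)   # e.g. text-encoder features (assumed size)
            # 2) Spatiotemporal cognition: global appearance and local dynamic relations, then fusion.
            self.global_graph = nn.MultiheadAttention(dim, 4, batch_first=True)
            self.local_graph  = nn.MultiheadAttention(dim, 4, batch_first=True)
            self.fuse         = nn.MultiheadAttention(dim, 4, batch_first=True)
            # 3) Spatiotemporal reasoning: autoregressive world model over cognition tokens.
            self.world_model  = nn.GRU(dim, dim, batch_first=True)
            # 4) Spatiotemporal generation: the derived state becomes the diffusion condition.
            self.to_condition = nn.Linear(dim, dim)

        def forward(self, rgb_tok, geom_tok, text_tok):
            # Stage 1: multimodal encoding into a common feature basis.
            tokens = torch.cat([self.encode_rgb(rgb_tok),
                                self.encode_geom(geom_tok),
                                self.encode_text(text_tok)], dim=1)
            # Stage 2: build graph-like relations with self-attention, then fuse the
            # global and local views (the paper bridges them with semantic tokens).
            g, _ = self.global_graph(tokens, tokens, tokens)
            l, _ = self.local_graph(tokens, tokens, tokens)
            cognition, _ = self.fuse(g, l, l)
            # Stage 3: roll the cognition tokens forward to predict future states.
            future, _ = self.world_model(cognition)
            # Stage 4: map the predicted state to a conditioning signal for latent diffusion.
            return self.to_condition(future)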

Core claim

ST-Gen4D establishes that 4D spatiotemporal cognition, formed by sculpting multimodal representations into a global appearance graph and a local dynamic graph and fusing them via semantic-bridged spatiotemporal fusion, can be integrated into a world model to derive future states; these states condition latent diffusion to generate 4D Gaussians while guaranteeing structural rationality and topological consistency.

What carries the argument

The 4D cognition graph, created by fusing the global appearance graph and the local dynamic graph through semantic-bridged spatiotemporal fusion. This graph supplies the condition for world-model reasoning and guides the latent diffusion step.
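
One plausible reading of "semantic-bridged spatiotemporal fusion" is that shared semantic tokens query both graphs and the two views are then merged into the cognition graph's node features. The sketch below is that reading only; the attention-based bridging and the final linear merge are assumptions, not the paper's stated formulation.

    import torch
    import torch.nn as nn

    class SemanticBridgedFusion(nn.Module):
        """Hypothetical fusion: semantic tokens attend into each graph, results are merged."""
        def __init__(self, dim=256, heads=4):
            super().__init__()
            self.from_global = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.from_local  = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.merge       = nn.Linear(2 * dim, dim)

        def forward(self, semantic, global_nodes, local_nodes):
            # Semantic tokens pull appearance structure from the global graph...
            g, _ = self.from_global(semantic, global_nodes, global_nodes)
            # ...and dynamic topology from the local graph.
            l, _ = self.from_local(semantic, local_nodes, local_nodes)
            # The merged tokens serve as node features of the fused 4D cognition graph.
            return self.merge(torch.cat([g, l], dim=-1))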

Load-bearing premise

Sculpting multimodal features into global appearance and local dynamic graphs and fusing them produces a cognition representation that improves physical consistency beyond what standard generative priors already supply.

What would settle it

An ablation that removes the graph construction and fusion steps yet yields equal or higher scores on metrics of local dynamic topology and physical plausibility on the ST-4D dataset would falsify the central claim.
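
A minimal harness for that falsification test might look like the following. Every callable it receives (build_model with a use_cognition_graph flag, topology_score, plausibility_score) is a hypothetical stand-in for whatever the released code and the ST-4D evaluation protocol actually provide.

    # Hypothetical ablation harness; the callables passed in are placeholders,
    # not APIs from the paper or its codebase.
    def falsification_test(dataset, build_model, topology_score, plausibility_score):
        """Return True if removing graph construction and fusion matches or beats
        the full model on both metrics, which would falsify the central claim."""
        full    = build_model(use_cognition_graph=True)
        ablated = build_model(use_cognition_graph=False)
        full_scores    = (topology_score(full, dataset), plausibility_score(full, dataset))
        ablated_scores = (topology_score(ablated, dataset), plausibility_score(ablated, dataset))
        return all(a >= f for a, f in zip(ablated_scores, full_scores))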

Figures

Figures reproduced from arXiv: 2605.07390 by Hanyu Zhou, Haonan Wang, Luxin Yan, Tao Gu.

Figure 1. Paradigm shift of 4D generation. 2D-priors-guided generation mainly relies on 2D video diffusion, leading to collapsed appearance. Extrinsic-constraints-guided generation introduces macro-scale constraints but suffers from motion artifacts. Our intrinsic 4D cognition-guided paradigm leverages spatiotemporal cognition to ensure plausible appearance and coherent dynamics.
Figure 2. Overview of ST-Gen4D. The model first obtains multiple cognition tokens, which are then encoded into a structured spatiotemporal cognition graph. An autoregressive world model then predicts state transitions based on the cognition and action tokens. Finally, the deduced state serves as the condition of latent diffusion to generate 4D Gaussians.
Figure 3. Cross-view and cross-time consistency analysis.
Figure 4. Feature robustness and discriminability analysis.
Figure 5. Architectural details of the spatiotemporal structured cognition. (a, b) Construction pipelines for the semantic, global, and local graphs from foundational representations. (c) Details of the fusion process between the global and local graphs, guided by the common semantic graph.
Figure 6. Composition and construction pipeline of ST-4D.
Figure 7. Visualization of text-to-3D generation on the T3Bench dataset.
Figure 8. Visualization of image-to-3D generation on the 3D-Front dataset.
Figure 9. Visualization of text-to-4D generation on the ST-4D dataset.
Figure 10. Visualization of image-to-4D generation on the DAVIS dataset.
Figure 11. Ablation of separate global and local graphs. The visualization reveals that ablating either the global or the local graph leads to obvious appearance degradation or motion artifacts, and the performance metrics decline significantly when either branch is removed.
Original abstract

Generative models have achieved success in producing apparently coherent 2D videos, but remain challenging in the physical world due to lack of 4D spatiotemporal scale. Typically, existing 4D generative models directly embed macro scale constraints to enhance overall spatiotemporal consistency. However, these methods only ensure global appearance coherence and fail to reveal the local dynamics of the physical world. Our insight is that global appearance structure and local dynamic topology empower 4D spatiotemporal cognition, thereby enabling 4D generation with spatiotemporal regularities. In this work, we propose ST-Gen4D, a 4D generation framework with 4D spatiotemporal cognition-based world model. Our model is guided by four key designs: 1) Spatiotemporal representation. We encode various modalities into multiple representations as a feature basis. 2) Spatiotemporal cognition. We sculpture these representations into global appearance graph and local dynamic graph, and fuse them via semantic-bridged spatiotemporal fusion to obtain a 4D cognition graph. 3) Spatiotemporal reasoning. We utilize a world model to derive future state based on the 4D cognition. 4) Spatiotemporal generation. We leverage the derived cognition as condition to guide latent diffusion for 4D Gaussian generation. By deeply integrating 4D intrinsic cognition with generative priors, our model guarantees the structural rationality and topological consistency of 4D generation. Moreover, we propose ST-4D datasets by aggregating public 4D datasets and self-built subset. Extensive experiments demonstrate the superiority of our ST-Gen4D across 3D and 4D generation tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes ST-Gen4D, a 4D generation framework that embeds spatiotemporal cognition into a world model. It encodes multimodal inputs into representations, sculpts them into a global appearance graph and local dynamic graph, fuses them via semantic-bridged spatiotemporal fusion to form a 4D cognition graph, uses a world model to derive future states from this graph, and conditions latent diffusion on the derived cognition to generate 4D Gaussians. The authors introduce the ST-4D dataset by aggregating public sources and self-collected data, and claim that this integration guarantees structural rationality and topological consistency while demonstrating superiority on 3D and 4D generation tasks.

Significance. If the quantitative results and ablations hold, the work could advance 4D generative modeling by explicitly structuring multimodal features into cognition graphs that improve local dynamic consistency beyond standard generative priors. The ST-4D dataset aggregation is a concrete community contribution. However, the significance is limited by the absence of explicit mechanisms (e.g., topological regularizers or physics-informed losses) that would differentiate the claimed guarantee from learned-prior behavior.
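
To illustrate the kind of explicit mechanism the report is asking for, the sketch below computes a topological-consistency score over per-frame nearest-neighbor graphs built on Gaussian centers. Everything here (the k-NN graph, the overlap score, its use as a metric or, after a soft relaxation, as a regularizer) is the reviewer's hypothetical, not a component of ST-Gen4D.

    import torch

    def knn_edges(centers, k=8):
        # Hypothetical helper: boolean k-nearest-neighbor adjacency over (N, 3) Gaussian centers.
        dist = torch.cdist(centers, centers)          # pairwise distances
        dist.fill_diagonal_(float("inf"))             # exclude self-edges
        idx = dist.topk(k, largest=False).indices     # (N, k) neighbor indices
        adj = torch.zeros_like(dist)
        adj.scatter_(1, idx, 1.0)
        return (adj + adj.T) > 0                      # symmetrized adjacency

    def topology_consistency(centers_t, centers_t1, k=8):
        # Overlap of neighbor graphs between consecutive frames: 1.0 means the local
        # topology is unchanged. Hard neighbor selection makes this a metric rather
        # than a differentiable loss; a soft, distance-weighted relaxation would be
        # needed to use it as a training regularizer.
        a, b = knn_edges(centers_t, k), knn_edges(centers_t1, k)
        intersection = (a & b).sum().float()
        union = (a | b).sum().float().clamp(min=1.0)
        return intersection / union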

major comments (3)
  1. [Abstract] The central claim that 'deeply integrating 4D intrinsic cognition with generative priors... guarantees the structural rationality and topological consistency' is load-bearing yet unsupported by any described enforcement mechanism. The semantic-bridged spatiotemporal fusion and world-model derivation steps are described only at a high level, without a regularizer, constraint, or loss term that would enforce topology or rationality beyond what the underlying diffusion prior already provides.
  2. [Method] In the spatiotemporal cognition and reasoning sections, no explicit formulation is given for how the 4D cognition graph is used to derive future states in the world model or to condition the latent diffusion; without an equation or pseudocode for the conditioning operator, or any topology-preserving term, it is unclear whether the graph fusion produces measurable gains over standard priors.
  3. [Experiments] The abstract asserts that 'extensive experiments demonstrate the superiority' across 3D and 4D tasks, yet supplies no quantitative metrics, baseline comparisons, or ablation isolating the contribution of the global/local graph fusion versus the world model alone. This absence prevents verification that the cognition component is the source of any reported improvement in physical consistency.
minor comments (2)
  1. [Method] Notation for the global appearance graph and local dynamic graph is introduced without a clear diagram or formal definition of nodes/edges, making it difficult to reproduce the sculpting and fusion steps.
  2. [Datasets] The ST-4D dataset construction is mentioned but lacks details on train/test splits, annotation protocols, or licensing, which should be added for reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important areas for improving clarity around our claims, formulations, and experimental evidence. We address each major comment below and indicate planned revisions.

Point-by-point responses
  1. Referee: [Abstract] The central claim that 'deeply integrating 4D intrinsic cognition with generative priors... guarantees the structural rationality and topological consistency' is load-bearing yet unsupported by any described enforcement mechanism. The semantic-bridged spatiotemporal fusion and world-model derivation steps are described only at a high level, without a regularizer, constraint, or loss term that would enforce topology or rationality beyond what the underlying diffusion prior already provides.

    Authors: We acknowledge that the abstract's phrasing of 'guarantees' is strong and could be interpreted as implying an explicit enforcement mechanism such as a dedicated regularizer. The manuscript relies on the structured 4D cognition graph (formed via semantic-bridged fusion of global appearance and local dynamic graphs) and the world model's state derivation to promote consistency through representation design rather than an additional loss term. We agree this distinction should be clearer. In the revision we will reword the abstract to state that the integration 'enhances' structural rationality and topological consistency, and we will add a brief reference to the design choices in Sections 3.2 and 3.3. revision: yes

  2. Referee: [Method] In the spatiotemporal cognition and reasoning sections, no explicit formulation is given for how the 4D cognition graph is used to derive future states in the world model or to condition the latent diffusion; without an equation or pseudocode for the conditioning operator, or any topology-preserving term, it is unclear whether the graph fusion produces measurable gains over standard priors.

    Authors: The full manuscript describes the world model in Section 3.3 as using the 4D cognition graph to derive future states via graph-based propagation, and the generation step in Section 3.4 conditions the latent diffusion through feature injection. However, we agree that the absence of explicit equations or pseudocode makes the conditioning operator and any implicit topology preservation difficult to verify. We will add the mathematical formulation for state derivation (e.g., the GNN update rule on the cognition graph) and the cross-attention-based conditioning operator in the revised method section, along with pseudocode for the overall pipeline. revision: yes

  3. Referee: [Experiments] The abstract asserts that 'extensive experiments demonstrate the superiority' across 3D and 4D tasks, yet supplies no quantitative metrics, baseline comparisons, or ablation isolating the contribution of the global/local graph fusion versus the world model alone. This absence prevents verification that the cognition component is the source of any reported improvement in physical consistency.

    Authors: The manuscript reports quantitative results on the ST-4D dataset in Section 4, including comparisons against baselines such as 4DGen and others using metrics like FID, FVD, and spatiotemporal consistency scores (Tables 1–2), with ablations on graph fusion and world model components in Section 4.3. We apologize if these were not sufficiently prominent or if the referee could not locate the isolating ablations. To address the concern directly, we will expand the main experiments section to foreground the key metrics, baseline tables, and component ablations, ensuring the contribution of the cognition graph is explicitly quantified. revision: partial
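
As a sketch of what the second response's promised formulation could look like, the module below pairs a simple message-passing update on the cognition graph with cross-attention injection of the derived state into the diffusion latents. Both pieces are plausible realizations assumed for illustration; they are not equations taken from the paper.

    import torch
    import torch.nn as nn

    class CognitionConditioning(nn.Module):
        """Hypothetical forms of (a) the state-derivation update on the cognition graph
        and (b) the cross-attention conditioning operator the rebuttal promises to add."""
        def __init__(self, dim=256, heads=4):
            super().__init__()
            self.message = nn.Linear(2 * dim, dim)   # edge message from (node, neighbor mean)
            self.update  = nn.GRUCell(dim, dim)      # recurrent node-state update
            self.inject  = nn.MultiheadAttention(dim, heads, batch_first=True)

        def derive_state(self, nodes, adj):
            # nodes: (N, dim) cognition-graph features; adj: (N, N) boolean adjacency.
            deg = adj.float().sum(dim=1, keepdim=True).clamp(min=1.0)
            neigh = adj.float() @ nodes / deg                     # mean-aggregate neighbors
            msg = self.message(torch.cat([nodes, neigh], dim=-1))
            return self.update(msg, nodes)                        # next-step node states

        def condition(self, latent_tokens, state):
            # latent_tokens: (B, T, dim) diffusion latents; state: (B, N, dim) derived states.
            out, _ = self.inject(latent_tokens, state, state)     # cross-attention conditioning
            return latent_tokens + out                            # residual feature injection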

Circularity Check

0 steps flagged

No circularity: framework description does not reduce claims to self-defined inputs or fitted predictions.

Full rationale

The provided abstract and design outline present ST-Gen4D as a modular pipeline that encodes multimodal features into graphs, fuses them into a cognition representation, reasons via a world model, and conditions a latent diffusion process. No equations, parameter-fitting steps, or self-citations are described that would make any output (such as the claimed guarantee of structural rationality) equivalent to the inputs by construction. The central claim is framed as an empirical outcome of integration with existing generative priors rather than a definitional or fitted tautology, leaving the derivation self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The abstract introduces several new conceptual entities without providing independent evidence or derivations for them.

axioms (1)
  • domain assumption Global appearance structure and local dynamic topology together empower 4D spatiotemporal cognition.
    Stated as the core insight guiding the four key designs.
invented entities (2)
  • 4D cognition graph no independent evidence
    purpose: Fused representation that encodes both global appearance and local dynamics for world-model reasoning.
    Introduced as the output of the spatiotemporal cognition stage.
  • ST-4D datasets no independent evidence
    purpose: Aggregated training and evaluation data for 4D generation.
    Mentioned as a contribution but no access or construction details given.

pith-pipeline@v0.9.0 · 5605 in / 1388 out tokens · 38936 ms · 2026-05-11T01:52:54.083860+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.


Reference graph

Works this paper leans on

62 extracted references · 62 canonical work pages · 12 internal anchors

  1. [1]

    Imagen Video: High Definition Video Generation with Diffusion Models

    Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P Kingma, Ben Poole, Mohammad Norouzi, David J Fleet, et al. Imagen video: High definition video generation with diffusion models.arXiv preprint arXiv:2210.02303, 2022

  2. [2]

    Make-A-Video: Text-to-Video Generation without Text-Video Data

    Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, et al. Make-a-video: Text-to-video generation without text-video data.arXiv preprint arXiv:2209.14792, 2022

  3. [3]

    Align your latents: High-resolution video synthesis with latent diffusion models

    Andreas Blattmann, Robin Rombach, Huan Ling, Tim Dockhorn, Seung Wook Kim, Sanja Fidler, and Karsten Kreis. Align your latents: High-resolution video synthesis with latent diffusion models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 22563–22575, 2023

  4. [4]

    CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers

    Wenyi Hong, Ming Ding, Wendi Zheng, Xinghan Liu, and Jie Tang. Cogvideo: Large-scale pretraining for text-to-video generation via transformers.arXiv preprint arXiv:2205.15868, 2022

  5. [5]

    Phenaki: Variable length video generation from open domain textual description

    Ruben Villegas, Mohammad Babaeizadeh, Pieter-Jan Kindermans, Hernan Moraldo, Han Zhang, Mohammad Taghi Saffar, Santiago Castro, Julius Kunze, and Dumitru Erhan. Phenaki: Variable length video generation from open domain textual description.arXiv preprint arXiv:2210.02399, 2022

  6. [6]

    Video generation models as world simulators

    Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Leo Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, et al. Video generation models as world simulators. OpenAI Blog, 1(8):1, 2024

  7. [7]

    Consistent4d: Consistent 360° dynamic object generation from monocular video

    Yanqin Jiang, Li Zhang, Jin Gao, Weimin Hu, and Yao Yao. Consistent4d: Consistent 360° dynamic object generation from monocular video. arXiv preprint arXiv:2311.02848, 2023

  8. [8]

    Animate124: Animating one image to 4d dynamic scene

    Yuyang Zhao, Zhiwen Yan, Enze Xie, Lanqing Hong, Zhenguo Li, and Gim Hee Lee. Animate124: Animating one image to 4d dynamic scene. arXiv preprint arXiv:2311.14603, 2023

  9. [9]

    Worldsplat: Gaussian-centric feed-forward 4d scene generation for autonomous driving

    Ziyue Zhu, Zhanqian Wu, Zhenxin Zhu, Lijun Zhou, Haiyang Sun, Bing Wan, Kun Ma, Guang Chen, Hangjun Ye, Jin Xie, et al. Worldsplat: Gaussian-centric feed-forward 4d scene generation for autonomous driving.arXiv preprint arXiv:2509.23402, 2025

  10. [10]

    Align your gaussians: Text-to-4d with dynamic 3d gaussians and composed diffusion models

    Huan Ling, Seung Wook Kim, Antonio Torralba, Sanja Fidler, and Karsten Kreis. Align your gaussians: Text-to-4d with dynamic 3d gaussians and composed diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8576–8588, 2024

  11. [11]

    4d-fy: Text-to-4d generation using hybrid score distillation sampling

    Sherwin Bahmani, Ivan Skorokhodov, Victor Rong, Gordon Wetzstein, Leonidas Guibas, Peter Wonka, Sergey Tulyakov, Jeong Joon Park, Andrea Tagliasacchi, and David B Lindell. 4d-fy: Text-to-4d generation using hybrid score distillation sampling. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7996–8006, 2024

  12. [12]

    Choreographing a world of dynamic objects

    Yanzhe Lyu, Chen Geng, Karthik Dharmarajan, Yunzhi Zhang, Hadi Alzayer, Shangzhe Wu, and Jiajun Wu. Choreographing a world of dynamic objects.arXiv preprint arXiv:2601.04194, 2026

  13. [13]

    4d-vggt: A general foundation model with spatiotemporal awareness for dynamic scene geometry estimation

    Haonan Wang, Hanyu Zhou, Haoyue Liu, and Luxin Yan. 4d-vggt: A general foundation model with spatiotemporal awareness for dynamic scene geometry estimation. arXiv preprint arXiv:2511.18416, 2025

  14. [14]

    Lyra: Generative 3d scene reconstruction via video diffusion model self-distillation

    Sherwin Bahmani, Tianchang Shen, Jiawei Ren, Jiahui Huang, Yifeng Jiang, Haithem Turki, Andrea Tagliasacchi, David B Lindell, Zan Gojcic, Sanja Fidler, et al. Lyra: Generative 3d scene reconstruction via video diffusion model self-distillation.arXiv preprint arXiv:2509.19296, 2025

  15. [15]

    A unified approach for text-and image-guided 4d scene generation

    Yufeng Zheng, Xueting Li, Koki Nagano, Sifei Liu, Otmar Hilliges, and Shalini De Mello. A unified approach for text-and image-guided 4d scene generation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7300–7309, 2024

  16. [16]

    Free4d: Tuning-free 4d scene generation with spatial-temporal consistency

    Tianqi Liu, Zihao Huang, Zhaoxi Chen, Guangcong Wang, Shoukang Hu, Liao Shen, Huiqiang Sun, Zhiguo Cao, Wei Li, and Ziwei Liu. Free4d: Tuning-free 4d scene generation with spatial-temporal consistency. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 25571–25582, 2025

  17. [17]

    DreamFusion: Text-to-3D using 2D Diffusion

    Ben Poole, Ajay Jain, Jonathan T Barron, and Ben Mildenhall. Dreamfusion: Text-to-3d using 2d diffusion.arXiv preprint arXiv:2209.14988, 2022

  18. [18]

    A path towards autonomous machine intelligence version 0.9

    Yann LeCun et al. A path towards autonomous machine intelligence version 0.9. 2, 2022-06-27. Open Review, 62(1):1–62, 2022

  19. [19]

    Revisiting Feature Prediction for Learning Visual Representations from Video

    Adrien Bardes, Quentin Garrido, Jean Ponce, Xinlei Chen, Michael Rabbat, Yann LeCun, Mahmoud Assran, and Nicolas Ballas. Revisiting feature prediction for learning visual representations from video. arXiv preprint arXiv:2404.08471, 2024

  20. [20]

    LeWorldModel: Stable End-to-End Joint-Embedding Predictive Architecture from Pixels

    Lucas Maes, Quentin Le Lidec, Damien Scieur, Yann LeCun, and Randall Balestriero. Leworldmodel: Stable end-to-end joint-embedding predictive architecture from pixels. arXiv preprint arXiv:2603.19312, 2026

  21. [21]

    Gwm: Towards scalable gaussian world models for robotic manipulation

    Guanxing Lu, Baoxiong Jia, Puhao Li, Yixin Chen, Ziwei Wang, Yansong Tang, and Siyuan Huang. Gwm: Towards scalable gaussian world models for robotic manipulation. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 9263–9274, 2025

  22. [22]

    Teleworld: Towards dynamic multimodal synthesis with a 4d world model

    Yabo Chen, Yuanzhi Liang, Jiepeng Wang, Tingxi Chen, Junfei Cheng, Zixiao Gu, Yuyang Huang, Zicheng Jiang, Wei Li, Tian Li, et al. Teleworld: Towards dynamic multimodal synthesis with a 4d world model.arXiv preprint arXiv:2601.00051, 2025

  23. [23]

    Neoverse: Enhancing 4d world model with in-the-wild monocular videos

    Yuxue Yang, Lue Fan, Ziqi Shi, Junran Peng, Feng Wang, and Zhaoxiang Zhang. Neoverse: Enhancing 4d world model with in-the-wild monocular videos.arXiv preprint arXiv:2601.00393, 2026

  24. [24]

    InSpatio-WorldFM: An Open-Source Real-Time Generative Frame Model

    InSpatio Team, Xiaoyu Zhang, Weihong Pan, Zhichao Ye, Jialin Liu, Yipeng Chen, Nan Wang, Xiaojun Xiang, Weijian Xie, Yifu Wang, et al. Inspatio-worldfm: An open-source real-time generative frame model.arXiv preprint arXiv:2603.11911, 2026

  25. [25]

    Wan: Open and Advanced Large-Scale Video Generative Models

    Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 2025

  26. [26]

    Deep residual learning for image recognition

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016

  27. [27]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. InInternational conference on machine learning, pages 8748–8763. PmLR, 2021

  28. [28]

    Internvideo2: Scaling foundation models for multimodal video understanding

    Yi Wang, Kunchang Li, Xinhao Li, Jiashuo Yu, Yinan He, Guo Chen, Baoqi Pei, Rongkun Zheng, Zun Wang, Yansong Shi, et al. Internvideo2: Scaling foundation models for multimodal video understanding. In European Conference on Computer Vision, pages 396–416. Springer, 2024

  29. [29]

    Genie: Generative interactive environments

    Jake Bruce, Michael D Dennis, Ashley Edwards, Jack Parker-Holder, Yuge Shi, Edward Hughes, Matthew Lai, Aditi Mavalankar, Richie Steigerwald, Chris Apps, et al. Genie: Generative interactive environments. InForty-first International Conference on Machine Learning, 2024

  30. [30]

    ARKitScenes: A Diverse Real-World Dataset For 3D Indoor Scene Understanding Using Mobile RGB-D Data

    Gilad Baruch, Zhuoyuan Chen, Afshin Dehghan, Tal Dimry, Yuri Feigin, Peter Fu, Thomas Gebauer, Brandon Joffe, Daniel Kurz, Arik Schwartz, et al. Arkitscenes: A diverse real-world dataset for 3d indoor scene understanding using mobile rgb-d data. arXiv preprint arXiv:2111.08897, 2021

  31. [31]

    Scannet: Richly-annotated 3d reconstructions of indoor scenes

    Angela Dai, Angel X Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 5828–5839, 2017

  32. [32]

    Scannet++: A high-fidelity dataset of 3d indoor scenes

    Chandan Yeshwanth, Yueh-Cheng Liu, Matthias Nießner, and Angela Dai. Scannet++: A high-fidelity dataset of 3d indoor scenes. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 12–22, 2023

  33. [33]

    Objaverse-xl: A universe of 10m+ 3d objects

    Matt Deitke, Ruoshi Liu, Matthew Wallingford, Huong Ngo, Oscar Michel, Aditya Kusupati, Alan Fan, Christian Laforte, Vikram Voleti, Samir Yitzhak Gadre, et al. Objaverse-xl: A universe of 10m+ 3d objects. Advances in Neural Information Processing Systems, 36:35799–35813, 2023

  34. [34]

    Stereo Magnification: Learning View Synthesis using Multiplane Images

    Tinghui Zhou, Richard Tucker, John Flynn, Graham Fyffe, and Noah Snavely. Stereo magnification: Learning view synthesis using multiplane images. arXiv preprint arXiv:1805.09817, 2018

  35. [35]

    Megadepth: Learning single-view depth prediction from internet photos

    Zhengqi Li and Noah Snavely. Megadepth: Learning single-view depth prediction from internet photos. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 2041–2050, 2018

  36. [36]

    Dl3dv-10k: A large-scale scene dataset for deep learning-based 3d vision

    Lu Ling, Yichen Sheng, Zhi Tu, Wentian Zhao, Cheng Xin, Kun Wan, Lantao Yu, Qianyu Guo, Zixun Yu, Yawen Lu, et al. Dl3dv-10k: A large-scale scene dataset for deep learning-based 3d vision. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22160–22169, 2024

  37. [37]

    3d-front: 3d furnished rooms with layouts and semantics

    Huan Fu, Bowen Cai, Lin Gao, Ling-Xiao Zhang, Jiaming Wang, Cao Li, Qixun Zeng, Chengyue Sun, Rongfei Jia, Binqiang Zhao, et al. 3d-front: 3d furnished rooms with layouts and semantics. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 10933– 10942, 2021

  38. [38]

    Hoi4d: A 4d egocentric dataset for category-level human-object interaction

    Yunze Liu, Yun Liu, Che Jiang, Kangbo Lyu, Weikang Wan, Hao Shen, Boqiang Liang, Zhoujie Fu, He Wang, and Li Yi. Hoi4d: A 4d egocentric dataset for category-level human-object interaction. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21013–21022, 2022

  39. [39]

    Dynamicstereo: Consistent dynamic depth from stereo videos

    Nikita Karaev, Ignacio Rocco, Benjamin Graham, Natalia Neverova, Andrea Vedaldi, and Christian Rupprecht. Dynamicstereo: Consistent dynamic depth from stereo videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13229–13239, 2023

  40. [40]

    Scalability in perception for autonomous driving: Waymo open dataset

    Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou, Yuning Chai, Benjamin Caine, et al. Scalability in perception for autonomous driving: Waymo open dataset. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2446–2454, 2020

  41. [41]

    Virtual KITTI 2

    Yohann Cabon, Naila Murray, and Martin Humenberger. Virtual kitti 2.arXiv preprint arXiv:2001.10773, 2020

  42. [42]

    Tartanair: A dataset to push the limits of visual slam

    Wenshan Wang, Delong Zhu, Xiangwei Wang, Yaoyu Hu, Yuheng Qiu, Chen Wang, Yafei Hu, Ashish Kapoor, and Sebastian Scherer. Tartanair: A dataset to push the limits of visual slam. In 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 4909–4916. IEEE, 2020

  43. [43]

    Bedlam: A synthetic dataset of bodies exhibiting detailed lifelike animated motion

    Michael J Black, Priyanka Patel, Joachim Tesch, and Jinlong Yang. Bedlam: A synthetic dataset of bodies exhibiting detailed lifelike animated motion. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8726–8737, 2023

  44. [44]

    Pointodyssey: A large-scale synthetic dataset for long-term point tracking

    Yang Zheng, Adam W Harley, Bokui Shen, Gordon Wetzstein, and Leonidas J Guibas. Pointodyssey: A large-scale synthetic dataset for long-term point tracking. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 19855–19865, 2023

  45. [45]

    Spatialvid: A large-scale video dataset with spatial annotations

    Jiahao Wang, Yufeng Yuan, Rujie Zheng, Youtian Lin, Jian Gao, Lin-Zhuo Chen, Yajie Bao, Yi Zhang, Chang Zeng, Yanxi Zhou, et al. Spatialvid: A large-scale video dataset with spatial annotations.arXiv preprint arXiv:2509.09676, 2025

  46. [46]

    T3Bench: Benchmarking current progress in text-to-3d generation

    Yuze He, Yushi Bai, Matthieu Lin, Wang Zhao, Yubin Hu, Jenny Sheng, Ran Yi, Juanzi Li, and Yong-Jin Liu. T3 bench: Benchmarking current progress in text-to-3d generation.arXiv preprint arXiv:2310.02977, 2023

  47. [47]

    Magic3d: High-resolution text-to-3d content creation

    Chen-Hsuan Lin, Jun Gao, Luming Tang, Towaki Takikawa, Xiaohui Zeng, Xun Huang, Karsten Kreis, Sanja Fidler, Ming-Yu Liu, and Tsung-Yi Lin. Magic3d: High-resolution text-to-3d content creation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 300–309, 2023

  48. [48]

    Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation

    Haochen Wang, Xiaodan Du, Jiahao Li, Raymond A Yeh, and Greg Shakhnarovich. Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12619–12629, 2023

  49. [49]

    Prolificdreamer: High-fidelity and diverse text-to-3d generation with variational score distillation

    Zhengyi Wang, Cheng Lu, Yikai Wang, Fan Bao, Chongxuan Li, Hang Su, and Jun Zhu. Prolificdreamer: High-fidelity and diverse text-to-3d generation with variational score distillation. Advances in neural information processing systems, 36:8406–8441, 2023

  50. [50]

    Gaussiandreamer: Fast generation from text to 3d gaussians by bridging 2d and 3d diffusion models

    Taoran Yi, Jiemin Fang, Junjie Wang, Guanjun Wu, Lingxi Xie, Xiaopeng Zhang, Wenyu Liu, Qi Tian, and Xinggang Wang. Gaussiandreamer: Fast generation from text to 3d gaussians by bridging 2d and 3d diffusion models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6796–6807, 2024

  51. [51]

    Cog2gen3d: Sculpturing 3d semantic-geometric cognition for 3d generation

    Haonan Wang, Hanyu Zhou, Haoyue Liu, Tao Gu, and Luxin Yan. Cog2gen3d: Sculpturing 3d semantic-geometric cognition for 3d generation.arXiv preprint arXiv:2603.05845, 2026

  52. [52]

    LucidDreamer: Domain-free generation of 3D gaussian splatting scenes

    Jaeyoung Chung, Suyoung Lee, Hyeongjin Nam, Jaerin Lee, and Kyoung Mu Lee. Luciddreamer: Domain-free generation of 3d gaussian splatting scenes. arXiv preprint arXiv:2311.13384, 2023

  53. [53]

    Gen3dsr: Generalizable 3d scene reconstruction via divide and conquer from a single view

    Andreea Ardelean, Mert Özer, and Bernhard Egger. Gen3dsr: Generalizable 3d scene reconstruction via divide and conquer from a single view. In 2025 International Conference on 3D Vision (3DV), pages 616–626. IEEE, 2025

  54. [54]

    Echoscene: Indoor scene generation via information echo over scene graph diffusion

    Guangyao Zhai, Evin Pınar Örnek, Dave Zhenyu Chen, Ruotong Liao, Yan Di, Nassir Navab, Federico Tombari, and Benjamin Busam. Echoscene: Indoor scene generation via information echo over scene graph diffusion. InEuropean Conference on Computer Vision, pages 167–184. Springer, 2024

  55. [55]

    Layout2scene: 3d semantic layout guided scene generation via geometry and appearance diffusion priors

    Minglin Chen, Longguang Wang, Sheng Ao, Ye Zhang, Kai Xu, and Yulan Guo. Layout2scene: 3d semantic layout guided scene generation via geometry and appearance diffusion priors.arXiv preprint arXiv:2501.02519, 2025

  56. [56]

    Vbench: Comprehensive benchmark suite for video generative models

    Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, et al. Vbench: Comprehensive benchmark suite for video generative models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21807–21818, 2024

  57. [57]

    4real: Towards photorealistic 4d scene generation via video diffusion models

    Heng Yu, Chaoyang Wang, Peiye Zhuang, Willi Menapace, Aliaksandr Siarohin, Junli Cao, Laszlo A Jeni, Sergey Tulyakov, and Hsin-Ying Lee. 4real: Towards photorealistic 4d scene generation via video diffusion models. Advances in Neural Information Processing Systems, 37:45256–45280, 2024

  58. [58]

    Genxd: Generating any 3d and 4d scenes

    Yuyang Zhao, Chung-Ching Lin, Kevin Lin, Zhiwen Yan, Linjie Li, Zhengyuan Yang, Jianfeng Wang, Gim Hee Lee, and Lijuan Wang. Genxd: Generating any 3d and 4d scenes.arXiv preprint arXiv:2411.02319, 2024

  59. [59]

    Dimensionx: Create any 3d and 4d scenes from a single image with controllable video diffusion

    Wenqiang Sun, Shuo Chen, Fangfu Liu, Zilong Chen, Yueqi Duan, Jun Zhang, and Yikai Wang. Dimensionx: Create any 3d and 4d scenes from a single image with controllable video diffusion. arXiv preprint arXiv:2411.04928, 2024

  60. [60]

    4dnex: Feed-forward 4d generative modeling made easy

    Zhaoxi Chen, Tianqi Liu, Long Zhuo, Jiawei Ren, Zeng Tao, He Zhu, Fangzhou Hong, Liang Pan, and Ziwei Liu. 4dnex: Feed-forward 4d generative modeling made easy.arXiv preprint arXiv:2508.13154, 2025

  61. [61]

    One4d: Unified 4d generation and reconstruction via decoupled lora control

    Zhenxing Mi, Yuxin Wang, and Dan Xu. One4d: Unified 4d generation and reconstruction via decoupled lora control.arXiv preprint arXiv:2511.18922, 2025

  62. [62]

    The 2017 DAVIS Challenge on Video Object Segmentation

    Jordi Pont-Tuset, Federico Perazzi, Sergi Caelles, Pablo Arbeláez, Alex Sorkine-Hornung, and Luc Van Gool. The 2017 davis challenge on video object segmentation. arXiv preprint arXiv:1704.00675, 2017