pith. machine review for the scientific record.

arxiv: 2605.12957 · v1 · submitted 2026-05-13 · 💻 cs.CV

Recognition: 2 Lean theorem links

GTA: Advancing Image-to-3D World Generation via Geometry Then Appearance Video Diffusion

Cong Wang, Hanxin Zhu, Jiayi Luo, Peiyan Tu, Tianyu He, Xin Jin, Zhibo Chen

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 19:35 UTC · model grok-4.3

classification 💻 cs.CV
keywords image-to-3D generation · video diffusion models · geometry-then-appearance · 3D world generation · cross-view consistency · novel view synthesis · coarse-to-fine generation

The pith

GTA generates 3D worlds from single images by first creating coarse geometry and then synthesizing appearance with separate video diffusion models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces GTA as a method for turning one input image into a 3D world scene. It uses a two-stage process where the first video diffusion model produces coarse geometric structure across novel viewpoints and the second model adds fine appearance details conditioned on that geometry. The separation follows the coarse-to-fine pattern of human vision and targets the structural weaknesses and view inconsistencies that appear when models focus mainly on appearance. A random latent shuffle during training plus a test-time scaling step further stabilize cross-view appearance. Experiments show gains in fidelity, quality, and accuracy over prior image-to-3D approaches, plus the ability to improve other pipelines.
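
A rough mental model of that staged inference, as a minimal sketch: two placeholder denoisers are chained, the first producing coarse geometry latents across novel views and the second denoising appearance latents conditioned on that geometry. The module names, toy sampler, and tensor shapes below are illustrative assumptions, not the paper's architecture.

    # Hedged sketch of the Geometry-Then-Appearance inference flow. Everything
    # here (modules, shapes, the crude sampler) is an assumption for
    # illustration; the paper's actual models and samplers are not specified.
    import torch
    import torch.nn as nn

    class TinyVideoDenoiser(nn.Module):
        """Stand-in for a video diffusion denoiser over (B, C, T, H, W) latents."""
        def __init__(self, in_ch, out_ch):
            super().__init__()
            self.net = nn.Conv3d(in_ch, out_ch, kernel_size=3, padding=1)

        def forward(self, x):
            # A real denoiser would also take a timestep and conditioning inputs.
            return self.net(x)

    @torch.no_grad()
    def generate_3d_world(image_latent, num_views=8, steps=4):
        """Stage 1 predicts coarse geometry for novel views; stage 2 synthesizes
        appearance conditioned on the predicted geometry."""
        B, C, H, W = image_latent.shape
        geom_model = TinyVideoDenoiser(in_ch=C, out_ch=C)
        app_model = TinyVideoDenoiser(in_ch=2 * C, out_ch=C)  # geometry-conditioned

        # Stage 1: iteratively refine noise into coarse geometric structure.
        geometry = torch.randn(B, C, num_views, H, W)
        for _ in range(steps):  # crude stand-in for a diffusion sampling loop
            geometry = geometry - 0.1 * geom_model(geometry)

        # Stage 2: refine appearance latents, conditioned on predicted geometry.
        appearance = torch.randn(B, C, num_views, H, W)
        for _ in range(steps):
            cond = torch.cat([appearance, geometry], dim=1)
            appearance = appearance - 0.1 * app_model(cond)
        return geometry, appearance

    geo, app = generate_3d_world(torch.randn(1, 4, 32, 32))
    print(geo.shape, app.shape)  # both torch.Size([1, 4, 8, 32, 32])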

Core claim

Given a single input image, GTA adopts a two-stage framework with two dedicated video diffusion models, which first generate coarse geometric structure from novel viewpoints and then synthesize fine-grained appearance conditioned on the predicted geometry. To further enhance cross-view appearance consistency, it introduces a random latent shuffle strategy during the training process, along with a test-time scaling scheme that improves perceptual quality without compromising quantitative performance.
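
The abstract names the random latent shuffle but does not define it. One plausible reading, sketched below purely as an assumption, is a shared random permutation of latents along the view axis during training, so the appearance model cannot key on a fixed view ordering and must remain consistent across views.

    # Assumed reading of the "random latent shuffle": permute latents along the
    # view axis, applying the same permutation to any aligned conditioning.
    # The paper's exact recipe may differ.
    import torch

    def shuffle_views(latents, paired):
        """Shuffle a latent video (B, C, T, H, W) along the view/frame axis T,
        applying the identical permutation to paired conditioning tensors."""
        T = latents.shape[2]
        perm = torch.randperm(T)
        return latents[:, :, perm], [p[:, :, perm] for p in paired]

    # Hypothetical use inside a training step: shuffle appearance latents and
    # their geometry conditioning together before the usual diffusion loss.
    lat = torch.randn(2, 4, 8, 16, 16)   # appearance latents
    geo = torch.randn(2, 4, 8, 16, 16)   # predicted-geometry conditioning
    lat_s, (geo_s,) = shuffle_views(lat, [geo])
    print(lat_s.shape, geo_s.shape)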

What carries the argument

Two sequential video diffusion models that first predict coarse geometry from novel views and then generate appearance conditioned on the predicted geometry.

If this is right

  • Synthesized 3D scenes exhibit higher structural fidelity and cross-view consistency.
  • The method outperforms prior image-to-3D approaches on fidelity, visual quality, and geometric accuracy metrics.
  • GTA functions as a plug-in enhancement that raises the output quality of existing image-to-3D pipelines (see the sketch after this list).
  • It supports downstream tasks in spatial intelligence, embodied intelligence, and autonomous driving.
  • Training shows favorable data efficiency compared with single-stage alternatives.
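
To make the plug-in reading concrete: a minimal sketch, with hypothetical interfaces, of wrapping an existing image-to-3D pipeline so its novel views are re-synthesized by a geometry-conditioned appearance stage. Nothing here is the paper's API; the callables are toy stand-ins.

    import torch

    def enhance_with_gta(base_pipeline, geometry_stage, appearance_stage, image):
        """Keep the base pipeline's novel views, but refine their appearance
        conditioned on separately predicted coarse geometry."""
        views = base_pipeline(image)                  # (B, C, T, H, W) novel views
        geometry = geometry_stage(image, views.shape[2])
        return appearance_stage(views, geometry)      # geometry-aware refinement

    # Toy stand-ins so the sketch runs end to end.
    base = lambda img: torch.randn(1, 3, 8, 32, 32)
    geom = lambda img, t: torch.randn(1, 3, t, 32, 32)
    app = lambda v, g: 0.5 * v + 0.5 * g              # placeholder "refinement"
    print(enhance_with_gta(base, geom, app, torch.randn(1, 3, 64, 64)).shape)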

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Future systems could treat geometry and appearance modules as independently upgradable components.
  • The same staged pipeline could be tested on text or video inputs for broader 3D generation.
  • Explicit geometry output may integrate more cleanly with physics simulators or robotics planning.
  • Data-efficiency claims invite direct measurement of training curves against single-stage baselines on fixed compute budgets.

Load-bearing premise

Separating geometry prediction from appearance synthesis in a two-stage video diffusion pipeline will reliably raise structural fidelity and view consistency without creating new inconsistencies.

What would settle it

A controlled experiment in which the two-stage GTA model produces equal or lower geometric accuracy scores and more cross-view inconsistencies than a single unified diffusion baseline on the same benchmark dataset would falsify the central claim.
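
Framed as a runnable check: compare the two designs on the same benchmark with a geometric-accuracy metric and a cross-view consistency metric, and flag falsification when the two-stage model is no better on both. The specific metrics below (absolute-relative depth error, adjacent-view photometric gap) are assumptions standing in for the paper's protocol.

    import numpy as np

    def abs_rel_depth_error(pred, gt):
        """Standard absolute-relative depth error; lower is better."""
        return float(np.mean(np.abs(pred - gt) / np.clip(gt, 1e-6, None)))

    def cross_view_inconsistency(views):
        """Mean photometric gap between adjacent rendered views; lower is better."""
        gaps = [np.mean(np.abs(views[i + 1] - views[i])) for i in range(len(views) - 1)]
        return float(np.mean(gaps))

    def falsified(two_stage, single_stage, gt_depth):
        worse_geometry = (abs_rel_depth_error(two_stage["depth"], gt_depth)
                          >= abs_rel_depth_error(single_stage["depth"], gt_depth))
        worse_consistency = (cross_view_inconsistency(two_stage["views"])
                             >= cross_view_inconsistency(single_stage["views"]))
        return worse_geometry and worse_consistency

    # Synthetic stand-in outputs; real runs would use benchmark renders and depths.
    rng = np.random.default_rng(0)
    gt = rng.uniform(1.0, 10.0, (8, 32, 32))
    mk = lambda s: {"depth": gt + rng.normal(0, s, gt.shape),
                    "views": rng.normal(0, s, (8, 3, 32, 32))}
    print(falsified(mk(0.2), mk(0.5), gt))  # False: the lower-noise model wins on both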

read the original abstract

Recent developments in generative models and large-scale datasets have substantially advanced 3D world generation, facilitating a broad range of domains including spatial intelligence, embodied intelligence, and autonomous driving. While achieving remarkable progress, existing approaches to 3D world generation typically prioritize appearance prediction with limited modeling of the underlying geometry, leading to issues such as unreliable scene structure estimation and degraded cross-view consistency. To address these limitations, motivated by the coarse-to-fine nature of human visual perception, we propose GTA, a novel image-to-3D world generation method following a Geometry-Then-Appearance paradigm. Specifically, given a single input image, to improve the structural fidelity of synthesized 3D scenes, GTA adopts a two-stage framework with two dedicated video diffusion models, which first generate coarse geometric structure from novel viewpoints and then synthesize fine-grained appearance conditioned on the predicted geometry. To further enhance cross-view appearance consistency, we introduce a random latent shuffle strategy during the training process, along with a test-time scaling scheme that improves perceptual quality without compromising quantitative performance. Extensive experiments have demonstrated that our proposed method consistently outperforms existing approaches in terms of fidelity, visual quality, and geometric accuracy. Moreover, GTA is shown to be effective as a general enhancement module that further improves the generation quality of existing image-to-3D world pipelines, as well as supporting multiple downstream applications and exhibiting favorable data efficiency during model training, highlighting its versatility and broad applicability. Project page: https://hanxinzhu-lab.github.io/GTA/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes GTA, a two-stage image-to-3D world generation framework using dedicated video diffusion models: the first generates coarse geometric structure from novel viewpoints given a single input image, and the second synthesizes fine-grained appearance conditioned on the predicted geometry. It introduces a random latent shuffle strategy during training for cross-view consistency and a test-time scaling scheme, claiming consistent outperformance over prior methods in fidelity, visual quality, and geometric accuracy, plus utility as a general enhancement module for existing pipelines and support for downstream tasks.

Significance. If the two-stage separation reliably improves structural fidelity without error propagation, the work could advance 3D generation by better aligning with human visual perception principles, offering a versatile plug-in that enhances multiple image-to-3D pipelines while maintaining data efficiency.

major comments (2)
  1. [Experiments] Experiments section: the central claim of improved geometric accuracy and structural fidelity rests on final rendered metrics, but no independent evaluation of the geometry stage (e.g., per-view depth error, multi-view consistency scores, or 3D point cloud alignment) is reported, leaving the error-propagation risk from the first diffusion model untested; a minimal sketch of such standalone metrics follows the comment lists.
  2. [Method] Method section (two-stage framework description): without ablations that disable the geometry stage while holding other components fixed, it remains unclear whether the coarse-to-fine separation mitigates inconsistencies or whether conditioning the appearance model on imperfect geometry predictions introduces new view-dependent artifacts.
minor comments (2)
  1. [Abstract] Abstract: the assertion of consistent outperformance lacks any mention of specific metrics, baselines, or quantitative gains, which should be summarized briefly for immediate clarity.
  2. [Figures] Figure captions and notation: the distinction between the two video diffusion models could be labeled more explicitly in diagrams to improve readability of the pipeline.
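
The standalone geometry metrics named in major comment 1 are standard formulations; a minimal sketch of two of them, per-view depth error and symmetric Chamfer distance for point-cloud alignment, is given below. This is generic evaluation code, not the paper's protocol.

    import numpy as np

    def per_view_depth_error(pred_depths, gt_depths):
        """Mean absolute depth error, reported separately per novel view."""
        return [float(np.mean(np.abs(p - g))) for p, g in zip(pred_depths, gt_depths)]

    def chamfer_distance(a, b):
        """Symmetric Chamfer distance between point clouds a (N, 3) and b (M, 3)."""
        d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)  # (N, M) pairwise
        return float(d.min(axis=1).mean() + d.min(axis=0).mean())

    rng = np.random.default_rng(0)
    gt_d = [rng.uniform(1.0, 5.0, (16, 16)) for _ in range(4)]
    pred_d = [d + rng.normal(0, 0.1, d.shape) for d in gt_d]
    print(per_view_depth_error(pred_d, gt_d))
    print(chamfer_distance(rng.normal(size=(256, 3)), rng.normal(size=(256, 3))))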

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below and describe the revisions we will make to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Experiments] Experiments section: the central claim of improved geometric accuracy and structural fidelity rests on final rendered metrics, but no independent evaluation of the geometry stage (e.g., per-view depth error, multi-view consistency scores, or 3D point cloud alignment) is reported, leaving the error-propagation risk from the first diffusion model untested.

    Authors: We appreciate the referee highlighting this point. While our reported metrics on final renderings (including depth-related and consistency measures) already reflect the downstream impact of the geometry stage, we agree that standalone evaluation of the geometry predictions would more directly address error-propagation concerns. In the revised manuscript we will add per-view depth error, multi-view consistency scores, and 3D point-cloud alignment metrics computed on the outputs of the first-stage geometry model. revision: yes

  2. Referee: [Method] Method section (two-stage framework description): without ablations that disable the geometry stage while holding other components fixed, it remains unclear whether the coarse-to-fine separation mitigates inconsistencies or whether conditioning the appearance model on imperfect geometry predictions introduces new view-dependent artifacts.

    Authors: We acknowledge that the current ablation suite does not include a controlled comparison that isolates the geometry stage. To clarify the contribution of the Geometry-Then-Appearance separation, we will add new experiments in the revision that (i) disable the geometry stage and generate appearance directly from the input image and (ii) condition the appearance model on ground-truth geometry when available. These ablations will quantify both the benefit of the two-stage design and any view-dependent artifacts arising from imperfect geometry predictions. revision: yes
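
The promised ablation reduces to three conditions pushed through the same appearance stage, renderer, and metrics. The sketch below fixes only that comparison structure; every component is a toy stand-in, not the authors' code.

    import torch

    # Toy appearance stage: with no geometry it passes the input through;
    # otherwise it blends in the conditioning. Real models are stand-ins here.
    app_stage = lambda img, g: img if g is None else 0.5 * img + 0.5 * g

    img = torch.randn(1, 3, 8, 32, 32)
    pred_geo, gt_geo = torch.randn_like(img), torch.randn_like(img)

    variants = {
        "no_geometry": None,             # (i) appearance directly from the image
        "predicted_geometry": pred_geo,  # (ii) the full two-stage GTA setting
        "oracle_geometry": gt_geo,       # (iii) upper bound with ground truth
    }
    for name, geo in variants.items():
        out = app_stage(img, geo)
        print(name, out.shape)  # identical metrics would then be applied to each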

Circularity Check

0 steps flagged

No significant circularity in the proposed Geometry-Then-Appearance framework

full rationale

The paper proposes an independent two-stage architectural design consisting of separate video diffusion models for coarse geometry generation followed by appearance synthesis conditioned on the geometry output. This choice is explicitly motivated by the coarse-to-fine structure of human visual perception rather than derived from any equations, fitted parameters, or self-citations that reduce the claims to inputs by construction. No load-bearing uniqueness theorems, ansatzes smuggled via prior work, or renamings of known results appear in the provided text. The central claims rest on empirical outperformance and versatility as a plug-in module, with no self-referential reductions in the derivation chain. This matches the expected honest non-finding for a standard method proposal.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The approach rests on standard assumptions from diffusion model literature and the coarse-to-fine perception hypothesis; no new entities or fitted parameters are explicitly introduced in the abstract.

axioms (2)
  • domain assumption: Human visual perception processes scenes in a coarse-to-fine manner.
    Directly motivates the Geometry-Then-Appearance two-stage design.
  • domain assumption: Video diffusion models can produce consistent novel-view geometry and appearance.
    Underpins the choice of dedicated diffusion models for each stage.

pith-pipeline@v0.9.0 · 5589 in / 1281 out tokens · 71352 ms · 2026-05-14T19:35:05.350881+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

87 extracted references · 87 canonical work pages · 8 internal anchors

Showing first 80 references.