pith. machine review for the scientific record.

arxiv: 2605.12957 · v1 · submitted 2026-05-13 · 💻 cs.CV

Recognition: 2 Lean theorem links

GTA: Advancing Image-to-3D World Generation via Geometry Then Appearance Video Diffusion

Cong Wang, Hanxin Zhu, Jiayi Luo, Peiyan Tu, Tianyu He, Xin Jin, Zhibo Chen

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 19:35 UTC · model grok-4.3

classification 💻 cs.CV
keywords image-to-3D generation · video diffusion models · geometry-then-appearance · 3D world generation · cross-view consistency · novel view synthesis · coarse-to-fine generation

The pith

GTA generates 3D worlds from single images by first creating coarse geometry and then synthesizing appearance with separate video diffusion models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces GTA as a method for turning one input image into a 3D world scene. It uses a two-stage process where the first video diffusion model produces coarse geometric structure across novel viewpoints and the second model adds fine appearance details conditioned on that geometry. The separation follows the coarse-to-fine pattern of human vision and targets the structural weaknesses and view inconsistencies that appear when models focus mainly on appearance. A random latent shuffle during training plus a test-time scaling step further stabilize cross-view appearance. Experiments show gains in fidelity, quality, and accuracy over prior image-to-3D approaches, plus the ability to improve other pipelines.
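
A rough mental model of that staged inference, as a minimal sketch: two placeholder denoisers are chained, the first producing coarse geometry latents across novel views and the second denoising appearance latents conditioned on that geometry. The module names, toy sampler, and tensor shapes below are illustrative assumptions, not the paper's architecture.

    # Hedged sketch of the Geometry-Then-Appearance inference flow. Everything
    # here (modules, shapes, the crude sampler) is an assumption for
    # illustration; the paper's actual models and samplers are not specified.
    import torch
    import torch.nn as nn

    class TinyVideoDenoiser(nn.Module):
        """Stand-in for a video diffusion denoiser over (B, C, T, H, W) latents."""
        def __init__(self, in_ch, out_ch):
            super().__init__()
            self.net = nn.Conv3d(in_ch, out_ch, kernel_size=3, padding=1)

        def forward(self, x):
            # A real denoiser would also take a timestep and conditioning inputs.
            return self.net(x)

    @torch.no_grad()
    def generate_3d_world(image_latent, num_views=8, steps=4):
        """Stage 1 predicts coarse geometry for novel views; stage 2 synthesizes
        appearance conditioned on the predicted geometry."""
        B, C, H, W = image_latent.shape
        geom_model = TinyVideoDenoiser(in_ch=C, out_ch=C)
        app_model = TinyVideoDenoiser(in_ch=2 * C, out_ch=C)  # geometry-conditioned

        # Stage 1: iteratively refine noise into coarse geometric structure.
        geometry = torch.randn(B, C, num_views, H, W)
        for _ in range(steps):  # crude stand-in for a diffusion sampling loop
            geometry = geometry - 0.1 * geom_model(geometry)

        # Stage 2: refine appearance latents, conditioned on predicted geometry.
        appearance = torch.randn(B, C, num_views, H, W)
        for _ in range(steps):
            cond = torch.cat([appearance, geometry], dim=1)
            appearance = appearance - 0.1 * app_model(cond)
        return geometry, appearance

    geo, app = generate_3d_world(torch.randn(1, 4, 32, 32))
    print(geo.shape, app.shape)  # both torch.Size([1, 4, 8, 32, 32])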

Core claim

Given a single input image, GTA adopts a two-stage framework with two dedicated video diffusion models, which first generate coarse geometric structure from novel viewpoints and then synthesize fine-grained appearance conditioned on the predicted geometry. To further enhance cross-view appearance consistency, it introduces a random latent shuffle strategy during the training process, along with a test-time scaling scheme that improves perceptual quality without compromising quantitative performance.
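
The abstract names the random latent shuffle but does not define it. One plausible reading, sketched below purely as an assumption, is a shared random permutation of latents along the view axis during training, so the appearance model cannot key on a fixed view ordering and must remain consistent across views.

    # Assumed reading of the "random latent shuffle": permute latents along the
    # view axis, applying the same permutation to any aligned conditioning.
    # The paper's exact recipe may differ.
    import torch

    def shuffle_views(latents, paired):
        """Shuffle a latent video (B, C, T, H, W) along the view/frame axis T,
        applying the identical permutation to paired conditioning tensors."""
        T = latents.shape[2]
        perm = torch.randperm(T)
        return latents[:, :, perm], [p[:, :, perm] for p in paired]

    # Hypothetical use inside a training step: shuffle appearance latents and
    # their geometry conditioning together before the usual diffusion loss.
    lat = torch.randn(2, 4, 8, 16, 16)   # appearance latents
    geo = torch.randn(2, 4, 8, 16, 16)   # predicted-geometry conditioning
    lat_s, (geo_s,) = shuffle_views(lat, [geo])
    print(lat_s.shape, geo_s.shape)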

What carries the argument

Two sequential video diffusion models that first predict coarse geometry from novel views and then generate appearance conditioned on the predicted geometry.

If this is right

  • Synthesized 3D scenes exhibit higher structural fidelity and cross-view consistency.
  • The method outperforms prior image-to-3D approaches on fidelity, visual quality, and geometric accuracy metrics.
  • GTA functions as a plug-in enhancement that raises the output quality of existing image-to-3D pipelines (see the sketch after this list).
  • It supports downstream tasks in spatial intelligence, embodied intelligence, and autonomous driving.
  • Training shows favorable data efficiency compared with single-stage alternatives.
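
To make the plug-in reading concrete: a minimal sketch, with hypothetical interfaces, of wrapping an existing image-to-3D pipeline so its novel views are re-synthesized by a geometry-conditioned appearance stage. Nothing here is the paper's API; the callables are toy stand-ins.

    import torch

    def enhance_with_gta(base_pipeline, geometry_stage, appearance_stage, image):
        """Keep the base pipeline's novel views, but refine their appearance
        conditioned on separately predicted coarse geometry."""
        views = base_pipeline(image)                  # (B, C, T, H, W) novel views
        geometry = geometry_stage(image, views.shape[2])
        return appearance_stage(views, geometry)      # geometry-aware refinement

    # Toy stand-ins so the sketch runs end to end.
    base = lambda img: torch.randn(1, 3, 8, 32, 32)
    geom = lambda img, t: torch.randn(1, 3, t, 32, 32)
    app = lambda v, g: 0.5 * v + 0.5 * g              # placeholder "refinement"
    print(enhance_with_gta(base, geom, app, torch.randn(1, 3, 64, 64)).shape)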

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Future systems could treat geometry and appearance modules as independently upgradable components.
  • The same staged pipeline could be tested on text or video inputs for broader 3D generation.
  • Explicit geometry output may integrate more cleanly with physics simulators or robotics planning.
  • Data-efficiency claims invite direct measurement of training curves against single-stage baselines on fixed compute budgets.

Load-bearing premise

Separating geometry prediction from appearance synthesis in a two-stage video diffusion pipeline will reliably raise structural fidelity and view consistency without creating new inconsistencies.

What would settle it

A controlled experiment in which the two-stage GTA model produces equal or lower geometric accuracy scores and more cross-view inconsistencies than a single unified diffusion baseline on the same benchmark dataset would falsify the central claim.
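
Framed as a runnable check: compare the two designs on the same benchmark with a geometric-accuracy metric and a cross-view consistency metric, and flag falsification when the two-stage model is no better on both. The specific metrics below (absolute-relative depth error, adjacent-view photometric gap) are assumptions standing in for the paper's protocol.

    import numpy as np

    def abs_rel_depth_error(pred, gt):
        """Standard absolute-relative depth error; lower is better."""
        return float(np.mean(np.abs(pred - gt) / np.clip(gt, 1e-6, None)))

    def cross_view_inconsistency(views):
        """Mean photometric gap between adjacent rendered views; lower is better."""
        gaps = [np.mean(np.abs(views[i + 1] - views[i])) for i in range(len(views) - 1)]
        return float(np.mean(gaps))

    def falsified(two_stage, single_stage, gt_depth):
        worse_geometry = (abs_rel_depth_error(two_stage["depth"], gt_depth)
                          >= abs_rel_depth_error(single_stage["depth"], gt_depth))
        worse_consistency = (cross_view_inconsistency(two_stage["views"])
                             >= cross_view_inconsistency(single_stage["views"]))
        return worse_geometry and worse_consistency

    # Synthetic stand-in outputs; real runs would use benchmark renders and depths.
    rng = np.random.default_rng(0)
    gt = rng.uniform(1.0, 10.0, (8, 32, 32))
    mk = lambda s: {"depth": gt + rng.normal(0, s, gt.shape),
                    "views": rng.normal(0, s, (8, 3, 32, 32))}
    print(falsified(mk(0.2), mk(0.5), gt))  # False: the lower-noise model wins on both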

read the original abstract

Recent developments in generative models and large-scale datasets have substantially advanced 3D world generation, facilitating a broad range of domains including spatial intelligence, embodied intelligence, and autonomous driving. While achieving remarkable progress, existing approaches to 3D world generation typically prioritize appearance prediction with limited modeling of the underlying geometry, leading to issues such as unreliable scene structure estimation and degraded cross-view consistency. To address these limitations, motivated by the coarse-to-fine nature of human visual perception, we propose GTA, a novel image-to-3D world generation method following a Geometry-Then-Appearance paradigm. Specifically, given a single input image, to improve the structural fidelity of synthesized 3D scenes, GTA adopts a two-stage framework with two dedicated video diffusion models, which first generate coarse geometric structure from novel viewpoints and then synthesize fine-grained appearance conditioned on the predicted geometry. To further enhance cross-view appearance consistency, we introduce a random latent shuffle strategy during the training process, along with a test-time scaling scheme that improves perceptual quality without compromising quantitative performance. Extensive experiments have demonstrated that our proposed method consistently outperforms existing approaches in terms of fidelity, visual quality, and geometric accuracy. Moreover, GTA is shown to be effective as a general enhancement module that further improves the generation quality of existing image-to-3D world pipelines, as well as supporting multiple downstream applications and exhibiting favorable data efficiency during model training, highlighting its versatility and broad applicability. Project page: https://hanxinzhu-lab.github.io/GTA/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes GTA, a two-stage image-to-3D world generation framework using dedicated video diffusion models: the first generates coarse geometric structure from novel viewpoints given a single input image, and the second synthesizes fine-grained appearance conditioned on the predicted geometry. It introduces a random latent shuffle strategy during training for cross-view consistency and a test-time scaling scheme, claiming consistent outperformance over prior methods in fidelity, visual quality, and geometric accuracy, plus utility as a general enhancement module for existing pipelines and support for downstream tasks.

Significance. If the two-stage separation reliably improves structural fidelity without error propagation, the work could advance 3D generation by better aligning with human visual perception principles, offering a versatile plug-in that enhances multiple image-to-3D pipelines while maintaining data efficiency.

major comments (2)
  1. [Experiments] Experiments section: the central claim of improved geometric accuracy and structural fidelity rests on final rendered metrics, but no independent evaluation of the geometry stage (e.g., per-view depth error, multi-view consistency scores, or 3D point cloud alignment) is reported, leaving the error-propagation risk from the first diffusion model untested; a minimal sketch of such standalone metrics follows the comment lists.
  2. [Method] Method section (two-stage framework description): without ablations that disable the geometry stage while holding other components fixed, it remains unclear whether the coarse-to-fine separation mitigates inconsistencies or whether conditioning the appearance model on imperfect geometry predictions introduces new view-dependent artifacts.
minor comments (2)
  1. [Abstract] Abstract: the assertion of consistent outperformance lacks any mention of specific metrics, baselines, or quantitative gains, which should be summarized briefly for immediate clarity.
  2. [Figures] Figure captions and notation: the distinction between the two video diffusion models could be labeled more explicitly in diagrams to improve readability of the pipeline.
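
The standalone geometry metrics named in major comment 1 are standard formulations; a minimal sketch of two of them, per-view depth error and symmetric Chamfer distance for point-cloud alignment, is given below. This is generic evaluation code, not the paper's protocol.

    import numpy as np

    def per_view_depth_error(pred_depths, gt_depths):
        """Mean absolute depth error, reported separately per novel view."""
        return [float(np.mean(np.abs(p - g))) for p, g in zip(pred_depths, gt_depths)]

    def chamfer_distance(a, b):
        """Symmetric Chamfer distance between point clouds a (N, 3) and b (M, 3)."""
        d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)  # (N, M) pairwise
        return float(d.min(axis=1).mean() + d.min(axis=0).mean())

    rng = np.random.default_rng(0)
    gt_d = [rng.uniform(1.0, 5.0, (16, 16)) for _ in range(4)]
    pred_d = [d + rng.normal(0, 0.1, d.shape) for d in gt_d]
    print(per_view_depth_error(pred_d, gt_d))
    print(chamfer_distance(rng.normal(size=(256, 3)), rng.normal(size=(256, 3))))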

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below and describe the revisions we will make to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Experiments] Experiments section: the central claim of improved geometric accuracy and structural fidelity rests on final rendered metrics, but no independent evaluation of the geometry stage (e.g., per-view depth error, multi-view consistency scores, or 3D point cloud alignment) is reported, leaving the error-propagation risk from the first diffusion model untested.

    Authors: We appreciate the referee highlighting this point. While our reported metrics on final renderings (including depth-related and consistency measures) already reflect the downstream impact of the geometry stage, we agree that standalone evaluation of the geometry predictions would more directly address error-propagation concerns. In the revised manuscript we will add per-view depth error, multi-view consistency scores, and 3D point-cloud alignment metrics computed on the outputs of the first-stage geometry model. revision: yes

  2. Referee: [Method] Method section (two-stage framework description): without ablations that disable the geometry stage while holding other components fixed, it remains unclear whether the coarse-to-fine separation mitigates inconsistencies or whether conditioning the appearance model on imperfect geometry predictions introduces new view-dependent artifacts.

    Authors: We acknowledge that the current ablation suite does not include a controlled comparison that isolates the geometry stage. To clarify the contribution of the Geometry-Then-Appearance separation, we will add new experiments in the revision that (i) disable the geometry stage and generate appearance directly from the input image and (ii) condition the appearance model on ground-truth geometry when available. These ablations will quantify both the benefit of the two-stage design and any view-dependent artifacts arising from imperfect geometry predictions. revision: yes
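
The promised ablation reduces to three conditions pushed through the same appearance stage, renderer, and metrics. The sketch below fixes only that comparison structure; every component is a toy stand-in, not the authors' code.

    import torch

    # Toy appearance stage: with no geometry it passes the input through;
    # otherwise it blends in the conditioning. Real models are stand-ins here.
    app_stage = lambda img, g: img if g is None else 0.5 * img + 0.5 * g

    img = torch.randn(1, 3, 8, 32, 32)
    pred_geo, gt_geo = torch.randn_like(img), torch.randn_like(img)

    variants = {
        "no_geometry": None,             # (i) appearance directly from the image
        "predicted_geometry": pred_geo,  # (ii) the full two-stage GTA setting
        "oracle_geometry": gt_geo,       # (iii) upper bound with ground truth
    }
    for name, geo in variants.items():
        out = app_stage(img, geo)
        print(name, out.shape)  # identical metrics would then be applied to each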

Circularity Check

0 steps flagged

No significant circularity in the proposed Geometry-Then-Appearance framework

full rationale

The paper proposes an independent two-stage architectural design consisting of separate video diffusion models for coarse geometry generation followed by appearance synthesis conditioned on the geometry output. This choice is explicitly motivated by the coarse-to-fine structure of human visual perception rather than derived from any equations, fitted parameters, or self-citations that reduce the claims to inputs by construction. No load-bearing uniqueness theorems, ansatzes smuggled via prior work, or renamings of known results appear in the provided text. The central claims rest on empirical outperformance and versatility as a plug-in module, with no self-referential reductions in the derivation chain. This matches the expected honest non-finding for a standard method proposal.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The approach rests on standard assumptions from diffusion model literature and the coarse-to-fine perception hypothesis; no new entities or fitted parameters are explicitly introduced in the abstract.

axioms (2)
  • domain assumption: Human visual perception processes scenes in a coarse-to-fine manner.
    Directly motivates the Geometry-Then-Appearance two-stage design.
  • domain assumption: Video diffusion models can produce consistent novel-view geometry and appearance.
    Underpins the choice of dedicated diffusion models for each stage.

pith-pipeline@v0.9.0 · 5589 in / 1281 out tokens · 71352 ms · 2026-05-14T19:35:05.350881+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

87 extracted references · 87 canonical work pages · 8 internal anchors

Showing first 80 references.