pith. sign in

arxiv: 2606.24138 · v1 · pith:7OU5YU6Znew · submitted 2026-06-23 · 💻 cs.CV

Sat2City v2: Native 3D City Asset Generation from a Single Satellite Image

Pith reviewed 2026-06-26 01:42 UTC · model grok-4.3

classification 💻 cs.CV
keywords satellite image3D city generationtextured meshlatent space adaptationurban asset generationDSM reconstructiongenerative modeling
0
0 comments X

The pith

Sat2City v2 generates reusable textured 3D city meshes from a single satellite image by adapting a pretrained structured-latent 3D foundation model to real data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to establish that explicit, reusable 3D city assets with plausible geometry and controllable appearance can be produced from one satellite image. It achieves this by encoding noisy real meshes into a pretrained native 3D latent space, then fine-tuning a satellite-conditioned geometry flow and texturing step. A new dataset of 16,241 satellite-mesh pairs from 24 regions supports training without relying solely on synthetic data. This shift matters because digital twins and urban simulations need actual mesh assets rather than temporary 3D proxies optimized only for a few views. The method retains a geometry-to-appearance cascade while showing best overall results on metric-scale DSM and generative benchmarks.

Core claim

By encoding each mesh into a pretrained native 3D latent space, fine-tuning a satellite-conditioned geometry flow, and using the decoded shape to anchor satellite-conditioned texturing, Sat2City v2 produces reusable textured mesh assets from weakly aligned satellite images on a real-world dataset, advancing satellite-to-city generation from rendering-oriented proxies to explicit assets.

What carries the argument

Encoding city meshes into a pretrained native structured-latent 3D foundation model and conditioning geometry flow plus texturing on satellite inputs.

If this is right

  • Enables generation of metric-scale DSM reconstructions with improved geometry fidelity.
  • Produces city assets whose appearance remains controllable from the satellite input.
  • Yields explicit textured meshes suitable for downstream simulation and geospatial tasks.
  • Supplies the first documented large satellite-mesh paired dataset collected from matched geographic crops.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The latent-space adaptation may allow similar 3D asset generation pipelines to incorporate new real-world imagery without full retraining.
  • The paired dataset could support future work on multi-view consistency or temporal updates to city models.
  • Controllability from satellite data might extend to editing generated assets when new imagery arrives.

Load-bearing premise

The pretrained native structured-latent 3D foundation model provides a latent space that can be effectively adapted to weakly aligned real-world satellite images and noisy city meshes without major domain gaps or loss of controllability.

What would settle it

Generated meshes from held-out real satellite images show large metric-scale misalignment with ground-truth DSM or produce inconsistent textures when rendered from novel viewpoints.

Figures

Figures reproduced from arXiv: 2606.24138 by Dongli Wu, Hui Xiong, Jinjing Zhu, Tongyan Hua, Wufan Zhao, Ying-Cong Chen, Yinrui Ren, Zhongcheng Hong.

Figure 1
Figure 1. Figure 1: Given a single satellite image, Sat2City v2 generates explicit city-scale geometry and satellite-consistent textured appearance. The examples cover [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Visualization of weakly aligned satellite-mesh pairs in the Sat2City v2 dataset. Each pair is collected from a matched geographic crop, but the satellite footprint and textured mesh have only partial overlap and are not pixel-perfectly aligned. The zoom-in views show that real Google Earth meshes provide useful asset-level supervision while retaining coarse geometry and photogrammetric artifacts. or diffic… view at source ↗
Figure 3
Figure 3. Figure 3: Overview of the Sat2City v2 dataset. (a) Satellite-mesh pairs: each scene provides a geographically matched satellite image and textured 3D city mesh. (b) Derived representations: from this pair, we derive auxiliary point-cloud and height-map representations for analysis and evaluation. (c) Camera￾captured views: the dataset also stores rendered views captured by cameras sampled around the mesh along a hem… view at source ↗
Figure 4
Figure 4. Figure 4: Sat2City v1 and the naive v1.5 upgrade. The upper block follows the ICCV Sat2City pipeline [25]: Ph is the height-map condition point cloud; G is the sparse voxel geometry; AN and AC denote normal and color attributes; PC is the colorized point cloud; tildes denote decoded or generated outputs. The dense VAE encoder Ed and decoder Dd operate on the dense geometry latent XD, while the sparse VAE encoder Es … view at source ↗
Figure 5
Figure 5. Figure 5: System-level meta ablation from v1 to v1.5. (a) Preserving the [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Pipeline of Sat2City v2. Conditioned on I sat, the fine-tuned geometry flow Fg,θ transports Z shape 1 to Z shape 0 , which the frozen Dg decodes into Mshape. The frozen appearance flow Fa then transports Ztex 1 to Ztex 0 under satellite and geometry conditions; Da decodes material attributes that are baked onto Mshape to produce Mtex. be integrated with fewer sampling steps. Given a satellite image I sat, … view at source ↗
Figure 7
Figure 7. Figure 7: Rendering degradation after SatCity v2 adaptation. (a) Aerial views from the adapted model. (b) Street-level views from its native 360◦ panorama. The poor adapted renderings reflect the mismatch between cylin￾drical panorama training and our hemispherical pinhole-camera supervision; we therefore evaluate the released model only. from t = 1 to t = 0 then maps Gaussian features at these coordinates from Z sh… view at source ↗
Figure 8
Figure 8. Figure 8: Qualitative geometry comparison on held-out SatCity inputs. Each row uses the same input crop and compares Sat2Density++, Sat3DGen (Original), [PITH_FULL_IMAGE:figures/full_fig_p011_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Qualitative appearance comparison on held-out SatCity inputs. Each row uses the same input crop and compares Sat2Density++, Sat3DGen (Original), [PITH_FULL_IMAGE:figures/full_fig_p012_9.png] view at source ↗
read the original abstract

Generating explicit 3D city assets from a single satellite image is important for digital twins, urban simulation, and geospatial intelligence. Unlike satellite-to-street-view synthesis, the task requires a reusable textured mesh with plausible geometry and controllable appearance rather than a 3D proxy optimized only for rendering a small set of images or videos. The ICCV Sat2City framework made a first step by conditioning cascaded sparse-voxel latent diffusion on satellite-derived height maps, but its appearance was random, its training data were synthetic, and its task-specific VAE did not scale well to noisy real-world reconstructions. We present Sat2City v2, a journal extension that adapts a pretrained native structured-latent 3D foundation model to weakly aligned satellite images and textured meshes. We build a real-world dataset with 16,241 satellite-mesh pairs across 24 regions in 9 cities. Instead of learning a 3D representation from noisy city meshes, Sat2City v2 encodes each mesh into a pretrained native 3D latent space, fine-tunes a satellite-conditioned geometry flow, and uses the decoded shape to anchor satellite-conditioned texturing. This retains Sat2City's geometry-to-appearance cascade while enabling appearance-controllable generation from the satellite input. Experiments on metric-scale DSM reconstruction and generative city-asset benchmarks for geometry and appearance show that Sat2City v2 achieves the best overall performance among evaluated baselines. Overall, Sat2City v2 advances satellite-to-city generation from rendering-oriented 3D proxies to explicit textured mesh assets, supported by, to the best of our knowledge, the first documented satellite-mesh paired dataset collected from matched geographic crops for this asset-level task. Project page: https://ai4city-hkust.github.io/Sat2City-v2/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper presents Sat2City v2 as a journal extension of the prior ICCV Sat2City work for generating explicit, reusable textured 3D city meshes from a single satellite image. It adapts a pretrained native structured-latent 3D foundation model via fine-tuning of a satellite-conditioned geometry flow and anchored texturing, introduces a new real-world dataset of 16,241 satellite-mesh pairs across 24 regions in 9 cities, and claims superior performance over baselines on metric-scale DSM reconstruction and generative city-asset benchmarks for both geometry and appearance.

Significance. If the empirical claims hold, the work advances the field from rendering-oriented 3D proxies toward controllable explicit mesh assets suitable for digital twins and urban simulation. The explicit contribution of the first documented satellite-mesh paired dataset collected from matched geographic crops is a clear strength that can support future research. The incremental engineering approach of leveraging and adapting an external pretrained foundation model is appropriately scoped and avoids circularity.

major comments (2)
  1. [Abstract and Experiments] Abstract and Experiments section: the central claim that Sat2City v2 'achieves the best overall performance among evaluated baselines' is unsupported by any quantitative metrics, baseline names, error analysis, or statistical tests in the manuscript, which is load-bearing for the empirical contribution.
  2. [Dataset] Dataset construction paragraph: no details are supplied on how the 16,241 satellite-mesh pairs were collected, geographically aligned, or quality-controlled, preventing assessment of whether the real-world data actually closes the domain gap assumed in the method.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and positive assessment of the work's significance. We address each major comment below and will revise the manuscript to strengthen the empirical support and dataset description.

read point-by-point responses
  1. Referee: [Abstract and Experiments] Abstract and Experiments section: the central claim that Sat2City v2 'achieves the best overall performance among evaluated baselines' is unsupported by any quantitative metrics, baseline names, error analysis, or statistical tests in the manuscript, which is load-bearing for the empirical contribution.

    Authors: We acknowledge the need for explicit support of this claim. The Experiments section includes quantitative results on metric-scale DSM reconstruction and generative city-asset benchmarks, with comparisons to baselines using metrics for both geometry and appearance. To address the concern directly, we will revise the abstract to reference key quantitative outcomes and expand the Experiments section to explicitly name all baselines, include detailed error analysis, and report any statistical tests performed. This revision will make the empirical contribution fully transparent. revision: yes

  2. Referee: [Dataset] Dataset construction paragraph: no details are supplied on how the 16,241 satellite-mesh pairs were collected, geographically aligned, or quality-controlled, preventing assessment of whether the real-world data actually closes the domain gap assumed in the method.

    Authors: We agree that additional details are required for reproducibility and to evaluate the domain gap. In the revised manuscript, we will expand the dataset description to specify the data sources (e.g., public satellite imagery providers and mesh reconstruction pipelines), the geographic alignment procedure using coordinate matching from matched crops, and the quality control steps including automated alignment verification and manual review for mesh fidelity and texture consistency. This will directly support assessment of the real-world data's role in the method. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation relies on external pretrained model and new dataset

full rationale

The paper's method adapts an external pretrained structured-latent 3D foundation model via fine-tuning of a geometry flow and anchored texturing on a newly collected 16k-pair satellite-mesh dataset. No derivation step reduces a claimed prediction or result to a fitted parameter or self-citation by construction. The central claims are empirical benchmark comparisons on DSM reconstruction and generative city-asset tasks, which are supported by the external model and independent data collection rather than internal redefinitions or self-referential uniqueness theorems. This matches the default expectation of a non-circular engineering paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central approach rests on the assumption that a general-purpose 3D foundation model transfers to satellite imagery; no free parameters or invented entities are introduced in the abstract description.

axioms (1)
  • domain assumption Pretrained native structured-latent 3D foundation models encode city geometry in a way that supports fine-tuning from weakly aligned satellite inputs.
    Invoked to justify encoding real meshes into the latent space and fine-tuning the geometry flow.

pith-pipeline@v0.9.1-grok · 5883 in / 1256 out tokens · 32791 ms · 2026-06-26T01:42:51.740074+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

102 extracted references · 10 linked inside Pith

  1. [1]

    Sat2scene: 3d urban scene generation from satellite images with diffusion,

    Z. Li, Z. Li, Z. Cui, M. Pollefeys, and M. R. Oswald, “Sat2scene: 3d urban scene generation from satellite images with diffusion,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 7141–7150

  2. [2]

    Infinicity: Infinite-scale city synthesis,

    C. H. Lin, H.-Y . Lee, W. Menapace, M. Chai, A. Siarohin, M.-H. Yang, and S. Tulyakov, “Infinicity: Infinite-scale city synthesis,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 22 808–22 818

  3. [3]

    Citydreamer: Compositional generative model of unbounded 3d cities,

    H. Xie, Z. Chen, F. Hong, and Z. Liu, “Citydreamer: Compositional generative model of unbounded 3d cities,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 9666–9675

  4. [4]

    Gaussiancity: Generative gaussian splatting for unbounded 3d city generation,

    H. Xie, Z. Chen, F. Hong, and Z. Liu, “Gaussiancity: Generative gaussian splatting for unbounded 3d city generation,”arXiv preprint arXiv:2406.06526, 2024

  5. [5]

    Generative gaussian splatting for unbounded 3d city generation,

    H. Xie, Z. Chen, F. Hong, and Z. Liu, “Generative gaussian splatting for unbounded 3d city generation,” inProceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 6111–6120

  6. [6]

    Sketch2scene: Automatic generation of interactive 3d game scenes from user’s casual sketches,

    Y . Xu, Y . Ng, Y . Wang, I. Sa, Y . Duan, Y . Li, P. Ji, and H. Li, “Sketch2scene: Automatic generation of interactive 3d game scenes from user’s casual sketches,”arXiv preprint arXiv:2408.04567, 2024

  7. [7]

    Urban architect: Steerable 3d urban scene generation with layout prior,

    F. Lu, K.-Y . Lin, Y . Xu, H. Li, G. Chen, and C. Jiang, “Urban architect: Steerable 3d urban scene generation with layout prior,”arXiv preprint arXiv:2404.06780, 2024

  8. [8]

    Procedural generation of 3d scenes for urban landscape based on remote sensing images,

    S. Yang, H. Yuan, T. Wang, R. Zhong, C. Song, Y . Fu, W. Ge, and X. Yuan, “Procedural generation of 3d scenes for urban landscape based on remote sensing images,” in2024 IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS). IEEE, 2024, pp. 1–7

  9. [9]

    Infinicube: Unbounded and controllable dynamic 3d driving scene generation with world-guided video models,

    Y . Lu, X. Ren, J. Yang, T. Shen, Z. Wu, J. Gao, Y . Wang, S. Chen, M. Chen, S. Fidleret al., “Infinicube: Unbounded and controllable dynamic 3d driving scene generation with world-guided video models,” arXiv preprint arXiv:2412.03934, 2024

  10. [10]

    Capturing, reconstructing, and simulating: The urbanscene3d dataset,

    L. Lin, Y . Liu, Y . Hu, X. Yan, K. Xie, and H. Huang, “Capturing, reconstructing, and simulating: The urbanscene3d dataset,” inEuropean Conference on Computer Vision, 2022

  11. [11]

    Urban- world: An urban world model for 3d city generation,

    Y . Shang, J. Chen, H. Fan, J. Ding, J. Feng, and Y . Li, “Urban- world: An urban world model for 3d city generation,”arXiv preprint arXiv:2407.11965, 2024. 14

  12. [12]

    Cityx: Controllable procedural content generation for unbounded 3d cities,

    S. Zhang, M. Zhou, Y . Wang, C. Luo, R. Wang, Y . Li, X. Yin, Z. Zhang, and J. Peng, “Cityx: Controllable procedural content generation for unbounded 3d cities,”arXiv preprint arXiv:2407.17572, 2024

  13. [13]

    Sat2vid: Street-view panoramic video synthesis from a single satellite image,

    Z. Li, Z. Li, Z. Cui, R. Qin, M. Pollefeys, and M. R. Oswald, “Sat2vid: Street-view panoramic video synthesis from a single satellite image,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 12 436–12 445

  14. [14]

    Geometry-aware satellite-to-ground image synthesis for urban areas,

    X. Lu, Z. Li, Z. Cui, M. R. Oswald, M. Pollefeys, and R. Qin, “Geometry-aware satellite-to-ground image synthesis for urban areas,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 859–867

  15. [16]

    Geometry-guided street- view panorama synthesis from satellite imagery,

    Y . Shi, D. Campbell, X. Yu, and H. Li, “Geometry-guided street- view panorama synthesis from satellite imagery,”IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 12, pp. 10 009–10 022, 2022

  16. [17]

    Geospecific view generation – geometry-context aware high-resolution ground view inference from satellite views,

    N. Xu and R. Qin, “Geospecific view generation – geometry-context aware high-resolution ground view inference from satellite views,”

  17. [18]

    Available: https://arxiv.org/abs/2407.08061

    [Online]. Available: https://arxiv.org/abs/2407.08061

  18. [19]

    Crossviewdiff: A cross-view diffusion model for satellite-to- street view synthesis,

    W. Li, J. He, J. Ye, H. Zhong, Z. Zheng, Z. Huang, D. Lin, and C. He, “Crossviewdiff: A cross-view diffusion model for satellite-to- street view synthesis,”arXiv preprint arXiv:2408.14765, 2024

  19. [20]

    Streetscapes: Large-scale consistent street view generation using au- toregressive video diffusion,

    B. Deng, R. Tucker, Z. Li, L. Guibas, N. Snavely, and G. Wetzstein, “Streetscapes: Large-scale consistent street view generation using au- toregressive video diffusion,” inACM SIGGRAPH 2024 Conference Papers, 2024, pp. 1–11

  20. [21]

    Ur- bangiraffe: Representing urban scenes as compositional generative neural feature fields,

    Y . Yang, Y . Yang, H. Guo, R. Xiong, Y . Wang, and Y . Liao, “Ur- bangiraffe: Representing urban scenes as compositional generative neural feature fields,” inProceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 9199–9210

  21. [22]

    Syn- theocc: Synthesize geometric-controlled street view images through 3d semantic mpis,

    L. Li, W. Qiu, Y . Cai, X. Yan, Q. Lian, B. Liu, and Y .-C. Chen, “Syn- theocc: Synthesize geometric-controlled street view images through 3d semantic mpis,”arXiv preprint arXiv:2410.00337, 2024

  22. [23]

    Sat2density: Faithful density learning from satellite-ground image pairs,

    M. Qian, J. Xiong, G.-S. Xia, and N. Xue, “Sat2density: Faithful density learning from satellite-ground image pairs,” inProceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 3683–3692

  23. [24]

    Seeing through satellite images at street views,

    M. Qian, B. Tan, Q. Wang, X. Zheng, H. Xiong, G.-S. Xia, Y . Shen, and N. Xue, “Seeing through satellite images at street views,”IEEE Transactions on Pattern Analysis and Machine Intelligence, 2026

  24. [25]

    Sat3dgen: Comprehensive street-level 3d scene gen- eration from single satellite image,

    M. Qian, Z. Xia, C. Liu, S. Ma, W. Wang, Z. Ke, B. Tan, H. Zhang, and G.-S. Xia, “Sat3dgen: Comprehensive street-level 3d scene gen- eration from single satellite image,” inThe Fourteenth International Conference on Learning Representations, 2026

  25. [26]

    Sat2city: 3d city gener- ation from a single satellite image with cascaded latent diffusion,

    T. Hua, L. Jiang, Y .-C. Chen, and W. Zhao, “Sat2city: 3d city gener- ation from a single satellite image with cascaded latent diffusion,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2025, pp. 27 978–27 988

  26. [27]

    Wide-area image geolo- calization with aerial reference imagery,

    S. Workman, R. Souvenir, and N. Jacobs, “Wide-area image geolo- calization with aerial reference imagery,” inProceedings of the IEEE International Conference on Computer Vision, 2015, pp. 3961–3969

  27. [28]

    Lending orientation to neural networks for cross- view geo-localization,

    L. Liu and H. Li, “Lending orientation to neural networks for cross- view geo-localization,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 5624–5633

  28. [29]

    Vigor: Cross-view image geo- localization beyond one-to-one retrieval,

    S. Zhu, T. Yang, and C. Chen, “Vigor: Cross-view image geo- localization beyond one-to-one retrieval,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2021, pp. 3640–3649

  29. [30]

    Google earth,

    Google, “Google earth,” 2026, accessed: 2026-05-28. [Online]. Available: https://earth.google.com/

  30. [31]

    Xcube: Large-scale 3d generative modeling using sparse voxel hier- archies,

    X. Ren, J. Huang, X. Zeng, K. Museth, S. Fidler, and F. Williams, “Xcube: Large-scale 3d generative modeling using sparse voxel hier- archies,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 4209–4219

  31. [32]

    Scube: Instant large-scale scene recon- struction using voxsplats,

    X. Ren, Y . Lu, H. Liang, Z. Wu, H. Ling, M. Chen, S. Fidler, F. Williams, and J. Huang, “Scube: Instant large-scale scene recon- struction using voxsplats,”arXiv preprint arXiv:2410.20030, 2024

  32. [33]

    Sim ´eoni, H

    O. Sim ´eoni, H. V . V o, M. Seitzer, F. Baldassarre, M. Oquab, C. Jose, V . Khalidov, M. Szafraniec, S. Yi, M. Ramamonjisoa, F. Massa, D. Haziza, L. Wehrstedt, J. Wang, T. Darcet, T. Moutakanni, L. Sen- tana, C. Roberts, A. Vedaldi, J. Tolan, J. Brandt, C. Couprie, J. Mairal, H. J ´egou, P. Labatut, and P. Bojanowski, “DINOv3,”arXiv preprint arXiv:2508.1...

  33. [34]

    Native and compact structured latents for 3d generation,

    J. Xiang, X. Chen, S. Xu, R. Wang, Z. Lv, Y . Deng, H. Zhu, Y . Dong, H. Zhao, N. J. Yuan, and J. Yang, “Native and compact structured latents for 3d generation,”arXiv preprint arXiv:2512.14692, 2025

  34. [35]

    3d gaussian splatting for real-time radiance field rendering

    B. Kerbl, G. Kopanas, T. Leimk ¨uhler, and G. Drettakis, “3d gaussian splatting for real-time radiance field rendering.”ACM Trans. Graph., vol. 42, no. 4, pp. 139–1, 2023

  35. [36]

    Earthcrafter: Scalable 3d earth generation via dual-sparse latent diffusion,

    S. Liu, C. Cao, C. Yu, W. Qian, J. Wang, and F. Wang, “Earthcrafter: Scalable 3d earth generation via dual-sparse latent diffusion,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 40, no. 9, 2026, pp. 7260–7268

  36. [37]

    ABot-Earth 0.5: Generative 3d earth model,

    M. Qian, T. Ouyang, M. Sun, Z. Wang, J. Xiong, J. Han, Y . Zhang, J. Zhang, X. Wang, Y . Liu, L. Tang, F. Yu, Z. Ge, M. Du, Y . Liu, N. Fan, S. Wang, Y . Peng, C. Jia, Y . Liu, S. Zeng, H. Shi, J. Lai, H. Pan, Z. Wu, N. Guo, M. Xu, and H. Zhang, “ABot-Earth 0.5: Generative 3d earth model,”arXiv preprint arXiv:2606.09967, 2026

  37. [38]

    Sat2realcity: Geometry- aware and appearance-controllable 3d urban generation from satellite imagery,

    Y . Kang, X. Wang, Z. Wu, Y . Shi, and H. Zhu, “Sat2realcity: Geometry- aware and appearance-controllable 3d urban generation from satellite imagery,”arXiv preprint arXiv:2511.11470, 2025

  38. [39]

    Persistent nature: A generative model of unbounded 3d worlds,

    L. Chai, R. Tucker, Z. Li, P. Isola, and N. Snavely, “Persistent nature: A generative model of unbounded 3d worlds,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023, pp. 20 863–20 874

  39. [40]

    Scenedreamer: Unbounded 3d scene generation from 2d image collections,

    Z. Chen, G. Wang, and Z. Liu, “Scenedreamer: Unbounded 3d scene generation from 2d image collections,”IEEE transactions on pattern analysis and machine intelligence, 2023

  40. [41]

    Scenescape: Text- driven consistent scene generation,

    R. Fridman, A. Abecasis, Y . Kasten, and T. Dekel, “Scenescape: Text- driven consistent scene generation,”Advances in Neural Information Processing Systems, vol. 36, 2024

  41. [42]

    Gancraft: Unsuper- vised 3d neural rendering of minecraft worlds,

    Z. Hao, A. Mallya, S. Belongie, and M.-Y . Liu, “Gancraft: Unsuper- vised 3d neural rendering of minecraft worlds,” inProceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 14 072–14 082

  42. [43]

    Scene123: One prompt to 3d scene generation via video-assisted and consistency- enhanced mae,

    Y . Yang, F. Yin, J. Fan, X. Chen, W. Li, and G. Yu, “Scene123: One prompt to 3d scene generation via video-assisted and consistency- enhanced mae,”arXiv preprint arXiv:2408.05477, 2024

  43. [44]

    3d-scenedreamer: Text-driven 3d-consistent scene generation,

    S. Zhang, Y . Zhang, Q. Zheng, R. Ma, W. Hua, H. Bao, W. Xu, and C. Zou, “3d-scenedreamer: Text-driven 3d-consistent scene generation,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 10 170–10 180

  44. [45]

    Extend3d: Town-scale 3d generation,

    S. Yoon, J. Kim, and J. Park, “Extend3d: Town-scale 3d generation,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2026, pp. 5892–5901

  45. [46]

    2019 ieee grss data fusion contest: Large-scale semantic 3d reconstruction,

    B. Le Saux, N. Yokoya, R. H ¨ansch, and M. Brown, “2019 ieee grss data fusion contest: Large-scale semantic 3d reconstruction,”IEEE Geoscience and Remote Sensing Magazine, 2019

  46. [47]

    Urbanbis: A large-scale benchmark for fine-grained urban building instance segmentation,

    G. Yang, F. Xue, Q. Zhang, K. Xie, C.-W. Fu, and H. Huang, “Urbanbis: A large-scale benchmark for fine-grained urban building instance segmentation,”ACM Transactions on Graphics, vol. 42, no. 4, 2023

  47. [48]

    Holicity: A city-scale data platform for learning holistic 3d structures,

    Y . Zhou, J. Huang, X. Dai, L. Luo, Z. Chen, and Y . Ma, “Holicity: A city-scale data platform for learning holistic 3d structures,”arXiv preprint arXiv:2008.03286, 2020

  48. [49]

    Omnicity: Omnipotent city understanding with multi-level and multi-view images,

    W. Li, Y . Lai, L. Xu, Y . Xiangli, J. Yu, C. He, G.-S. Xia, and D. Lin, “Omnicity: Omnipotent city understanding with multi-level and multi-view images,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 17 397–17 407

  49. [50]

    Gamus: A geometry-aware multi-modal semantic segmentation benchmark for remote sensing data,

    Z. Xiong, S. Chen, Y . Wang, L. Mou, and X. X. Zhu, “Gamus: A geometry-aware multi-modal semantic segmentation benchmark for remote sensing data,”arXiv preprint arXiv:2305.14914, 2023

  50. [51]

    Nuiscene: Exploring efficient generation of unbounded outdoor scenes,

    H.-H. Lee, Q. Han, and A. X. Chang, “Nuiscene: Exploring efficient generation of unbounded outdoor scenes,”arXiv:2503.16375, 2025

  51. [52]

    Syncity: Training-free generation of 3d worlds,

    P. Engstler, A. Shtedritski, I. Laina, C. Rupprecht, and A. Vedaldi, “Syncity: Training-free generation of 3d worlds,”arXiv:2503.16420, 2025

  52. [53]

    Dreamfusion: Text- to-3d using 2d diffusion,

    B. Poole, A. Jain, J. T. Barron, and B. Mildenhall, “Dreamfusion: Text- to-3d using 2d diffusion,”arXiv preprint arXiv:2209.14988, 2022

  53. [54]

    Componerf: Text-guided multi-object compositional nerf with editable 3d scene layout,

    H. Bai, Y . Lyu, L. Jiang, S. Li, H. Lu, X. Lin, and L. Wang, “Componerf: Text-guided multi-object compositional nerf with editable 3d scene layout,”arXiv preprint arXiv:2303.13843, 2023

  54. [55]

    Gala3d: Towards text-to-3d complex scene genera- tion via layout-guided generative gaussian splatting,

    X. Zhou, X. Ran, Y . Xiong, J. He, Z. Lin, Y . Wang, D. Sun, and M.-H. Yang, “Gala3d: Towards text-to-3d complex scene genera- tion via layout-guided generative gaussian splatting,”arXiv preprint arXiv:2402.07207, 2024

  55. [56]

    Set-the-scene: Global-local training for generating controllable nerf scenes,

    D. Cohen-Bar, E. Richardson, G. Metzer, R. Giryes, and D. Cohen-Or, “Set-the-scene: Global-local training for generating controllable nerf scenes,” inProceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 2920–2929. 15

  56. [57]

    Dreamscape: 3d scene cre- ation via gaussian splatting joint correlation modeling,

    X. Yuan, H. Yang, Y . Zhao, and D. Huang, “Dreamscape: 3d scene cre- ation via gaussian splatting joint correlation modeling,”arXiv preprint arXiv:2404.09227, 2024

  57. [58]

    A general framework to boost 3d gs initialization for text-to-3d generation by lexical richness,

    L. Jiang, H. Li, and L. Wang, “A general framework to boost 3d gs initialization for text-to-3d generation by lexical richness,” inProceed- ings of the 32nd ACM International Conference on Multimedia, 2024, pp. 6803–6812

  58. [59]

    Disentangled 3d scene generation with layout learning,

    D. Epstein, B. Poole, B. Mildenhall, A. A. Efros, and A. Holynski, “Disentangled 3d scene generation with layout learning,”arXiv preprint arXiv:2402.16936, 2024

  59. [60]

    Progressive3d: Progressively local editing for text-to-3d content creation with complex semantic prompts,

    X. Cheng, T. Yang, J. Wang, Y . Li, L. Zhang, J. Zhang, and L. Yuan, “Progressive3d: Progressively local editing for text-to-3d content creation with complex semantic prompts,”arXiv preprint arXiv:2310.11784, 2023

  60. [61]

    Lrm: Large reconstruction model for single image to 3d,

    Y . Hong, K. Zhang, J. Gu, S. Bi, Y . Zhou, D. Liu, F. Liu, K. Sunkavalli, T. Bui, and H. Tan, “Lrm: Large reconstruction model for single image to 3d,” inThe Twelfth International Conference on Learning Representations, 2024

  61. [62]

    Triposr: Fast 3d object reconstruction from a single image,

    D. Tochilkin, D. Pankratz, Z. Liu, Z. Huang, A. Letts, Y . Li, D. Liang, C. Laforte, V . Jampani, and Y .-P. Cao, “Triposr: Fast 3d object reconstruction from a single image,”arXiv preprint arXiv:2403.02151, 2024

  62. [64]

    Hunyuan3d 2.0: Scaling diffusion models for high resolution textured 3d assets generation,

    Z. Zhao, Z. Lai, Q. Lin, Y . Zhao, H. Liu, S. Yang, Y . Feng, M. Yang, S. Zhang, X. Yanget al., “Hunyuan3d 2.0: Scaling diffusion models for high resolution textured 3d assets generation,”arXiv preprint arXiv:2501.12202, 2025

  63. [65]

    Meshgen: Generating pbr textured mesh with render-enhanced auto-encoder and generative data augmentation,

    Z. Chen, Y . Wang, W. Sun, F. Wang, Y . Chen, and H. Liu, “Meshgen: Generating pbr textured mesh with render-enhanced auto-encoder and generative data augmentation,” inProceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 5835–5848

  64. [66]

    Sf3d: Stable fast 3d mesh reconstruction with uv-unwrapping and illumination disentanglement,

    M. Boss, Z. Huang, A. Vasishta, and V . Jampani, “Sf3d: Stable fast 3d mesh reconstruction with uv-unwrapping and illumination disentanglement,” inProceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 16 240–16 250

  65. [67]

    Lrm: Large reconstruction model for single image to 3d,

    Y . Hong, K. Zhang, J. Gu, S. Bi, Y . Zhou, D. Liu, F. Liu, K. Sunkavalli, T. Bui, and H. Tan, “Lrm: Large reconstruction model for single image to 3d,”arXiv preprint arXiv:2311.04400, 2023

  66. [68]

    Instantmesh: Efficient 3d mesh generation from a single image with sparse-view large reconstruction models,

    J. Xu, W. Cheng, Y . Gao, X. Wang, S. Gao, and Y . Shan, “Instantmesh: Efficient 3d mesh generation from a single image with sparse-view large reconstruction models,”arXiv preprint arXiv:2404.07191, 2024

  67. [69]

    Clay: A controllable large-scale generative model for creating high-quality 3d assets,

    L. Zhang, Z. Wang, Q. Zhang, Q. Qiu, A. Pang, H. Jiang, W. Yang, L. Xu, and J. Yu, “Clay: A controllable large-scale generative model for creating high-quality 3d assets,”ACM Transactions on Graphics (TOG), vol. 43, no. 4, pp. 1–20, 2024

  68. [70]

    3dshape2vecset: A 3d shape representation for neural fields and generative diffusion models,

    B. Zhang, J. Tang, M. Niessner, and P. Wonka, “3dshape2vecset: A 3d shape representation for neural fields and generative diffusion models,” ACM Transactions on Graphics (TOG), vol. 42, no. 4, pp. 1–16, 2023

  69. [71]

    Lagem: A large geometry model for 3d rep- resentation learning and diffusion,

    B. Zhang and P. Wonka, “Lagem: A large geometry model for 3d rep- resentation learning and diffusion,”arXiv preprint arXiv:2410.01295, 2024

  70. [72]

    Meshgpt: Generating trian- gle meshes with decoder-only transformers,

    Y . Siddiqui, A. Alliegro, A. Artemov, T. Tommasi, D. Sirigatti, V . Rosov, A. Dai, and M. Nießner, “Meshgpt: Generating trian- gle meshes with decoder-only transformers,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 19 615–19 625

  71. [73]

    Meshanything: Artist-created mesh generation with autoregressive transformers,

    Y . Chen, T. He, D. Huang, W. Ye, S. Chen, J. Tang, X. Chen, Z. Cai, L. Yang, G. Yuet al., “Meshanything: Artist-created mesh generation with autoregressive transformers,”arXiv preprint arXiv:2406.10163, 2024

  72. [74]

    Meshanything v2: Artist-created mesh generation with adjacent mesh tokenization,

    Y . Chen, Y . Wang, Y . Luo, Z. Wang, Z. Chen, J. Zhu, C. Zhang, and G. Lin, “Meshanything v2: Artist-created mesh generation with adjacent mesh tokenization,”arXiv preprint arXiv:2408.02555, 2024

  73. [75]

    Meshxl: Neural coordinate field for generative 3d foundation models,

    S. Chen, X. Chen, A. Pang, X. Zeng, W. Cheng, Y . Fu, F. Yin, Y . Wang, Z. Wang, C. Zhanget al., “Meshxl: Neural coordinate field for generative 3d foundation models,”arXiv preprint arXiv:2405.20853, 2024

  74. [76]

    Edgerunner: Auto-regressive auto-encoder for artistic mesh genera- tion,

    J. Tang, Z. Li, Z. Hao, X. Liu, G. Zeng, M.-Y . Liu, and Q. Zhang, “Edgerunner: Auto-regressive auto-encoder for artistic mesh genera- tion,”arXiv preprint arXiv:2409.18114, 2024

  75. [77]

    Pivotmesh: Generic 3d mesh generation via pivot vertices guidance,

    H. Weng, Y . Wang, T. Zhang, C. Chen, and J. Zhu, “Pivotmesh: Generic 3d mesh generation via pivot vertices guidance,”arXiv preprint arXiv:2405.16890, 2024

  76. [78]

    Llama-mesh: Unifying 3d mesh generation with language models,

    Z. Wang, J. Lorraine, Y . Wang, H. Su, J. Zhu, S. Fidler, and X. Zeng, “Llama-mesh: Unifying 3d mesh generation with language models,” arXiv preprint arXiv:2411.09595, 2024

  77. [79]

    Structured 3d latents for scalable and versatile 3d generation,

    J. Xiang, Z. Lv, S. Xu, Y . Deng, R. Wang, B. Zhang, D. Chen, X. Tong, and J. Yang, “Structured 3d latents for scalable and versatile 3d generation,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2025, pp. 21 469–21 480

  78. [80]

    Blender - a 3d modeling and animation software,

    B. Foundation, “Blender - a 3d modeling and animation software,” 2025, version 4.2, accessed: 2025-02-19. [Online]. Available: https://www.blender.org

  79. [81]

    Cloudcompare (version 2.12.4) [gpl software],

    C. D. Team, “Cloudcompare (version 2.12.4) [gpl software],” 2025, retrieved on 2025-02-19. [Online]. Available: http://www. cloudcompare.org/

  80. [82]

    Depth anything: Unleashing the power of large-scale unlabeled data,

    L. Yang, B. Kang, Z. Huang, X. Xu, J. Feng, and H. Zhao, “Depth anything: Unleashing the power of large-scale unlabeled data,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 10 371–10 381

Showing first 80 references.