ABot-Earth 0.5: Generative 3D Earth Model

Chunxue Jia; Fei Yu; Hang Zhang; Haozhe Shi; Hongyu Pan; Jiarong Han; Jiawei Zhang; Jincheng Xiong; Junnan Lai; Luyang Tang

arxiv: 2606.09967 · v1 · pith:N4HZCVVBnew · submitted 2026-06-08 · 💻 cs.CV

Ming Qian, Tianjian Ouyang, Mingchao Sun, Zijian Wang, Jincheng Xiong, Jiarong Han, Yongchang Zhang, Jiawei Zhang, Xu Wang, Yu Liu, Luyang Tang, Fei Yu, Zengye Ge, Mengmeng Du, Yuan Liu, Nianfei Fan, Song Wang, Yingliang Peng, Chunxue Jia, Yang Liu, Shiying Zeng, Haozhe Shi, Junnan Lai, Hongyu Pan, Zheng Wu, Ning Guo, Mu Xu, Hang Zhang

Pith reviewed 2026-06-27 16:51 UTC · model grok-4.3

classification 💻 cs.CV

keywords: generative 3D model, satellite imagery, 3D Gaussian Splatting, urban reconstruction, level of detail, Embodied AI, digital earth, 3D scene synthesis

The pith

ABot-Earth 0.5 generates novel 3D scenes from satellite imagery alone using a 3D Gaussian Splatting model trained on urban reconstructions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a generative framework that learns to produce realistic 3D geometry and textures directly from geospatially referenced satellite images. It trains the model on collections of existing real-world urban 3D reconstructions so that, at inference time, new scenes can be synthesized without any additional 3D input. Synthesis runs at under 10 minutes per square kilometer, and the generated scenes include hierarchical level-of-detail structures for real-time web display. This approach targets applications such as closed-loop UAV navigation by reducing the sim-to-real gap in simulation environments.

Core claim

ABot-Earth 0.5 formulates a generative model directly in the 3D Gaussian Splatting representation; after training on a diverse corpus of real-world urban reconstructions, the model produces novel, seamless 3D environments conditioned solely on satellite imagery, achieving synthesis rates under 10 minutes per square kilometer together with integrated LOD structures that support real-time web-based visualization.

What carries the argument

A generative model formulated directly with the 3D Gaussian Splatting (3DGS) representation that learns geometry and texture from satellite imagery inputs.
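
For readers unfamiliar with the representation, here is a minimal sketch of how a single 3DGS primitive's shape is commonly parameterized, following the original 3D Gaussian Splatting formulation (covariance Sigma = R S S^T R^T, built from a rotation quaternion and per-axis scales). It is not taken from ABot-Earth 0.5, whose internals the abstract does not describe; the function names `quat_to_rotmat` and `gaussian_covariance` are illustrative.

```python
import numpy as np

def quat_to_rotmat(q):
    """Convert a quaternion (w, x, y, z) to a 3x3 rotation matrix."""
    w, x, y, z = np.asarray(q, dtype=float) / np.linalg.norm(q)
    return np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z),     2 * (x * z + w * y)],
        [2 * (x * y + w * z),     1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y),     2 * (y * z + w * x),     1 - 2 * (x * x + y * y)],
    ])

def gaussian_covariance(scale, quat):
    """Covariance of one Gaussian: Sigma = R S S^T R^T.

    Building Sigma this way keeps it symmetric and positive semi-definite
    for any scale and rotation values.
    """
    R = quat_to_rotmat(quat)
    S = np.diag(np.asarray(scale, dtype=float))
    M = R @ S
    return M @ M.T
```

A full primitive would also carry a position, an opacity, and color coefficients; a generative model in this representation would have to output all of these per Gaussian.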

If this is right

The generated scenes include hierarchical LOD structures that enable real-time interactive visualization inside web-based map engines.
The framework supplies high-fidelity simulation environments that reduce the sim-to-real domain gap for downstream Embodied AI tasks such as closed-loop UAV navigation.
Synthesis at under 10 minutes per square kilometer supplies an ultra-low-cost route to large-scale 3D reconstruction at global coverage.
The same trained model can be applied to any geospatially referenced satellite imagery without requiring additional 3D training data for each new region.
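
To put the stated rate in perspective, the sketch below turns "under 10 minutes per square kilometer" into an upper bound on wall-clock time for a given area. It assumes cost scales linearly with area and divides evenly across independent workers; neither assumption is stated in the abstract, and the function name and parameters are hypothetical.

```python
def synthesis_time_upper_bound_hours(area_km2, minutes_per_km2=10.0, workers=1):
    """Upper bound on wall-clock hours for synthesizing area_km2.

    Assumes linear scaling with area and perfect parallelism across workers
    (both are assumptions, not claims from the paper).
    """
    if area_km2 < 0 or minutes_per_km2 <= 0 or workers < 1:
        raise ValueError("area must be >= 0, rate > 0, workers >= 1")
    return area_km2 * minutes_per_km2 / 60.0 / workers

# A hypothetical 1,000 km^2 metropolitan area:
single = synthesis_time_upper_bound_hours(1000)               # about 166.7 hours
parallel = synthesis_time_upper_bound_hours(1000, workers=8)  # about 20.8 hours
```

Under the same assumptions, Earth's roughly 149 million km² of land would take on the order of 2,800 worker-years, so the global-coverage reading depends on heavy parallelism as much as on the per-kilometer rate.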

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same conditioning mechanism could be tested on non-urban satellite imagery such as agricultural or coastal regions to check generalization limits.
Integration with existing global satellite archives would allow on-demand 3D model generation for any location covered by the training distribution.
The output 3DGS scenes could serve as training environments for reinforcement-learning agents that require dense, textured geometry beyond what 2D image simulators provide.

Load-bearing premise

Training on existing urban reconstructions will let the model generalize to produce accurate geometry and textures from new satellite imagery alone.

What would settle it

A quantitative comparison of generated 3D models against ground-truth LiDAR or photogrammetry on a held-out set of satellite images. Systematic deviations in geometry or texture fidelity would count against the generalization premise; close agreement would support it.
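
One common way to score geometric agreement in such a comparison is the Chamfer distance between points sampled from the generated scene and ground-truth points. The sketch below is a brute-force version for small samples. Using Gaussian centers as the generated point set is an assumption, the paper does not say which metric it uses, and Chamfer conventions vary (squared or unsquared distances, sum or average of the two directions).

```python
import numpy as np

def chamfer_distance(pred, gt):
    """Symmetric Chamfer distance between point sets of shape (N, 3) and (M, 3).

    Uses unsquared Euclidean distances and sums the two directional means.
    Brute-force pairwise distances, so it suits small samples only.
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    # d[i, j] = Euclidean distance between pred[i] and gt[j]
    d = np.linalg.norm(pred[:, None, :] - gt[None, :, :], axis=-1)
    pred_to_gt = d.min(axis=1).mean()  # accuracy: generated points far from truth
    gt_to_pred = d.min(axis=0).mean()  # completeness: truth not covered by output
    return pred_to_gt + gt_to_pred
```

Reporting the two directional terms separately would help tell hallucinated structure (poor accuracy) apart from missing structure (poor completeness).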

Original abstract

We present ABot-Earth 0.5, a generative 3D framework designed to synthesize vast, seamless 3D environments from ubiquitous, geospatially referenced satellite imagery. To achieve this, we propose a novel generative model formulated directly with the 3D Gaussian Splatting (3DGS) representation. The model is trained on a diverse corpus of existing real-world urban reconstructions, learning to generate realistic geometry and textures. At inference, it synthesizes novel 3D scenes conditioned solely on satellite imagery at a scalable rate of under 10 minutes per square kilometer, while demonstrating exceptional realism. The framework is designed for accessibility, with integrated hierarchical level-of-detail (LOD) structures that permit real-time, interactive visualization on web-based map engines. This high-fidelity simulation sandbox effectively mitigates the sim-to-real domain gap, enabling critical downstream Embodied AI applications like closed-loop UAV navigation. By providing an ultra-low-cost and high-efficiency solution, ABot-Earth 0.5 significantly lowers the technical and financial barriers to large-scale 3D reconstruction and empowers the future of global digital earth visualization.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The abstract's claim of satellite-conditioned 3DGS synthesis at inference has no described training pathway or paired data to support it.

The main thing to know is that the training description only covers learning from real-world urban 3D reconstructions, with no mention of satellite imagery, paired satellite-3D data, or any conditioning architecture. This leaves the central inference claim without an evident mechanism.

The paper applies 3D Gaussian Splatting in a generative setup aimed at large-scale Earth models. It highlights practical elements like hierarchical LOD for web map engines and targets downstream uses in Embodied AI such as UAV navigation by cutting the sim-to-real gap. The stated speed of under 10 minutes per square kilometer would be useful if backed up.

The soft spot is exactly the mismatch flagged in the stress-test. The abstract gives no equations, architecture details, or experimental results, so the realism and generalization claims cannot be checked against the paper's evidence. The concern holds up on the provided text.

This is for applied researchers in 3D reconstruction or simulation who might want scalable Earth models. A reader could extract the high-level idea, but the evidential gaps make it unsuitable for serious refereeing right now.

Recommendation: desk reject unless the full paper supplies the missing training and conditioning details with quantitative support.

Referee Report

2 major / 1 minor

Summary. The paper presents ABot-Earth 0.5, a generative 3D framework that synthesizes vast, seamless 3D environments from geospatially referenced satellite imagery using a novel model formulated directly in the 3D Gaussian Splatting (3DGS) representation. It is trained on a diverse corpus of existing real-world urban reconstructions to learn realistic geometry and textures. At inference, the model generates novel 3D scenes conditioned solely on satellite imagery, achieving synthesis rates under 10 minutes per square kilometer with exceptional realism. The framework incorporates hierarchical level-of-detail (LOD) structures for real-time web-based visualization and targets applications in Embodied AI such as closed-loop UAV navigation by mitigating the sim-to-real gap.

Significance. If substantiated, the approach would offer a scalable, low-cost method for large-scale 3D reconstruction from ubiquitous satellite data, enabling global digital earth models and supporting downstream Embodied AI tasks. The direct use of 3DGS combined with hierarchical LOD for web accessibility represents a practical direction for interactive 3D earth visualization.
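
As an illustration of how a hierarchical LOD structure typically serves a web viewer, the sketch below picks a level by projecting each level's world-space error to pixels and choosing the coarsest level that stays under a threshold. This is the generic screen-space-error approach used by tiled 3D viewers, not the scheme described in the paper; the level errors, threshold, and function names are hypothetical.

```python
import math

def screen_space_error(geometric_error_m, distance_m, screen_height_px, fov_y_rad):
    """Project a world-space error (metres) to pixels for a perspective camera."""
    return geometric_error_m * screen_height_px / (
        2.0 * distance_m * math.tan(fov_y_rad / 2.0)
    )

def select_lod(level_errors_m, distance_m, screen_height_px=1080,
               fov_y_rad=math.radians(60), max_sse_px=16.0):
    """Return the index of the coarsest level whose projected error is acceptable.

    level_errors_m is ordered coarse to fine (decreasing geometric error).
    Falls back to the finest level if none meets the threshold.
    """
    for i, err in enumerate(level_errors_m):
        sse = screen_space_error(err, distance_m, screen_height_px, fov_y_rad)
        if sse <= max_sse_px:
            return i
    return len(level_errors_m) - 1

# Four illustrative levels with geometric errors of 32 m, 8 m, 2 m and 0.5 m.
levels = [32.0, 8.0, 2.0, 0.5]
far_level = select_lod(levels, distance_m=5000)  # coarsest level suffices
near_level = select_lod(levels, distance_m=50)   # finest level is needed
```

The point of the hierarchy is that distant tiles load only coarse data, which is what keeps rendering cost bounded as the covered area grows.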

major comments (2)

[Abstract] The central inference claim (that the model synthesizes novel 3D scenes conditioned solely on satellite imagery) has no described training pathway. Training is stated only as occurring on 3D reconstructions with no reference to satellite imagery inputs, paired satellite-3D data, image encoder, cross-attention layers, or any conditioning architecture. This gap is load-bearing for the primary contribution.
[Abstract] Claims of 'fast synthesis' (under 10 minutes per square kilometer) and 'exceptional realism' are asserted without any quantitative metrics, baselines, error analysis, ablation studies, or experimental details, preventing evaluation of the results against the paper's own evidence.
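
To make concrete what a conditioning pathway of the kind the first comment asks about could look like, here is a minimal single-head cross-attention sketch in NumPy, in which latent scene tokens query satellite image tokens. It is a generic illustration, not the paper's architecture; the token counts, dimensions, random weights, and the existence of an image encoder producing `image` are all assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(scene_tokens, image_tokens, Wq, Wk, Wv):
    """Single-head cross-attention: scene tokens query satellite image tokens.

    scene_tokens: (N, d_model) latent tokens of the 3D scene being generated
    image_tokens: (M, d_model) features from a hypothetical image encoder
    Wq, Wk, Wv:   (d_model, d_head) projection matrices
    Returns (N, d_head) image-conditioned features and (N, M) attention weights.
    """
    Q = scene_tokens @ Wq
    K = image_tokens @ Wk
    V = image_tokens @ Wv
    weights = softmax(Q @ K.T / np.sqrt(Q.shape[-1]), axis=-1)
    return weights @ V, weights

rng = np.random.default_rng(0)
d_model, d_head = 16, 8
scene = rng.normal(size=(5, d_model))   # 5 scene tokens
image = rng.normal(size=(12, d_model))  # 12 image patch tokens
Wq, Wk, Wv = (rng.normal(size=(d_model, d_head)) for _ in range(3))
out, attn = cross_attention(scene, image, Wq, Wk, Wv)
```

Training such a layer requires examples where the satellite image and the target 3D scene depict the same place, which is why the absence of paired satellite-3D data in the training description matters.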

minor comments (1)

[Abstract] The version designation '0.5' is used without any description of prior versions, changes, or what distinguishes this release.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback on our submission. We provide point-by-point responses to the major comments below.

Referee: [Abstract] The central inference claim (that the model synthesizes novel 3D scenes conditioned solely on satellite imagery) has no described training pathway. Training is stated only as occurring on 3D reconstructions with no reference to satellite imagery inputs, paired satellite-3D data, image encoder, cross-attention layers, or any conditioning architecture. This gap is load-bearing for the primary contribution.

Authors: The referee correctly identifies that the abstract does not detail the conditioning architecture. We will revise the abstract to explicitly mention the use of paired satellite-3D data during training and the incorporation of an image encoder with cross-attention layers for conditioning the generative model on satellite imagery at inference time. revision: yes

Referee: [Abstract] Claims of 'fast synthesis' (under 10 minutes per square kilometer) and 'exceptional realism' are asserted without any quantitative metrics, baselines, error analysis, ablation studies, or experimental details, preventing evaluation of the results against the paper's own evidence.

Authors: We agree that the current manuscript lacks the quantitative evaluations mentioned. In the revised version, we will add sections with quantitative metrics, comparisons to baselines, error analysis, and ablation studies to support the claims of synthesis speed and realism. revision: yes

Circularity Check

0 steps flagged

No circularity detected

The provided abstract and description contain no equations, derivations, predictions, or self-citations. The model is described as trained on real-world 3D reconstructions to generate geometry and textures, with inference conditioned on satellite imagery, but no mathematical steps or load-bearing claims reduce by construction to fitted inputs or self-referential definitions. No uniqueness theorems, ansatzes, or renamings of known results are invoked. The paper's claims rest on training data and architecture details not shown to be circular in the text.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract contains no explicit free parameters, axioms, or invented entities.

pith-pipeline@v0.9.1-grok · 5824 in / 1070 out tokens · 25859 ms · 2026-06-27T16:51:56.185252+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Sat2City v2: Native 3D City Asset Generation from a Single Satellite Image
cs.CV 2026-06 unverdicted novelty 5.0

Sat2City v2 adapts a pretrained native 3D latent model to generate controllable textured 3D city assets from satellite images via geometry flow fine-tuning and anchored texturing on a collected real dataset.

[1] [1]

Structure-from-motion revisited,

J. L. Schönberger and J.-M. Frahm, “Structure-from-motion revisited,” inConference on Computer Vision and Pattern Recognition (CVPR), 2016

2016

[2] [2]

Pixelwise view selection for unstructured multi-view stereo,

J. L. Schönberger, E. Zheng, M. Pollefeys, and J.-M. Frahm, “Pixelwise view selection for unstructured multi-view stereo,” inEuropean Conference on Computer Vision (ECCV), 2016

2016

[3] [3]

Uav for 3d mapping applications: A review,

F. Nex and F. Remondino, “Uav for 3d mapping applications: A review,”Applied geomatics, vol. 6, no. 1, pp. 1–15, 2014

2014

[4] [4]

Airbornelaserscanning—anintroductionandoverview,

A.WehrandU.Lohr, “Airbornelaserscanning—anintroductionandoverview,” ISPRSJournalofPhotogrammetry and Remote Sensing, vol. 54, no. 2, pp. 68–82, 1999

1999

[5] [5]

Shan and C

J. Shan and C. K. Toth, Eds.,Topographic Laser Ranging and Scanning: Principles and Processing. Boca Raton: CRC Press, 2018

2018

[6] [6]

Native and compact structured latents for 3d generation,

J. Xiang, X. Chen, S. Xu, R. Wang, Z. Lv, Y. Deng, H. Zhu, Y. Dong, H. Zhao, N. J. Yuan, and J. Yang, “Native and compact structured latents for 3d generation,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2026, pp. 14419–14429

2026

[7] [7]

Structured 3d latents for scalable and versatile 3d generation,

J. Xiang, Z. Lv, S. Xu, Y. Deng, R. Wang, B. Zhang, D. Chen, X. Tong, and J. Yang, “Structured 3d latents for scalable and versatile 3d generation,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2025, pp. 21469–21480

2025

[8] [8]

Clay: A controllable large-scale generative model for creating high-quality 3d assets,

L. Zhang, Z. Wang, Q. Zhang, Q. Qiu, A. Pang, H. Jiang, W. Yang, L. Xu, and J. Yu, “Clay: A controllable large-scale generative model for creating high-quality 3d assets,”ACM Trans.Graph., vol. 43, no. 4, Jul. 2024. [Online]. Available: https://doi.org/10.1145/3658146

work page doi:10.1145/3658146 2024

[9] [9]

Hunyuan3d 2.0: Scaling diffusion models for high resolution textured 3d assets generation,

T. H. Team, “Hunyuan3d 2.0: Scaling diffusion models for high resolution textured 3d assets generation,” 2025

2025

[10] [10]

Hunyuan3d 2.5: Towards high-fidelity 3d assets generation with ultimate details,

——, “Hunyuan3d 2.5: Towards high-fidelity 3d assets generation with ultimate details,” 2025. [Online]. Available: https://arxiv.org/abs/2506.16504

Pith/arXiv arXiv 2025

[11] [11]

Seed3d 1.0: From images to high-fidelity simulation-ready 3d assets,

J. Feng, X. Li, J. Lin, J. Liu, G. Liu, W. Lou, S. Ma, G. Shi, Q. Wang, J. Wang, Z. Xu, X. Yi, Z. Yu, J. Zhang, Y. Zhu, R. Chen, J. Chi, Z. Du, L. Han, L. Huang, K. Jiang, Y. Li, G. Luo, S. Wang, Q. Wu, F. Yang, J. Zhang, and X. Zhang, “Seed3d 1.0: From images to high-fidelity simulation-ready 3d assets,” 2025. [Online]. Available: https://arxiv.org/abs/2...

arXiv 2025

[12] [12]

Get3d: A generative model of high quality 3d textured shapes learned from images,

J. Gao, T. Shen, Z. Wang, W. Chen, K. Yin, D. Li, O. Litany, Z. Gojcic, and S. Fidler, “Get3d: A generative model of high quality 3d textured shapes learned from images,” inAdvances In Neural Information Processing Systems, 2022

2022

[13] [13]

Shap-e: Generating conditional 3d implicit functions,

H. Jun and A. Nichol, “Shap-e: Generating conditional 3d implicit functions,” 2023. [Online]. Available: https://arxiv.org/abs/2305.02463

Pith/arXiv arXiv 2023

[14] [14]

Earthcrafter: Scalable 3d earth generation via dual- sparse latent diffusion,

S. Liu, C. Cao, C. Yu, W. Qian, J. Wang, and F. Wang, “Earthcrafter: Scalable 3d earth generation via dual- sparse latent diffusion,” inProceedings of the AAAI Conference on Artificial Intelligence, vol. 40, no. 9, 2026, pp. 7260–7268

2026

[15] [15]

Citydreamer: Compositional generative model of unbounded 3d cities,

H. Xie, Z. Chen, F. Hong, and Z. Liu, “Citydreamer: Compositional generative model of unbounded 3d cities,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2024, pp. 9666–9675

2024

[16] [16]

Sat2city: 3d city generation from a single satellite image with cascaded latent diffusion,

H. Huaet al., “Sat2city: 3d city generation from a single satellite image with cascaded latent diffusion,” inICCV, 2025

2025

[17] [17]

Infinicity: Infinite- scale city synthesis,

C. H. Lin, H.-Y. Lee, W. Menapace, M. Chai, A. Siarohin, M.-H. Yang, and S. Tulyakov, “Infinicity: Infinite- scale city synthesis,” inProceedings of the IEEE/CVF international conference on computer vision, 2023, pp. 22808–22818

2023

[18] [18]

Sat2scene: 3d urban scene generation from satellite images with diffusion,

Y. Liet al., “Sat2scene: 3d urban scene generation from satellite images with diffusion,” inCVPR, 2024

2024

[19] [19]

Urbangiraffe: Representing urban scenes as compositional generative neural feature fields,

Y. Yang, Y. Yang, H. Guo, R. Xiong, Y. Wang, and Y. Liao, “Urbangiraffe: Representing urban scenes as compositional generative neural feature fields,” inProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2023, pp. 9199–9210. 15

2023

[20] [20]

Sat2density: Faithful density learning from satellite-ground image pairs,

M. Qian, J. Xiong, G.-S. Xia, and N. Xue, “Sat2density: Faithful density learning from satellite-ground image pairs,” inProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2023, pp. 3683–3692

2023

[21] [21]

Seeing through satellite images at street views,

M. Qian, B. Tan, Q. Wang, X. Zheng, H. Xiong, G.-S. Xia, Y. Shen, and N. Xue, “Seeing through satellite images at street views,”IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 48, no. 5, pp. 5692–5709, 2026

2026

[22] [22]

Sat3DGen: Comprehensive street-level 3d scene generation from single satellite image,

M. Qian, Z. Xia, C. Liu, S. Ma, W. Wang, Z. Ke, B. Tan, H. Zhang, and G.-S. Xia, “Sat3DGen: Comprehensive street-level 3d scene generation from single satellite image,” in The Fourteenth International Conference on Learning Representations, 2026. [Online]. Available: https://openreview.net/forum?id=E7JzkZCofa

2026

[23] [23]

Infinite nature: Perpetual view generation of natural scenes from a single image,

A. Liu, R. Tucker, V. Jampani, A. Makadia, N. Snavely, and A. Kanazawa, “Infinite nature: Perpetual view generation of natural scenes from a single image,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2021

2021

[24] [24]

Generative gaussian splatting for unbounded 3d city generation,

H. Xie, Z. Chen, F. Hong, and Z. Liu, “Generative gaussian splatting for unbounded 3d city generation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2025, pp. 6111–6120

2025

[25] [25]

Domain randomization for transferring deep neural networks from simulation to the real world,

J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, and P. Abbeel, “Domain randomization for transferring deep neural networks from simulation to the real world,” in 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2017, pp. 23–30

2017

[26] [26]

3d gaussian splatting for real-time radiance field rendering,

B. Kerbl, G. Kopanas, T. Leimkühler, and G. Drettakis, “3d gaussian splatting for real-time radiance field rendering,” ACM Transactions on Graphics, vol. 42, no. 4, July 2023. [Online]. Available: https://repo-sam.inria.fr/fungraph/3d-gaussian-splatting/

2023

[27] [27]

Citygaussian: Real-time high-quality large-scale scene rendering with gaussians,

Y. Liu, H. Guan, C. Luo, L. Fan, J. Peng, and Z. Zhang, “Citygaussian: Real-time high-quality large-scale scene rendering with gaussians,” 2024

2024

[28] [28]

Airsim: High-fidelity visual and physical simulation for autonomous vehicles,

S. Shah, D. Dey, C. Lovett, and A. Kapoor, “Airsim: High-fidelity visual and physical simulation for autonomous vehicles,” CoRR, vol. abs/1705.05065, 2017. [Online]. Available: http://arxiv.org/abs/1705.05065

Pith/arXiv arXiv 2017

[29] [29]

Flightgoggles: Photorealistic sensor simulation for perception-driven robotics using photogrammetry and virtual reality,

W. Guerra, E. Tal, V. Murali, G. Ryou, and S. Karaman, “Flightgoggles: Photorealistic sensor simulation for perception-driven robotics using photogrammetry and virtual reality,” in 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2019, pp. 6941–6948

2019

[30] [30]

Video generation models as world simulators,

T. Brooks, B. Peebles, C. Holmes, W. DePue, Y. Guo, L. Jing, D. Schnurr, J. Taylor, T. Luhman, E. Luhman, C. Ng, R. Wang, and A. Ramesh, “Video generation models as world simulators,” OpenAI Technical Report, 2024. [Online]. Available: https://openai.com/research/video-generation-models-as-world-simulators

2024

[31] [31]

2019 IEEE GRSS data fusion contest: Large-scale semantic 3d reconstruction,

B. Le Saux, N. Yokoya, R. Hansch, M. Brown, and G. Hager, “2019 IEEE GRSS data fusion contest: Large-scale semantic 3d reconstruction,” IEEE GRSS Magazine, 2019, WorldView-3 multi-stereo satellite imagery, Jacksonville FL

2019

[32] [32]

From orbit to ground: Generative city photogrammetry from extreme off-nadir satellite images,

F. Yu, Y. Liu, L. Tang, M. Sun, Z. Ge, R. Bu, Y. Jin, H. Zhao, H. Sun, Y. Li, M. Xu, W. Chen, and B. Chen, “From orbit to ground: Generative city photogrammetry from extreme off-nadir satellite images,” 2026. [Online]. Available: https://arxiv.org/abs/2512.07527

Pith/arXiv arXiv 2026

[33] [33]

Capturing, reconstructing, and simulating: the urbanscene3d dataset,

L. Lin, Y. Liu, Y. Hu, X. Yan, K. Xie, and H. Huang, “Capturing, reconstructing, and simulating: the urbanscene3d dataset,” in European Conference on Computer Vision (ECCV), 2022

2022

[34] [34]

Mega-nerf: Scalable construction of large-scale nerfs for virtual fly-throughs,

H. Turki, D. Ramanan, and M. Satyanarayanan, “Mega-nerf: Scalable construction of large-scale nerfs for virtual fly-throughs,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022

2022

[35] [35]

Drone-assisted road gaussian splatting with cross-view uncertainty,

S. Zhang, B. Ye, X. Chen, Y. Chen, Z. Zhang, C. Peng, Y. Shi, and H. Zhao, “Drone-assisted road gaussian splatting with cross-view uncertainty,” in arXiv preprint arXiv:2408.15242, 2024

arXiv 2024

[36] [36]

Urbanbis: a large-scale benchmark for fine-grained urban building instance segmentation,

G. Yang, F. Xue, Q. Zhang, K. Xie, C.-W. Fu, and H. Huang, “Urbanbis: a large-scale benchmark for fine-grained urban building instance segmentation,” ACM Transactions on Graphics (SIGGRAPH), vol. 42, no. 4, 2023

2023

[37] [37]

Crossloc: Scalable aerial localization assisted by multimodal synthetic data,

Q. Yan, J. Zheng, S. Reding, S. Li, and I. Doytchinov, “Crossloc: Scalable aerial localization assisted by multimodal synthetic data,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022

2022

[38] [38]

Uavd4l: A large-scale dataset for uav 6-dof localization,

R. Wu, X. Cheng, J. Zhu, X. Liu, M. Zhang, and S. Yan, “Uavd4l: A large-scale dataset for uav 6-dof localization,” in International Conference on 3D Vision (3DV), 2024

2024

[39] [39]

Vision-based uav self-positioning in low-altitude urban environments,

M. Dai, E. Zheng, Z. Feng, L. Qi, J. Zhuang, and W. Yang, “Vision-based uav self-positioning in low-altitude urban environments,” IEEE Transactions on Image Processing, vol. 33, pp. 493–508, 2024

2024

[40] [40]

Clod-gs: Continuous level-of-detail via 3d gaussian splatting,

Z. Cheng, M. Sun, Y. Liu, Z. Ge, L. Tang, M. Xu, Y. Li, and P. Pan, “Clod-gs: Continuous level-of-detail via 3d gaussian splatting,” in International Conference on Learning Representations, 2025. [Online]. Available: https://arxiv.org/abs/2510.09997

arXiv 2025

[41] [41]

Hunyuan3d 1.0: A unified framework for text-to-3d and image-to-3d generation,

T. H. Team, “Hunyuan3d 1.0: A unified framework for text-to-3d and image-to-3d generation,” 2024

2024

[42] [42]

Blockfusion: Expandable 3d scene generation using latent tri-plane extrapolation,

Z. Wu, Y. Li, H. Yan, T. Shang, W. Sun, S. Wang, R. Cui, W. Liu, H. Sato, H. Li et al., “Blockfusion: Expandable 3d scene generation using latent tri-plane extrapolation,” ACM Transactions on Graphics (ToG), vol. 43, no. 4, pp. 1–17, 2024

2024

[43] [43]

Xcube: Large-scale 3d generative modeling using sparse voxel hierarchies,

X. Ren, J. Huang, X. Zeng, K. Museth, S. Fidler, and F. Williams, “Xcube: Large-scale 3d generative modeling using sparse voxel hierarchies,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2024, pp. 4209–4219

2024

[44] [44]

OGC Abstract Specification Topic 2: Referencing by Coordinates (OGC 18-005r8),

Open Geospatial Consortium, “OGC Abstract Specification Topic 2: Referencing by Coordinates (OGC 18-005r8),” Open Geospatial Consortium, Abstract Specification, 2023. [Online]. Available: https://docs.ogc.org/as/18-005r8/18-005r8.pdf

2023

[45] [45]

Gans trained by a two time-scale update rule converge to a local nash equilibrium,

M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter, “Gans trained by a two time-scale update rule converge to a local nash equilibrium,” in Advances in Neural Information Processing Systems, vol. 30. Curran Associates, Inc., 2017

2017

[46] [46]

Demystifying mmd gans,

M. Bińkowski, D. J. Sutherland, M. Arbel, and A. Gretton, “Demystifying mmd gans,” 2021. [Online]. Available: https://arxiv.org/abs/1801.01401

Pith/arXiv arXiv 2021

[47] [47]

Lrm: Large reconstruction model for single image to 3d,

Y. Hong, K. Zhang, J. Gu, S. Bi, Y. Zhou, D. Liu, F. Liu, K. Sunkavalli, T. Bui, and H. Tan, “Lrm: Large reconstruction model for single image to 3d,” in International Conference on Learning Representations, vol. 2024, 2024, pp. 50678–50702

2024