TerraSky3D: Multi-View Reconstructions of European Landmarks in 4K
Pith reviewed 2026-05-14 21:45 UTC · model grok-4.3
The pith
TerraSky3D supplies 50,000 high-resolution images across 150 European landmark scenes with calibrated poses and depth maps.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors present TerraSky3D, a dataset of 50,000 high-resolution 4K images divided into 150 ground, aerial, and mixed scenes focused on European landmarks. Each scene is accompanied by curated calibration data, camera poses, and depth maps, created specifically to support training and evaluating advanced 3D reconstruction pipelines.
What carries the argument
The TerraSky3D multi-view dataset itself: high-resolution images with associated geometric annotations spanning ground, aerial, and combined capture altitudes.
Load-bearing premise
The collected images, poses, and depth maps maintain consistent quality, accuracy, and diversity across scenes.
What would settle it
An experiment in which state-of-the-art reconstruction networks trained on TerraSky3D show no measurable improvement in accuracy or completeness when tested on standard benchmarks compared with networks trained only on prior datasets.
Original abstract
Despite the growing need for data of more and more sophisticated 3D reconstruction pipelines, we can still observe a scarcity of suitable public datasets. Existing 3D datasets are either low resolution, limited to a small amount of scenes, based on images of varying quality because retrieved from the internet, or limited to specific capturing scenarios. Motivated by this lack of suitable 3D datasets, we captured TerraSky3D, a high-resolution large-scale 3D reconstruction dataset comprising 50,000 images divided into 150 ground, aerial, and mixed scenes. The dataset focuses on European landmarks and comes with curated calibration data, camera poses, and depth maps. TerraSky3D tries to answer the need for challenging dataset that can be used to train and evaluate 3D reconstruction-related pipelines.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents TerraSky3D, a new high-resolution 3D reconstruction dataset comprising 50,000 images across 150 ground, aerial, and mixed scenes of European landmarks, supplied with curated calibration data, camera poses, and depth maps to address the scarcity of suitable public datasets for training and evaluating sophisticated 3D pipelines.
Significance. A validated dataset of this scale and multi-view diversity could meaningfully advance 3D reconstruction research by providing challenging, landmark-focused captures that exceed the resolution or consistency limits of existing benchmarks. However, without demonstrated accuracy or comparative utility, the significance remains potential rather than established.
major comments (3)
- [Abstract] The central claim that the dataset is 'curated' and suitable for training/evaluation rests on the unverified assertion of high-quality calibration, poses, and depth maps, yet the text supplies no capture protocols, reprojection errors, pose RMSE against independent SfM, depth RMSE against LiDAR/stereo, or any other quantitative validation metric.
- [Dataset description] No ablation or histogram is provided on scene diversity (texture, lighting, viewpoint variation) or image-quality statistics, which directly undermines the claim that TerraSky3D improves upon existing datasets limited by low resolution or internet-sourced variability.
- [Evaluation] The manuscript reports no baseline 3D reconstruction results (e.g., from COLMAP or learned methods) and no comparisons against Tanks & Temples or ETH3D, leaving the asserted utility for pipelines unsupported by evidence.
minor comments (2)
- [Dataset] Add a table breaking down the 150 scenes by category (ground/aerial/mixed) with exact image counts per scene to improve transparency.
- [Capture protocol] Clarify the 4K resolution claim with explicit pixel dimensions and any downsampling applied during capture or processing.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript introducing the TerraSky3D dataset. We address each major comment below and will incorporate revisions to strengthen the description and validation of the data.
Point-by-point responses
- Referee ([Abstract]): The central claim that the dataset is 'curated' and suitable for training/evaluation rests on the unverified assertion of high-quality calibration, poses, and depth maps, yet the text supplies no capture protocols, reprojection errors, pose RMSE against independent SfM, depth RMSE against LiDAR/stereo, or any quantitative validation metrics.
  Authors: We acknowledge that the original abstract and text did not include these quantitative details. In the revised manuscript we will expand the methods section with capture protocols (camera models, acquisition setup, and synchronization) and report specific metrics, including average reprojection error from bundle adjustment, pose RMSE from cross-validation against independent SfM runs on overlapping subsets, and depth-consistency RMSE measured against multi-view stereo reconstructions. LiDAR ground truth was not collected during acquisition, so direct LiDAR RMSE is unavailable; we will instead emphasize the stereo-based validation. revision: yes
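The two pose checks promised above can be made concrete with a short sketch. This is a minimal illustration assuming NumPy arrays for intrinsics, a world-to-camera pose, and matched 2D–3D observations; the function names, shapes, and the pre-aligned-frames assumption are illustrative, not taken from the paper:

```python
import numpy as np

def reprojection_errors(K, R, t, points3d, points2d):
    """Pixel distance between observed keypoints and projected 3D points
    for one camera with intrinsics K and world-to-camera pose [R|t]."""
    cam = R @ points3d.T + t[:, None]      # 3xN points in the camera frame
    hom = K @ cam                          # 3xN homogeneous pixel coordinates
    proj = hom[:2] / hom[2]                # perspective divide -> 2xN pixels
    return np.linalg.norm(proj.T - points2d, axis=1)

def pose_rmse(centers_a, centers_b):
    """RMSE between corresponding camera centers from two independent SfM
    runs (assumes both sets are already expressed in a common frame)."""
    sq = np.sum((centers_a - centers_b) ** 2, axis=1)
    return float(np.sqrt(np.mean(sq)))
```

A dataset-validation table could then report the per-scene mean of `reprojection_errors` after bundle adjustment and `pose_rmse` between the released poses and an independent reconstruction of the same images.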
- Referee ([Dataset description]): No ablation or histogram is provided on scene diversity (texture, lighting, viewpoint variation) or image quality statistics, which directly undermines the claim that TerraSky3D improves upon existing datasets limited by low resolution or internet-sourced variability.
  Authors: We agree that explicit statistics would better substantiate the advantages over prior datasets. The revision will add a new subsection with histograms and summary tables quantifying scene diversity: texture complexity via gradient-magnitude distributions, lighting variation across capture times and conditions, viewpoint coverage (altitude, azimuth, and elevation ranges), and image-quality metrics such as average sharpness scores and resolution uniformity across the 50,000 images. revision: yes
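The texture-complexity statistic mentioned above could be computed roughly as follows. This is a sketch: the gradient-magnitude proxy and the bin count are assumptions for illustration, not the authors' stated procedure:

```python
import numpy as np

def texture_score(gray):
    """Mean gradient magnitude of a grayscale image (H x W array),
    a cheap proxy for texture complexity; one score per image feeds
    the proposed per-scene diversity histograms."""
    gy, gx = np.gradient(gray.astype(np.float64))  # row- and column-wise gradients
    return float(np.mean(np.hypot(gx, gy)))

def diversity_histogram(scores, bins=10):
    """Histogram of per-image scores for one scene, normalized to a
    probability distribution so scenes of different sizes compare."""
    hist, edges = np.histogram(scores, bins=bins)
    return hist / hist.sum(), edges
```

Analogous per-image scores (sharpness, mean luminance, capture altitude) could be pushed through the same `diversity_histogram` helper to populate the promised tables.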
- Referee ([Evaluation]): The manuscript reports no baseline 3D reconstruction results (e.g., on COLMAP or learned methods) or comparisons against Tanks & Temples / ETH3D, leaving the asserted utility for pipelines unsupported by evidence.
  Authors: The manuscript is structured as a dataset release paper rather than a methods benchmark. To directly address the concern, we will add a concise evaluation subsection demonstrating baseline usability: COLMAP reconstructions on a representative subset of scenes with reported completeness and accuracy metrics, plus a side-by-side comparison table highlighting TerraSky3D's higher resolution, landmark focus, and multi-view (ground/aerial) diversity relative to Tanks & Temples and ETH3D. revision: yes
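The accuracy and completeness numbers promised for the baseline are conventionally defined as fractions of points lying within a distance threshold of the other point cloud. A brute-force sketch, with the threshold and point sampling purely illustrative:

```python
import numpy as np

def nn_dists(a, b):
    """For each point in cloud a (Nx3), distance to its nearest point in
    cloud b (Mx3). Brute force; fine for small per-scene samples."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    return d.min(axis=1)

def accuracy_completeness(recon, gt, tau):
    """Accuracy: fraction of reconstructed points within tau of ground
    truth. Completeness: fraction of ground-truth points within tau of
    the reconstruction."""
    acc = float(np.mean(nn_dists(recon, gt) <= tau))
    comp = float(np.mean(nn_dists(gt, recon) <= tau))
    return acc, comp
```

Reporting these two numbers per scene at a fixed `tau` (as Tanks & Temples and ETH3D do with their own thresholds) would let readers compare COLMAP baselines on TerraSky3D against the existing benchmarks.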
Circularity Check
Dataset release paper exhibits no circularity
Full rationale
The manuscript is a dataset release paper that describes the capture and curation of 50,000 images across 150 scenes, along with provided calibration data, camera poses, and depth maps. There are no derivations, equations, predictions, fitted parameters, or load-bearing claims that reduce by construction to the paper's own inputs. The central assertion rests solely on the existence of the collected data rather than any self-referential reasoning, self-citation chains, or ansatz smuggling. No steps qualify as circular under the enumerated patterns.
Reference graph
Works this paper leans on
- [1] Vassileios Balntas, Karel Lenc, Andrea Vedaldi, and Krystian Mikolajczyk. HPatches: A benchmark and evaluation of handcrafted and learned local descriptors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5173–5182. IEEE, 2017.
- [2]
- [3] Gonglin Chen, Tianwen Fu, Haiwei Chen, Wenbin Teng, Hanyuan Xiao, and Yajie Zhao. RDD: Robust feature detector and descriptor using deformable transformer. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 6394–6403, 2025.
- [4] Bowen Cheng, Ishan Misra, Alexander G. Schwing, Alexander Kirillov, and Rohit Girdhar. Masked-attention mask transformer for universal image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 1290–1300, 2022.
- [5] Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. SuperPoint: Self-supervised interest point detection and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 224–236, 2018.
- [6] Mattia D'Urso, Emanuele Santellani, Christian Sormann, Mattia Rossi, Andreas Kuhn, and Friedrich Fraundorfer. A streamlined attention-based network for descriptor extraction. In 2026 International Conference on 3D Vision (3DV). IEEE Computer Society, 2026.
- [7] Johan Edstedt, Georg Bökman, Mårten Wadenbäck, and Michael Felsberg. DeDoDe: Detect, don't describe – describe, don't detect for local feature matching. arXiv preprint arXiv:2308.08479, 2023.
- [8] Johan Edstedt, Qiyu Sun, Georg Bökman, Mårten Wadenbäck, and Michael Felsberg. RoMa: Robust dense feature matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19790–19800, 2024.
- [9] Johan Edstedt, David Nordström, Yushan Zhang, Georg Bökman, Jonathan Astermark, Viktor Larsson, Anders Heyden, Fredrik Kahl, Mårten Wadenbäck, and Michael Felsberg. RoMa v2: Harder better faster denser feature matching. arXiv preprint arXiv:2511.15706, 2025.
- [10] Yuhe Jin, Dmytro Mishkin, Anastasiia Mishchuk, Jiri Matas, Pascal Fua, Kwang Moo Yi, and Eduard Trulls. Image matching across wide baselines: From paper to practice. International Journal of Computer Vision, 129(2):517–547, 2021.
- [11] Nikhil Keetha, Norman Müller, Johannes Schönberger, Lorenzo Porzi, Yuchen Zhang, Tobias Fischer, Arno Knapitsch, Duncan Zauss, Ethan Weber, Nelson Antunes, et al. MapAnything: Universal feed-forward metric 3D reconstruction. arXiv preprint arXiv:2509.13414, 2025.
- [12] Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3D Gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics, 42(4), 2023.
- [13] Arno Knapitsch, Jaesik Park, Qian-Yi Zhou, and Vladlen Koltun. Tanks and Temples: Benchmarking large-scale scene reconstruction. ACM Transactions on Graphics, 36(4), 2017.
- [14] Vincent Leroy, Yohann Cabon, and Jérôme Revaud. Grounding image matching in 3D with MASt3R. In European Conference on Computer Vision, pages 71–91. Springer, 2024.
- [15] Jiahao Li, Haochen Wang, Muhammad Zubair Irshad, Igor Vasiljevic, Matthew R. Walter, Vitor Campagnolo Guizilini, and Greg Shakhnarovich. FastMap: Revisiting dense and scalable structure from motion. arXiv preprint arXiv:2505.04612, 2025.
- [16] Yaxuan Li, Yewei Huang, Bijay Gaudel, Hamidreza Jafarnejadsani, and Brendan Englot. CVD-SfM: A cross-view deep front-end structure-from-motion system for sparse localization in multi-altitude scenes. arXiv preprint arXiv:2508.01936, 2025.
- [17] Zhengqi Li and Noah Snavely. MegaDepth: Learning single-view depth prediction from internet photos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2041–2050, 2018.
- [18] Philipp Lindenberger, Paul-Edouard Sarlin, and Marc Pollefeys. LightGlue: Local feature matching at light speed. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 17627–17638, 2023.
- [19] Linfei Pan, Dániel Baráth, Marc Pollefeys, and Johannes L. Schönberger. Global structure-from-motion revisited. In European Conference on Computer Vision, pages 58–77. Springer, 2024.
- [20] Filip Radenović, Ahmet Iscen, Giorgos Tolias, Yannis Avrithis, and Ondřej Chum. Revisiting Oxford and Paris: Large-scale image retrieval benchmarking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5706–5715, 2018.
- [21] Emanuele Santellani, Christian Sormann, Mattia Rossi, Andreas Kuhn, and Friedrich Fraundorfer. S-TREK: Sequential translation and rotation equivariant keypoints for local feature extraction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9728–9737, 2023.
- [22] Torsten Sattler, Will Maddern, Carl Toft, Akihiko Torii, Lars Hammarstrand, Erik Stenborg, Daniel Safari, Masatoshi Okutomi, Marc Pollefeys, Josef Sivic, et al. Benchmarking 6DOF outdoor visual localization in changing conditions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8601–8610, 2018.
- [23] Johannes L. Schonberger and Jan-Michael Frahm. Structure-from-motion revisited. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4104–4113, 2016.
- [24] Thomas Schops, Johannes L. Schonberger, Silvano Galliani, Torsten Sattler, Konrad Schindler, Marc Pollefeys, and Andreas Geiger. A multi-view stereo benchmark with high-resolution images and multi-camera videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3260–3269, 2017.
- [25] Jiaming Sun, Zehong Shen, Yuang Wang, Hujun Bao, and Xiaowei Zhou. LoFTR: Detector-free local feature matching with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8922–8931.
- [26] Akihiko Torii, Relja Arandjelovic, Josef Sivic, Masatoshi Okutomi, and Tomas Pajdla. 24/7 place recognition by view synthesis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1808–1817, 2015.
- [27] Joseph Tung, Gene Chou, Ruojin Cai, Guandao Yang, Kai Zhang, Gordon Wetzstein, Bharath Hariharan, and Noah Snavely. MegaScenes: Scene-level view synthesis at scale. In ECCV, 2024.
- [28] Michał Tyszkiewicz, Pascal Fua, and Eduard Trulls. DISK: Learning local features with policy gradient. Advances in Neural Information Processing Systems, 33:14254–14265, 2020.
- [29] Khiem Vuong, Anurag Ghosh, Deva Ramanan, Srinivasa Narasimhan, and Shubham Tulsiani. AerialMegaDepth: Learning aerial-ground reconstruction and view synthesis. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 21674–21684, 2025.
- [30] Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. VGGT: Visual geometry grounded transformer. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 5294–5306, 2025.
- [31] Yuesong Wang, Zhaojie Zeng, Tao Guan, Wei Yang, Zhuo Chen, Wenkai Liu, Luoyuan Xu, and Yawei Luo. Adaptive patch deformation for textureless-resilient multi-view stereo. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 1621–1630.
- [32] Yifan Wang, Jianjun Zhou, Haoyi Zhu, Wenzheng Chang, Yang Zhou, Zizun Li, Junyi Chen, Jiangmiao Pang, Chunhua Shen, and Tong He. π³: Permutation-equivariant visual geometry learning. arXiv preprint arXiv:2507.13347, 2025.
- [33] Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 586–595, 2018.
- [34] Xiaoming Zhao, Xingming Wu, Weihai Chen, Peter C. Y. Chen, Qingsong Xu, and Zhengguo Li. ALIKED: A lighter keypoint and descriptor extraction network via deformable transformation. IEEE Transactions on Instrumentation and Measurement, 72:1–16, 2023.
- [35] Xinyi Zheng, Steve Zhang, Weizhe Lin, Aaron Zhang, Walterio W. Mayol-Cuevas, and Junxiao Shen. Culture3D: Cultural landmarks and terrain dataset for 3D applications. arXiv preprint arXiv:2501.06927, 2025.
- [36] Zhedong Zheng, Yunchao Wei, and Yi Yang. University-1652: A multi-view multi-source benchmark for drone-based geo-localization. In Proceedings of the 28th ACM International Conference on Multimedia, pages 1395–1403, 2020.