pith. machine review for the scientific record.

arxiv: 2604.04667 · v2 · submitted 2026-04-06 · 💻 cs.CV · cs.LG · cs.RO

Recognition: unknown

ZeD-MAP: Bundle Adjustment Guided Zero-Shot Depth Maps for Real-Time Aerial Imaging

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 21:09 UTC · model grok-4.3

classification 💻 cs.CV · cs.LG · cs.RO

keywords zero-shot depth estimation · bundle adjustment · UAV imagery · real-time mapping · diffusion models · metric depth · aerial photogrammetry

The pith

Bundle adjustment on reprojected tie-points turns zero-shot diffusion depth estimates into metrically consistent real-time UAV maps.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how to make probabilistic zero-shot depth predictions from diffusion models metrically reliable for streaming ultra-high-resolution UAV imagery. Frames are grouped into overlapping clusters, incremental bundle adjustment computes consistent poses and sparse 3D points, and those points are reprojected to guide the depth model on selected frames. This hybrid step restores the metric scale and temporal consistency that pure diffusion outputs lack, while avoiding both retraining and dense multi-view stereo geometry. The approach targets time-critical uses such as disaster mapping, where both speed and sub-meter accuracy matter.
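
To make the streaming structure concrete, here is a minimal sketch of grouping a frame stream into fixed-size overlapping clusters. The cluster size and overlap are hypothetical illustration parameters; the paper's actual clustering criterion is not specified in this summary.

```python
# Sketch: group a UAV frame stream into overlapping clusters for
# periodic bundle adjustment. cluster_size and overlap are
# hypothetical parameters, not values taken from the paper.

def stream_clusters(frame_ids, cluster_size=8, overlap=2):
    """Yield overlapping clusters of frame indices from a stream."""
    step = cluster_size - overlap
    for start in range(0, max(len(frame_ids) - overlap, 1), step):
        cluster = frame_ids[start:start + cluster_size]
        if cluster:
            yield cluster

if __name__ == "__main__":
    frames = list(range(20))  # stand-in for streamed frame indices
    for c in stream_clusters(frames):
        print(c)  # each cluster would be handed to incremental BA
```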

Core claim

ZeD-MAP converts a test-time diffusion depth model into a SLAM-like mapping pipeline by integrating incremental cluster-based bundle adjustment. Streamed UAV frames are grouped into overlapping clusters; periodic BA produces metrically consistent poses and sparse 3D tie-points, which are reprojected into selected frames and used as metric guidance for diffusion-based depth estimation. Validation on ground-marker flights at approximately 50 m altitude shows sub-meter accuracy with 0.87 m horizontal and 0.12 m vertical error at per-image runtimes of 1.47 to 4.91 seconds.

What carries the argument

Incremental cluster-based bundle adjustment that reprojects sparse tie-points as metric guidance to correct the probabilistic outputs of zero-shot diffusion depth models.
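
As an illustration of what "reprojects sparse tie-points as metric guidance" means geometrically, the sketch below projects BA tie-points into one frame under a standard pinhole model, yielding a sparse metric depth map. The intrinsics K, the world-to-camera pose (R, t), and the rasterization of depths to nearest pixels are assumptions for illustration, not the paper's exact implementation.

```python
import numpy as np

def reproject_tie_points(points_w, R, t, K, image_shape):
    """Project BA tie-points (world frame, N x 3) into one image and
    return a sparse metric depth map: zero where no tie-point lands,
    depth in metres at projected pixels. Pinhole model, world-to-
    camera pose (R, t)."""
    h, w = image_shape
    pts_c = (R @ points_w.T + t[:, None]).T   # world -> camera frame
    in_front = pts_c[:, 2] > 0                # keep points ahead of camera
    pts_c = pts_c[in_front]
    uvw = (K @ pts_c.T).T                     # homogeneous pixel coords
    uv = uvw[:, :2] / uvw[:, 2:3]
    depth = pts_c[:, 2]
    sparse = np.zeros((h, w), dtype=np.float64)
    u = np.round(uv[:, 0]).astype(int)
    v = np.round(uv[:, 1]).astype(int)
    ok = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    sparse[v[ok], u[ok]] = depth[ok]
    return sparse  # consumed as metric guidance by the depth model
```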

If this is right

  • Real-time 3D map generation becomes feasible from ultra-high-resolution UAV streams without task-specific retraining or dense multi-view stereo.
  • Temporal consistency across sequential frames and overlapping tiles reaches levels comparable to classical photogrammetry at much higher speed.
  • The method handles wide-baseline parallax, low-texture surfaces, specular areas, and occlusions through the added metric constraints.
  • Per-image processing stays within 1.5 to 5 seconds, enabling deployment under strict computational limits for time-critical geospatial tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same guidance mechanism could be tested on other zero-shot depth predictors to check whether bundle adjustment works as a general metric regularizer.
  • Replacing periodic clustering with continuous online bundle adjustment might further reduce latency while preserving accuracy.
  • Extending the re-projection guidance to include surface normals or semantic labels from the same diffusion model could improve performance on thin structures and vegetation.

Load-bearing premise

Reprojected sparse tie-points from cluster bundle adjustment supply enough unbiased metric information to correct diffusion depth outputs across different textures and occlusions.
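
One minimal way such sparse metric anchors could correct a scale-ambiguous depth prediction is a least-squares scale-and-shift fit at the tie-point pixels. This is a common alignment baseline, shown here only to make the premise concrete; the paper's guidance mechanism operates inside the diffusion sampling and is not reproduced here.

```python
import numpy as np

def align_scale_shift(pred_rel, sparse_metric):
    """Fit d_metric ~ s * d_rel + b at pixels where sparse metric
    depth is available, then apply (s, b) to the whole prediction.
    A standard alignment baseline, not the paper's exact method."""
    mask = sparse_metric > 0          # pixels with a reprojected tie-point
    x = pred_rel[mask]
    y = sparse_metric[mask]
    A = np.stack([x, np.ones_like(x)], axis=1)
    (s, b), *_ = np.linalg.lstsq(A, y, rcond=None)
    return s * pred_rel + b
```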

What would settle it

Direct comparison of the output point clouds against the same manual ground-marker annotations on the MACS flights would falsify the claim if average horizontal errors exceed 1 m or vertical errors exceed 0.5 m.
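
That criterion reduces to a simple computation. A sketch, assuming matched N x 3 arrays of predicted and annotated marker coordinates in metres; the thresholds follow the stated 1 m / 0.5 m limits.

```python
import numpy as np

def falsification_check(pred_xyz, marker_xyz,
                        xy_limit=1.0, z_limit=0.5):
    """Mean horizontal (XY) and vertical (Z) errors against manual
    ground-marker annotations; the claim fails if either limit is
    exceeded. Arrays are N x 3 in metres, rows matched by marker."""
    d = pred_xyz - marker_xyz
    xy_err = float(np.linalg.norm(d[:, :2], axis=1).mean())
    z_err = float(np.abs(d[:, 2]).mean())
    falsified = (xy_err > xy_limit) or (z_err > z_limit)
    return xy_err, z_err, falsified
```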

Original abstract

Real-time depth reconstruction from ultra-high-resolution UAV imagery is essential for time-critical geospatial tasks such as disaster response, yet remains challenging due to wide-baseline parallax, large image sizes, low-texture or specular surfaces, occlusions, and strict computational constraints. Recent zero-shot diffusion models offer fast per-image dense predictions without task-specific retraining, and require fewer labelled datasets than transformer-based predictors while avoiding the rigid capture geometry requirement of classical multi-view stereo. However, their probabilistic inference prevents reliable metric accuracy and temporal consistency across sequential frames and overlapping tiles. We present ZeD-MAP, a cluster-level framework that converts a test-time diffusion depth model into a metrically consistent, SLAM-like mapping pipeline by integrating incremental cluster-based bundle adjustment (BA). Streamed UAV frames are grouped into overlapping clusters; periodic BA produces metrically consistent poses and sparse 3D tie-points, which are reprojected into selected frames and used as metric guidance for diffusion-based depth estimation. Validation on ground-marker flights captured at approximately 50 m altitude (GSD is approximately 0.85 cm/px, corresponding to 2,650 square meters ground coverage per frame) with the DLR Modular Aerial Camera System (MACS) shows that our method achieves sub-meter accuracy, with approximately 0.87 m error in the horizontal (XY) plane and 0.12 m in the vertical (Z) direction, while maintaining per-image runtimes between 1.47 and 4.91 seconds. Results are subject to minor noise from manual point-cloud annotation. These findings show that BA-based metric guidance provides consistency comparable to classical photogrammetric methods while significantly accelerating processing, enabling real-time 3D map generation.

Editorial analysis

A structured set of objections, weighed in public.

A referee report, a simulated author's rebuttal, a circularity audit, and an axiom ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces ZeD-MAP, a cluster-level pipeline that groups UAV image streams into overlapping clusters, runs incremental bundle adjustment to recover metrically consistent poses and sparse tie-points, and reprojects those tie-points to guide per-frame zero-shot diffusion depth estimation, thereby converting probabilistic depth predictions into temporally consistent metric depth maps. Validation on ground-marker flights at approximately 50 m altitude using the DLR MACS system reports sub-meter accuracy (0.87 m horizontal, 0.12 m vertical) with per-image runtimes between 1.47 and 4.91 seconds.

Significance. If the accuracy claims hold under independent verification, the work would demonstrate a practical route to real-time metric 3D mapping from high-resolution UAV imagery by fusing classical photogrammetric constraints with fast zero-shot models, offering a speed advantage over full multi-view stereo while preserving metric fidelity needed for disaster-response and geospatial tasks.

major comments (2)
  1. [Abstract] The reported 0.87 m XY / 0.12 m Z errors are measured against manually annotated point clouds, yet no quantitative bound is supplied on annotation precision, no error bars are given, and no independent reference (LiDAR or RTK-GPS) is described. If annotation noise is comparable to the stated figures, the experiment cannot establish that reprojected BA tie-points deliver unbiased metric correction to the diffusion outputs.
  2. [Method] Cluster-BA guidance: the central assumption that sparse reprojected tie-points suffice to correct diffusion depth estimates across low-texture and occluded regions is stated but not supported by an ablation that isolates the guidance term or quantifies residual bias after correction.
minor comments (2)
  1. [Abstract] The abstract states a GSD of approximately 0.85 cm/px and ground coverage of 2,650 m² per frame; these values should be cross-checked against the stated 50 m altitude and the sensor's focal length and resolution for internal consistency (a quick arithmetic check is sketched after this list).
  2. [Implementation] No mention of the specific diffusion model checkpoint or guidance scale used; these hyperparameters should be listed to enable reproduction.
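
The consistency check asked for in minor comment 1 is back-of-envelope arithmetic: the stated GSD and per-frame coverage jointly imply a pixel count that can be compared against the MACS sensor specification. A sketch, using only the two figures quoted from the abstract:

```python
# Back-of-envelope check for minor comment 1: does the stated GSD
# match the stated per-frame ground coverage?
gsd_m = 0.0085        # 0.85 cm/px, from the abstract
coverage_m2 = 2650.0  # per-frame ground coverage, from the abstract

pixels = coverage_m2 / gsd_m**2
print(f"implied frame size: {pixels / 1e6:.1f} Mpx")  # ~36.7 Mpx
# This implied pixel count should match the MACS sensor resolution;
# a mismatch would indicate an internal inconsistency in the abstract.
```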

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major point below and will revise the manuscript accordingly to improve clarity and support for our claims.

Point-by-point responses
  1. Referee: [Abstract] The reported 0.87 m XY / 0.12 m Z errors are measured against manually annotated point clouds, yet no quantitative bound is supplied on annotation precision, no error bars are given, and no independent reference (LiDAR or RTK-GPS) is described. If annotation noise is comparable to the stated figures, the experiment cannot establish that reprojected BA tie-points deliver unbiased metric correction to the diffusion outputs.

    Authors: We acknowledge the validity of this concern. The current manuscript notes minor noise from manual annotation but does not quantify its precision or provide error bars. In the revised version we will expand the experimental section with a detailed description of the annotation protocol (including repeated annotations by multiple operators to estimate inter-annotator variability) and will report corresponding precision bounds. We will also add explicit discussion of this as a limitation. The DLR MACS dataset used for validation does not contain LiDAR or RTK-GPS references, so we cannot supply an independent metric reference; we will state this limitation clearly while emphasizing that the reported figures demonstrate relative consistency with the best available ground truth for the given flights. revision: partial

  2. Referee: [Method] Cluster-BA guidance: the central assumption that sparse reprojected tie-points suffice to correct diffusion depth estimates across low-texture and occluded regions is stated but not supported by an ablation that isolates the guidance term or quantifies residual bias after correction.

    Authors: The referee correctly identifies that the manuscript states the guidance assumption without an isolating ablation. We will add a dedicated ablation study in the revised manuscript that compares zero-shot diffusion depth outputs with and without the reprojected BA tie-point guidance. The ablation will report quantitative metrics (e.g., RMSE against annotated markers) on low-texture and occluded regions to measure residual bias reduction attributable to the guidance term. This addition will directly substantiate the central claim. revision: yes
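
The promised ablation amounts to evaluating the same error metric with and without the guidance term. A hedged sketch of that comparison, where `predict_depth` is a hypothetical stand-in for the diffusion model and `region_mask` selects low-texture or occluded pixels:

```python
import numpy as np

def rmse(pred, gt, mask):
    """RMSE over a region mask, e.g. low-texture or occluded pixels."""
    d = pred[mask] - gt[mask]
    return float(np.sqrt(np.mean(d ** 2)))

def ablate_guidance(frame, sparse_guidance, gt_depth, region_mask,
                    predict_depth):
    """Compare depth error with and without BA tie-point guidance.
    predict_depth(frame, guidance) is a stand-in for the diffusion
    model; passing guidance=None disables the guidance term."""
    with_g = predict_depth(frame, guidance=sparse_guidance)
    without_g = predict_depth(frame, guidance=None)
    return {
        "rmse_with_guidance": rmse(with_g, gt_depth, region_mask),
        "rmse_without_guidance": rmse(without_g, gt_depth, region_mask),
    }
```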

Circularity Check

0 steps flagged

No circularity: derivation chain is self-contained with independent validation

Full rationale

The paper presents ZeD-MAP as a pipeline that applies incremental cluster bundle adjustment to produce metrically consistent poses and sparse tie-points, which are then reprojected to guide (not retrain) a zero-shot diffusion depth model. No equations or steps reduce a claimed prediction to a fitted parameter by construction, nor does any load-bearing premise collapse to a self-citation chain. The reported sub-meter errors are obtained from external comparison against manually annotated point clouds rather than being algebraically forced by the method's own inputs. The derivation therefore remains independent of its outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard photogrammetric assumptions plus the paper-specific premise that sparse tie-points can steer diffusion outputs to metric scale without retraining.

axioms (2)
  • domain assumption: Bundle adjustment on overlapping image clusters produces metrically consistent poses and sparse 3D tie-points.
    Invoked when periodic BA is described as producing guidance for diffusion depth estimation.
  • ad hoc to paper: Reprojected tie-points can be used as reliable metric guidance to correct probabilistic diffusion depth estimates.
    Core mechanism of the ZeD-MAP guidance step.

pith-pipeline@v0.9.0 · 5633 in / 1352 out tokens · 41424 ms · 2026-05-14T21:09:08.413837+00:00 · methodology

