pith. machine review for the scientific record.

arxiv: 2604.03120 · v1 · submitted 2026-04-03 · 💻 cs.CV · cs.RO

Recognition: 1 theorem link · Lean Theorem

SCC-Loc: A Unified Semantic Cascade Consensus Framework for UAV Thermal Geo-Localization

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 19:30 UTC · model grok-4.3

classification 💻 cs.CV cs.RO
keywords: thermal geo-localization · UAV navigation · cross-modal matching · DINOv2 · semantic cascade · satellite alignment · GNSS-denied positioning

The pith

SCC-Loc achieves 9.37 m mean error in UAV thermal geo-localization by sharing one DINOv2 backbone across retrieval and matching

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper addresses the severe feature ambiguity that arises when matching thermal UAV images to visible satellite references for absolute positioning without GNSS. It introduces SCC-Loc, a framework that reuses a single DINOv2 backbone for both coarse retrieval and fine matching while adding three modules to correct spatial deviations, remove cross-modal outliers, and select reliable poses. The authors also release the Thermal-UAV dataset of 11,890 thermal queries paired with a satellite ortho-photo and an aligned DSM. Experiments report new state-of-the-art results, cutting mean localization error to 9.37 m and raising accuracy inside a 5 m threshold by a factor of 7.6 over the prior best method. A sympathetic reader would care because the work targets reliable all-weather UAV operation in signal-denied environments where conventional registration fails.

Core claim

SCC-Loc performs accurate absolute position estimation from thermal UAV images against satellite references by sharing DINOv2 features and chaining semantic-guided viewport alignment, cascaded spatial-adaptive texture-structure filtering, and consensus-driven reliability-aware position selection to resolve modality gaps.

What carries the argument

The unified Semantic-Cascade-Consensus framework that shares a DINOv2 backbone and deploys SGVA for adaptive crop alignment, C-SATSF for geometric outlier removal, and CD-RAPS for physically constrained pose optimization.
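The module chain named above can be made concrete with a minimal runnable sketch. Everything here is a placeholder standing in for the paper's component of the same name; the thresholds, offsets, and weighting scheme are invented for illustration, not taken from the authors' implementation.

```python
import numpy as np

def sgva(crop_center, semantic_offset):
    # SGVA stand-in: shift the satellite crop centre to correct the
    # initial spatial deviation estimated from semantic features.
    return np.asarray(crop_center, float) + np.asarray(semantic_offset, float)

def c_satsf(matches, max_residual=2.0):
    # C-SATSF stand-in: cascaded filtering reduced to one stage that
    # keeps matches whose geometric residual is below a threshold.
    return [m for m in matches if abs(m["residual"]) <= max_residual]

def cd_raps(candidate_positions, reliabilities):
    # CD-RAPS stand-in: reliability-weighted consensus over candidate
    # positions, in place of physically constrained pose optimization.
    w = np.asarray(reliabilities, float)
    p = np.asarray(candidate_positions, float)
    return (w[:, None] * p).sum(axis=0) / w.sum()
```

Chaining the three stand-ins mirrors the retrieval, alignment, filtering, and selection order the framework describes.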

Load-bearing premise

The semantic features from DINOv2 combined with the three proposed modules will generalize to new thermal-visible pairs without overfitting to the specific alignment quality or scene statistics of the Thermal-UAV dataset.

What would settle it

Evaluating SCC-Loc on an independent thermal UAV dataset collected in a different geographic region or under different thermal conditions and checking whether the reported 9.37 m mean error and 7.6-fold tight-threshold gain are preserved.
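Checking that claim requires only the two headline metrics. A small helper, assuming metric (e.g. UTM) coordinates, shows how they would be computed:

```python
import numpy as np

def localization_metrics(pred_xy, gt_xy, threshold_m=5.0):
    """Mean localization error plus the fraction of queries localized
    within a distance threshold (the paper reports 9.37 m mean error
    and a 7.6-fold accuracy gain at the 5 m threshold)."""
    pred = np.asarray(pred_xy, dtype=float)
    gt = np.asarray(gt_xy, dtype=float)
    errors = np.linalg.norm(pred - gt, axis=1)  # per-query error in metres
    return {
        "mean_error_m": float(errors.mean()),
        f"acc@{threshold_m:g}m": float((errors <= threshold_m).mean()),
    }
```

Re-running these metrics on an independent dataset and comparing against the published numbers is the replication test described above.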

Figures

Figures reproduced from arXiv: 2604.03120 by Huaxin Xiao, Jinyu Liang, Kangqiushi Li, Xiaoran Zhang, Yu Liu, Zhiwei Huang.

Figure 1: Conceptual comparison of the existing paradigm and our SCC-Loc.
Figure 2: Overview of the proposed SCC-Loc framework. The pipeline consists of four main stages…
Figure 3: Visual samples from the constructed Thermal-UAV dataset. The dataset systematically captures profound modality discrepancies and diurnal thermal…
Figure 4: Qualitative visualization of the proposed SCC-Loc pipeline in (a) Urban and (b) Rural scenarios. The process illustrates the adaptive correction of spatial…
Original abstract

Cross-modal Thermal Geo-localization (TG) provides a robust, all-weather solution for Unmanned Aerial Vehicles (UAVs) in Global Navigation Satellite System (GNSS)-denied environments. However, profound thermal-visible modality gaps introduce severe feature ambiguity, systematically corrupting conventional coarse-to-fine registration. To dismantle this bottleneck, we propose SCC-Loc, a unified Semantic-Cascade-Consensus localization framework. By sharing a single DINOv2 backbone across global retrieval and MINIMA$_{\text{RoMa}}$ matching, it minimizes memory footprint and achieves zero-shot, highly accurate absolute position estimation. Specifically, we tackle modality ambiguity by introducing three cohesive components. First, we design the Semantic-Guided Viewport Alignment (SGVA) module to adaptively optimize satellite crop regions, effectively correcting initial spatial deviations. Second, we develop the Cascaded Spatial-Adaptive Texture-Structure Filtering (C-SATSF) mechanism to explicitly enforce geometric consistency, thereby eradicating dense cross-modal outliers. Finally, we propose the Consensus-Driven Reliability-Aware Position Selection (CD-RAPS) strategy to derive the optimal solution through a synergy of physically constrained pose optimization. To address data scarcity, we construct Thermal-UAV, a comprehensive dataset providing 11,890 diverse thermal queries referenced against a large-scale satellite ortho-photo and corresponding spatially aligned Digital Surface Model (DSM). Extensive experiments demonstrate that SCC-Loc establishes a new state-of-the-art, suppressing the mean localization error to 9.37 m and providing a 7.6-fold accuracy improvement within a strict 5-m threshold over the strongest baseline. Code and dataset are available at https://github.com/FloralHercules/SCC-Loc.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper presents SCC-Loc, a unified semantic cascade consensus framework for cross-modal UAV thermal geo-localization. It shares a single DINOv2 backbone for global retrieval and MINIMA_RoMa matching, and introduces three modules (SGVA for adaptive satellite crop optimization, C-SATSF for enforcing geometric consistency via cascaded filtering, and CD-RAPS for reliability-aware pose selection). The authors release the Thermal-UAV dataset (11,890 thermal queries aligned to satellite ortho-photos and DSM) and report SOTA results of 9.37 m mean localization error with a 7.6-fold accuracy gain inside a strict 5 m threshold.

Significance. If the gains hold beyond the new dataset, the work provides a practical, memory-efficient advance for GNSS-denied UAV navigation by directly tackling thermal-visible feature ambiguity with semantic features and consensus optimization. The open release of code and dataset, together with the zero-shot use of a shared backbone, are clear strengths that support reproducibility and follow-on research.

major comments (1)
  1. [Experiments section] All quantitative results, including the reported 9.37 m mean error and 7.6-fold improvement at the 5 m threshold, are obtained exclusively on the newly introduced Thermal-UAV dataset. To substantiate the claim of solving the general modality-gap problem rather than exploiting collection-specific regularities (viewport statistics, DSM alignment quality, or thermal-visible pair construction), evaluation on at least one established prior TG benchmark or a held-out geographic split is required.
minor comments (1)
  1. [Method section] The integration of MINIMA_RoMa with the DINOv2 backbone should be described with a brief equation or pseudocode in the method section for clarity.
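One way the requested pseudocode could look, with the backbone replaced by a deterministic stand-in. A real implementation would run the frozen DINOv2 ViT; the pooling and similarity choices here are assumptions for illustration, not the authors' exact design.

```python
import numpy as np

def dinov2_features(image_hw3, dim=768, grid=16):
    # Stand-in for a frozen DINOv2 backbone: returns a grid of patch
    # tokens. Here the image is hashed into deterministic pseudo-features
    # purely so the example runs end to end.
    seed = int(np.asarray(image_hw3).sum()) % (2**32)
    return np.random.default_rng(seed).standard_normal((grid, grid, dim))

def global_descriptor(patch_tokens):
    # Coarse-retrieval head: pool patch tokens into one L2-normalised
    # global vector (mean pooling as a simple placeholder).
    v = patch_tokens.mean(axis=(0, 1))
    return v / np.linalg.norm(v)

def retrieve(query_img, sat_tiles, top_k=1):
    # Rank satellite tiles by cosine similarity. The same patch tokens
    # that feed the descriptor would be reused by the dense matcher
    # (MINIMA_RoMa in the paper); that reuse is the point of sharing
    # one backbone across both stages.
    q = global_descriptor(dinov2_features(query_img))
    sims = [float(q @ global_descriptor(dinov2_features(t))) for t in sat_tiles]
    return sorted(range(len(sat_tiles)), key=lambda i: -sims[i])[:top_k]
```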

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback on generalizability. We agree that all current quantitative results are reported exclusively on the new Thermal-UAV dataset and that additional validation is required to demonstrate that SCC-Loc addresses the modality gap in a general manner rather than dataset-specific artifacts. In the revised manuscript we will add a geographically held-out split evaluation (no location overlap between training and test sets) and will explicitly discuss the limitations of relying solely on the new dataset.

Point-by-point responses
  1. Referee: [Experiments section] All quantitative results, including the reported 9.37 m mean error and 7.6-fold improvement at the 5 m threshold, are obtained exclusively on the newly introduced Thermal-UAV dataset. To substantiate the claim of solving the general modality-gap problem rather than exploiting collection-specific regularities (viewport statistics, DSM alignment quality, or thermal-visible pair construction), evaluation on at least one established prior TG benchmark or a held-out geographic split is required.

    Authors: We acknowledge the validity of this concern. The Thermal-UAV dataset was constructed specifically to fill the data gap in thermal-to-satellite geo-localization, and all reported metrics (9.37 m mean error, 7.6-fold gain at 5 m) are indeed obtained on this new collection. In the revision we will add a held-out geographic split experiment: the dataset will be partitioned by geographic region so that test queries come from entirely unseen locations, thereby testing robustness to different viewport statistics and DSM characteristics. We will report the same metrics on this split and include an analysis of any performance drop. While we agree an established prior TG benchmark would be ideal, most existing cross-modal geo-localization datasets are either not publicly released, use different sensor modalities, or lack aligned DSMs; we will note this limitation and state that future work will seek compatible benchmarks when they become available. revision: yes
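The promised held-out evaluation can be sketched in a few lines. Splitting on one metric coordinate axis is an assumption here; the authors may instead partition by polygons or grid cells.

```python
import numpy as np

def geographic_split(query_xy, test_fraction=0.3, axis=0):
    # Hold out one contiguous geographic region as the test set so that
    # no test query location overlaps the training locations.
    xy = np.asarray(query_xy, dtype=float)
    cut = np.quantile(xy[:, axis], 1.0 - test_fraction)
    test_mask = xy[:, axis] >= cut
    return np.where(~test_mask)[0], np.where(test_mask)[0]
```

Reporting the same metrics on such a split, and analysing any drop, is the robustness check the response commits to.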

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper introduces SCC-Loc as a framework combining a shared DINOv2 backbone with three new modules (SGVA, C-SATSF, CD-RAPS) and evaluates performance empirically on the newly constructed Thermal-UAV dataset. No mathematical derivations, equations, or predictions are presented that reduce by construction to the method's own inputs or fitted parameters. Performance numbers (e.g., 9.37 m mean error) are reported as direct experimental outcomes rather than outputs of any self-referential fitting process. No load-bearing self-citations, uniqueness theorems imported from prior author work, or ansatzes smuggled via citation appear in the provided text. The central claims rest on external pre-trained features and measured results on the introduced data, making the chain self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The method rests on the standard assumption that DINOv2 features transfer across thermal-visible domains and on the domain assumption that satellite ortho-photos plus DSM provide reliable geometric ground truth. No new physical entities are postulated.

axioms (2)
  • domain assumption DINOv2 features are sufficiently invariant to thermal-visible modality shift for both retrieval and dense matching
    Invoked when sharing the single backbone across global retrieval and MINIMA_RoMa matching stages
  • domain assumption Semantic cues can reliably correct initial spatial deviations between thermal query and satellite reference
    Core premise of the SGVA module

pith-pipeline@v0.9.0 · 5633 in / 1482 out tokens · 26645 ms · 2026-05-13T19:30:49.467722+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tag: unclear

    Relation between the paper passage and the cited Recognition theorem:

    "By sharing a single DINOv2 backbone across global retrieval and MINIMA RoMa matching... Semantic-Guided Viewport Alignment (SGVA) module... Cascaded Spatial-Adaptive Texture-Structure Filtering (C-SATSF)... Consensus-Driven Reliability-Aware Position Selection (CD-RAPS)"

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

47 extracted references · 47 canonical work pages · 4 internal anchors

  1. [1]

    Search and rescue operation using uavs: A case study,

    I. Martinez-Alpiste, G. Golcarenarenji, Q. Wang, and J. M. Alcaraz-Calero, “Search and rescue operation using uavs: A case study,” Expert Syst. Appl., vol. 178, p. 114937, 2021

  2. [2]

    Drones and border control: An examination of state and non-state actor use of uavs along borders,

    R. Koslowski, “Drones and border control: An examination of state and non-state actor use of uavs along borders,” in Research Handbook on International Migration and Digital Technology. Edward Elgar Publishing, 2021, pp. 152–165

  3. [3]

    University-1652: A multi-view multi-source benchmark for drone-based geo-localization,

    Z. Zheng, Y. Wei, and Y. Yang, “University-1652: A multi-view multi-source benchmark for drone-based geo-localization,” in Proc. ACM Int. Conf. Multimedia, 2020, pp. 1395–1403

  4. [4]

    A review on deep learning for uav absolute visual localization,

    A. Couturier and M. A. Akhloufi, “A review on deep learning for uav absolute visual localization,” Drones, vol. 8, no. 11, p. 622, 2024

  5. [5]

    Long-range uav thermal geo-localization with satellite imagery,

    J. Xiao, D. Tortei, E. Roura, and G. Loianno, “Long-range uav thermal geo-localization with satellite imagery,” in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst. (IROS). IEEE, 2023, pp. 5820–5827

  6. [6]

    Sthn: Deep homography estimation for uav thermal geo-localization with satellite imagery,

    J. Xiao, N. Zhang, D. Tortei, and G. Loianno, “Sthn: Deep homography estimation for uav thermal geo-localization with satellite imagery,” IEEE Robot. Autom. Lett., 2024

  7. [7]

    Uasthn: Uncertainty-aware deep homography estimation for uav satellite-thermal geo-localization,

    J. Xiao and G. Loianno, “Uasthn: Uncertainty-aware deep homography estimation for uav satellite-thermal geo-localization,” in Proc. IEEE Int. Conf. Robot. Autom. (ICRA). IEEE, 2025, pp. 14066–14072

  8. [8]

    Leveraging map retrieval and alignment for robust uav visual geo-localization,

    M. He, J. Liu, P. Gu, and Z. Meng, “Leveraging map retrieval and alignment for robust uav visual geo-localization,” IEEE Trans. Instrum. Meas., vol. 73, pp. 1–13, 2024

  9. [9]

    Exploring the best way for UAV visual localization under Low-altitude Multi-view Observation Condition: a Benchmark

    Y. Ye, X. Teng, S. Chen, Z. Li, L. Liu, Q. Yu, and T. Tan, “Exploring the best way for uav visual localization under low-altitude multi-view observation condition: a benchmark,” arXiv:2503.10692, 2025

  10. [10]

    Airgeonet: A map-guided visual geo-localization approach for aerial vehicles,

    X. Meng, W. Guo, K. Zhou, T. Sun, L. Deng, S. Yu, and Y. Feng, “Airgeonet: A map-guided visual geo-localization approach for aerial vehicles,” IEEE Trans. Geosci. Remote Sens., 2024

  11. [11]

    Xoftr: Cross-modal feature matching transformer,

    Ö. Tuzcuoğlu, A. Köksal, B. Sofu, S. Kalkan, and A. A. Alatan, “Xoftr: Cross-modal feature matching transformer,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2024, pp. 4275–4286

  12. [12]

    Uav-geoloc: A large-vocabulary dataset and geometry-transformed method for uav geo-localization,

    R. Wu, J. Deng, M. Mou, X. He, M. Zhang, Y. Liu, and S. Yan, “Uav-geoloc: A large-vocabulary dataset and geometry-transformed method for uav geo-localization,” IEEE Robot. Autom. Lett., 2025

  13. [13]

    Game4loc: A uav geo-localization benchmark from game data,

    Y. Ji, B. He, Z. Tan, and L. Wu, “Game4loc: A uav geo-localization benchmark from game data,” in Proc. AAAI Conf. Artif. Intell., vol. 39, no. 4, 2025, pp. 3913–3921

  14. [14]

    Uav-visloc: A large-scale dataset for uav visual localization,

    W. Xu, Y. Yao, J. Cao, Z. Wei, C. Liu, J. Wang, and M. Peng, “Uav-visloc: A large-scale dataset for uav visual localization,” arXiv:2405.11936, 2024

  15. [15]

    Sues-200: A multi-height multi-scene cross-view image benchmark across drone and satellite,

    R. Zhu, L. Yin, M. Yang, F. Wu, Y. Yang, and W. Hu, “Sues-200: A multi-height multi-scene cross-view image benchmark across drone and satellite,” IEEE Trans. Circuits Syst. Video Technol., vol. 33, no. 9, pp. 4825–4839, 2023

  16. [16]

    Vision-based uav self-positioning in low-altitude urban environments,

    M. Dai, E. Zheng, Z. Feng, L. Qi, J. Zhuang, and W. Yang, “Vision-based uav self-positioning in low-altitude urban environments,” IEEE Trans. Image Process., vol. 33, pp. 493–508, 2023

  17. [17]

    Uav geo-localization for navigation: A survey,

    D. Avola, L. Cinque, E. Emam, F. Fontana, G. L. Foresti, M. R. Marini, A. Mecca, and D. Pannone, “Uav geo-localization for navigation: A survey,” IEEE Access, 2024

  18. [18]

    Mmgeo: Multimodal compositional geo-localization for uavs,

    Y. Ji, B. He, Z. Tan, and L. Wu, “Mmgeo: Multimodal compositional geo-localization for uavs,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), 2025, pp. 25165–25175

  19. [19]

    Netvlad: Cnn architecture for weakly supervised place recognition,

    R. Arandjelovic, P. Gronat, A. Torii, T. Pajdla, and J. Sivic, “Netvlad: Cnn architecture for weakly supervised place recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2016, pp. 5297–5307

  20. [20]

    A transformer-based feature segmentation and region alignment method for uav-view geo-localization,

    M. Dai, J. Hu, J. Zhuang, and E. Zheng, “A transformer-based feature segmentation and region alignment method for uav-view geo-localization,” IEEE Trans. Circuits Syst. Video Technol., vol. 32, no. 7, pp. 4376–4389, 2021

  21. [21]

    Camp: A cross-view geo-localization method using contrastive attributes mining and position-aware partitioning,

    Q. Wu, Y. Wan, Z. Zheng, Y. Zhang, G. Wang, and Z. Zhao, “Camp: A cross-view geo-localization method using contrastive attributes mining and position-aware partitioning,” IEEE Trans. Geosci. Remote Sens., 2024

  22. [22]

    Segcn: A semantic-aware graph convolutional network for uav geo-localization,

    X. Liu, Z. Wang, Y. Wu, and Q. Miao, “Segcn: A semantic-aware graph convolutional network for uav geo-localization,” IEEE J. Sel. Top. Appl. Earth Observ. Remote Sens., vol. 17, pp. 6055–6066, 2024

  23. [23]

    Anyloc: Towards universal visual place recognition,

    N. Keetha, A. Mishra, J. Karhade, K. M. Jatavallabhula, S. Scherer, M. Krishna, and S. Garg, “Anyloc: Towards universal visual place recognition,” IEEE Robot. Autom. Lett., vol. 9, no. 2, pp. 1286–1293, 2023

  24. [24]

    DINOv3

    O. Siméoni, H. V. Vo, M. Seitzer, F. Baldassarre, M. Oquab, C. Jose, V. Khalidov, M. Szafraniec, S. Yi, M. Ramamonjisoa et al., “Dinov3,” arXiv:2508.10104, 2025

  25. [25]

    Fast normalized cross-correlation,

    J.-C. Yoo and T. H. Han, “Fast normalized cross-correlation,” Circuits, Syst. Signal Process., vol. 28, no. 6, pp. 819–843, 2009

  26. [26]

    Superpoint: Self-supervised interest point detection and description,

    D. DeTone, T. Malisiewicz, and A. Rabinovich, “Superpoint: Self-supervised interest point detection and description,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. Workshops, 2018, pp. 224–236

  27. [27]

    Loftr: Detector-free local feature matching with transformers,

    J. Sun, Z. Shen, Y. Wang, H. Bao, and X. Zhou, “Loftr: Detector-free local feature matching with transformers,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2021, pp. 8922–8931

  28. [28]

    Roma: Robust dense feature matching,

    J. Edstedt, Q. Sun, G. Bökman, M. Wadenbäck, and M. Felsberg, “Roma: Robust dense feature matching,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2024, pp. 19790–19800

  29. [29]

    Minima: Modality invariant image matching,

    J. Ren, X. Jiang, Z. Li, D. Liang, X. Zhou, and X. Bai, “Minima: Modality invariant image matching,” in Proc. Comput. Vis. Pattern Recognit. Conf. (CVPR), 2025, pp. 23059–23068

  30. [30]

    Os-fpi: A coarse-to-fine one-stream network for uav geolocalization,

    J. Chen, E. Zheng, M. Dai, Y. Chen, and Y. Lu, “Os-fpi: A coarse-to-fine one-stream network for uav geolocalization,” IEEE J. Sel. Top. Appl. Earth Observ. Remote Sens., vol. 17, pp. 7852–7866, 2024

  31. [31]

    Enhancing uav geo-location with multi-modal transformer networks: The mmglt approach,

    W. Xu, N. Chen, J. Yuan, J. Fan, W. Chen, and E. Zheng, “Enhancing uav geo-location with multi-modal transformer networks: The mmglt approach,” IEEE J. Sel. Top. Appl. Earth Observ. Remote Sens., 2026

  32. [32]

    Uav-tirvis: A benchmark dataset for thermal–visible image registration from aerial platforms,

    C.-E. Vasile, C. Bîră, and R. Hobincu, “Uav-tirvis: A benchmark dataset for thermal–visible image registration from aerial platforms,” J. Imag., vol. 11, no. 12, p. 432, 2025

  33. [33]

    Mcgs-reid: A visible-infrared vehicle reidentification method using modal-cross graph sampler,

    J. Liu, C. Zhao, C. Zhao, N. Su, W. Lu, Y. Yan, S. Feng, and Y. Qu, “Mcgs-reid: A visible-infrared vehicle reidentification method using modal-cross graph sampler,” IEEE J. Sel. Top. Appl. Earth Observ. Remote Sens., vol. 18, pp. 18806–18818, 2024

  34. [34]

    Multimodal absolute visual localization for unmanned aerial vehicles,

    Z. Liu, H. Li, Z. Zhang, Y. Lyu, and J. Xiong, “Multimodal absolute visual localization for unmanned aerial vehicles,” IEEE Trans. Veh. Technol., vol. 73, no. 11, pp. 16402–16415, 2024

  35. [35]

    DINOv2: Learning Robust Visual Features without Supervision

    M. Oquab, T. Darcet, T. Moutakanni, H. Vo, M. Szafraniec, V. Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Nouby et al., “Dinov2: Learning robust visual features without supervision,” arXiv:2304.07193, 2023

  36. [36]

    Fine-tuning cnn image retrieval with no human annotation,

    F. Radenović, G. Tolias, and O. Chum, “Fine-tuning cnn image retrieval with no human annotation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 41, no. 7, pp. 1655–1668, 2018

  37. [37]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    A. Dosovitskiy, “An image is worth 16x16 words: Transformers for image recognition at scale,” arXiv:2010.11929, 2020

  38. [38]

    Pnp problem revisited,

    Y. Wu and Z. Hu, “Pnp problem revisited,” J. Math. Imag. Vis., vol. 24, no. 1, pp. 131–141, 2006

  39. [39]

    Locality preserving matching,

    J. Ma, J. Zhao, J. Jiang, H. Zhou, and X. Guo, “Locality preserving matching,” Int. J. Comput. Vis., vol. 127, no. 5, pp. 512–531, 2019

  40. [40]

    Reliable image matching via photometric and geometric constraints structured by delaunay triangulation,

    S. Jiang and W. Jiang, “Reliable image matching via photometric and geometric constraints structured by delaunay triangulation,” ISPRS J. Photogrammetry Remote Sens., vol. 153, pp. 1–20, 2019

  41. [41]

    Visual place recognition: A survey,

    S. Lowry, N. Sünderhauf, P. Newman, J. J. Leonard, D. Cox, P. Corke, and M. J. Milford, “Visual place recognition: A survey,” IEEE Trans. Robot., vol. 32, no. 1, pp. 1–19, 2015

  42. [42]

    Vins-mono: A robust and versatile monocular visual-inertial state estimator,

    T. Qin, P. Li, and S. Shen, “Vins-mono: A robust and versatile monocular visual-inertial state estimator,” IEEE Trans. Robot., vol. 34, no. 4, pp. 1004–1020, 2018

  43. [43]

    A micro lie theory for state estimation in robotics,

    J. Sola, J. Deray, and D. Atchuthan, “A micro lie theory for state estimation in robotics,” arXiv:1812.01537, 2018

  44. [44]

    Fundamentals of statistical signal processing: Estimation theory,

    S. K. Sengijpta, “Fundamentals of statistical signal processing: Estimation theory,” 1995

  45. [45]

    On degeneracy of optimization-based state estimation problems,

    J. Zhang, M. Kaess, and S. Singh, “On degeneracy of optimization-based state estimation problems,” in Proc. IEEE Int. Conf. Robot. Autom. (ICRA). IEEE, 2016, pp. 809–816

  46. [46]

    Ortholoc: Uav 6-dof localization and calibration using orthographic geodata,

    O. Dhaouadi, R. Marin, J. Meier, J. Kaiser, and D. Cremers, “Ortholoc: Uav 6-dof localization and calibration using orthographic geodata,” arXiv:2509.18350, 2025

  47. [47]

    Geovins: Geographic-visual-inertial navigation system for large-scale drift-free aerial state estimation,

    C. Li, M. He, C. Chen, J. Liu, X. Lyu, G. Huang, and Z. Meng, “Geovins: Geographic-visual-inertial navigation system for large-scale drift-free aerial state estimation,” IEEE Trans. Robot., 2025