pith. sign in

arxiv: 2503.07520 · v5 · submitted 2025-03-10 · 💻 cs.CV · cs.IR

From Limited Labels to Open Domains:An Efficient Learning Method for Drone-view Geo-Localization

Pith reviewed 2026-05-23 00:28 UTC · model grok-4.3

classification 💻 cs.CV cs.IR
keywords drone-view geo-localizationlimited supervisioncross-domain transfercontrastive learningfeature invarianceunpaired datapseudo-label reliability
0
0 comments X

The pith

CDIKTNet uses a small set of paired drone-satellite images to initialize invariant features that reduce confusion when learning from mostly unpaired data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that traditional supervised drone-view geo-localization requires extensive paired training data and struggles when moving to new domains, while unsupervised methods generate unreliable pseudo-labels because similar-looking features appear at different locations. It shows that a cross-domain invariance sub-network can extract structural and spatial invariance from limited paired examples to set up a shared feature space for unpaired images. A cross-domain transfer sub-network then applies dual-path contrastive learning to refine subspaces while keeping them consistent. This closed-loop design yields state-of-the-art results under full supervision and outperforms existing unsupervised approaches in few-shot and cross-domain settings without needing fresh paired data for each new domain.

Core claim

CDIKTNet, built from a cross-domain invariance sub-network (CDIS) and a cross-domain transfer sub-network (CDTS), learns cross-view structural and spatial invariance from a small amount of paired data that serves as prior knowledge; this endows the shared feature space of unpaired data with similar implicit cross-view correlations at initialization, which alleviates feature confusion, after which the CDTS employs dual-path contrastive learning to further optimize each subspace while preserving consistency.

What carries the argument

The cross-domain invariant knowledge transfer network (CDIKTNet) whose CDIS sub-network learns invariance from limited paired data to initialize the feature space and whose CDTS sub-network applies dual-path contrastive learning to optimize subspaces while maintaining shared-space consistency.

If this is right

  • The method reaches state-of-the-art accuracy under full supervision compared with existing supervised drone-view geo-localization approaches.
  • It surpasses current unsupervised methods when only few-shot paired data or cross-domain initialization is available.
  • Deployment to a new domain no longer requires collecting and retraining on new paired data.
  • Feature confusion from geographical similarity and spatial continuity is reduced by the initial invariance prior before contrastive optimization begins.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same limited-pair initialization pattern could be tested on other cross-view matching problems such as satellite-to-ground retrieval or multi-modal image registration.
  • If the choice of which few pairs to label proves critical, active selection strategies might further lower the labeling budget while preserving the reported gains.
  • The closed-loop invariance-plus-contrastive structure suggests hybrid training may scale to additional sensor modalities without requiring domain-specific paired collections.

Load-bearing premise

A small amount of paired data can serve as prior knowledge that endows the shared feature space of unpaired data with similar implicit cross-view correlations at initialization, thereby alleviating feature confusion caused by geographical similarity and spatial continuity.

What would settle it

An experiment in which the paired-data initialization step is removed or replaced by random initialization and performance on unpaired cross-domain test sets remains equal to or better than the full CDIKTNet would falsify the central claim.

Figures

Figures reproduced from arXiv: 2503.07520 by Guoqi Li, Hai-Jun Rong, Jiawei Lang, Zhao-Xu Yang, Zhongwei Chen.

Figure 1
Figure 1. Figure 1: CDIKTNet VS. State-of-the-Art Supervised Methods. (A) The performance variations of the representative method Sample4Geo[1] and CDIKTNet under different proportions of paired data are analyzed in the drone→satellite scenario using University-1652[2] and SUES-200[3] datasets. The evaluation on SUES-200 is performed at an altitude of 150m. It is evident that Sample4Geo generally underperforms CDIKTNet. Speci… view at source ↗
Figure 2
Figure 2. Figure 2: CDIKTNet vs. State-of-the-Art Unsupervised Methods. (a) Spatial continuity causes visually similar patterns from different locations. Unsuper￾vised methods often encounter feature confusion, which results in inaccurate pseudo-labels and drives negative optimization. (b) CDIKTNet leverages a small amount of paired data at initialization to learn cross-view invariance as prior knowledge. Based on this, it ob… view at source ↗
Figure 3
Figure 3. Figure 3: Pipeline Overview. The proposed network pipeline consists of two sub-networks. The cross-domain invariance sub-network is designed to learn feature representations that exhibit structural invariance and spatial invariance. The cross-domain transfer sub-network utilizes the invariant knowledge learned by the cross-domain invariance sub-network as a feature representation space to further explore the latent … view at source ↗
Figure 5
Figure 5. Figure 5: As shown in Fig. 5, increasing the momentum factor [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Influence of CDIS on Convergence, Performance, and Training Efficiency. (a) Loss curves and statistical summaries for baseline vs. baseline with CDIS on University-1652 in setting i. (b) Comparison of performance and training time between baseline and baseline with CDIS on University￾1652 in setting i [PITH_FULL_IMAGE:figures/full_fig_p009_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Similarity Distribution and Feature Space Visualization on University-1652 . (a–e) show the similarity distribution visualization. The red-shaded area indicates the overlapping region between the positive and negative distributions. (f-j) show the distribution of feature embeddings in the 2D shared feature space, where circles and triangles in different colors denote drone-view and satellite-view. A total … view at source ↗
Figure 8
Figure 8. Figure 8: Attention Heatmap Visualization. The attention responses generated by the proposed SIEL module are visualized. Red regions correspond to stronger attention responses, whereas blue regions denote weaker responses. Influence of CDIS on Convergence, Performance, and Training Efficiency. To further evaluate the influence of CDIS on convergence, the training behaviors of the baseline and baseline with CDIS are … view at source ↗
read the original abstract

Traditional supervised drone-view geo-localization (DVGL) methods heavily depend on paired training data and encounter difficulties in learning cross-view correlations from unpaired data. Moreover, when deployed in a new domain, these methods require obtaining the new paired data and subsequent retraining for model adaptation, which significantly increases computational overhead. Existing unsupervised methods have enabled to generate pseudo-labels based on cross-view similarity to infer the pairing relationships. However, geographical similarity and spatial continuity often cause visually analogous features at different geographical locations. The feature confusion compromises the reliability of pseudo-label generation, where incorrect pseudo-labels drive negative optimization. Given these challenges inherent in both supervised and unsupervised DVGL methods, we propose a novel cross-domain invariant knowledge transfer network (CDIKTNet) with limited supervision, whose architecture consists of a cross-domain invariance sub-network (CDIS) and a cross-domain transfer sub-network (CDTS). This architecture facilitates a closed-loop framework for invariance feature learning and knowledge transfer. The CDIS is designed to learn cross-view structural and spatial invariance from a small amount of paired data that serves as prior knowledge. It endows the shared feature space of unpaired data with similar implicit cross-view correlations at initialization, which alleviates feature confusion. Based on this, the CDTS employs dual-path contrastive learning to further optimize each subspace while preserving consistency in a shared feature space. Extensive experiments demonstrate that CDIKTNet achieves state-of-the-art performance under full supervision compared with those supervised methods, and further surpasses existing unsupervised methods in both few-shot and cross-domain initialization.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes CDIKTNet for drone-view geo-localization under limited supervision. It consists of a Cross-domain Invariance Sub-network (CDIS) that learns structural and spatial invariance from a small amount of paired data to initialize the shared feature space of unpaired data with implicit cross-view correlations (reducing confusion from geographical similarity and spatial continuity), followed by a Cross-domain Transfer Sub-network (CDTS) that applies dual-path contrastive learning while preserving consistency. The authors claim this closed-loop architecture yields SOTA performance versus fully supervised methods and outperforms existing unsupervised methods in few-shot and cross-domain settings.

Significance. If the initialization mechanism in CDIS demonstrably transfers useful cross-view correlations and the reported gains hold under rigorous controls, the work would meaningfully reduce annotation requirements for DVGL while enabling better domain adaptation, bridging supervised and unsupervised regimes in a practically relevant computer vision task.

major comments (2)
  1. [Abstract and §3] Abstract and §3 (CDIS description): the central claim that limited paired data 'endows the shared feature space of unpaired data with similar implicit cross-view correlations at initialization, which alleviates feature confusion' is load-bearing for superiority over unsupervised baselines, yet no direct evidence (pre/post-initialization cross-view similarity metrics, feature confusion quantification, or t-SNE visualizations) is supplied to show the transfer imparts task-specific correlations rather than generic alignment.
  2. [Experiments] Experiments section: the abstract asserts 'extensive experiments' establishing SOTA under full supervision and gains in few-shot/cross-domain settings, but without reported dataset splits, training protocols, full quantitative tables, or ablations isolating the CDIS initialization effect, it is impossible to verify that performance differences arise from the proposed mechanism rather than pseudo-label noise or evaluation choices.
minor comments (1)
  1. [Abstract] The abstract contains a minor grammatical issue: 'Existing unsupervised methods have enabled to generate pseudo-labels' should be rephrased for clarity (e.g., 'have enabled generation of pseudo-labels').

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which highlight areas where additional evidence and transparency can strengthen the manuscript. We address each major comment below and will incorporate revisions to provide the requested support for our claims.

read point-by-point responses
  1. Referee: [Abstract and §3] Abstract and §3 (CDIS description): the central claim that limited paired data 'endows the shared feature space of unpaired data with similar implicit cross-view correlations at initialization, which alleviates feature confusion' is load-bearing for superiority over unsupervised baselines, yet no direct evidence (pre/post-initialization cross-view similarity metrics, feature confusion quantification, or t-SNE visualizations) is supplied to show the transfer imparts task-specific correlations rather than generic alignment.

    Authors: We agree that direct empirical support for the initialization effect would strengthen the central claim. In the revised manuscript we will add t-SNE visualizations of the shared feature space before and after CDIS initialization on the same unpaired samples, together with quantitative cross-view similarity metrics (e.g., average cosine similarity between matched cross-view pairs) and a simple feature-confusion measure (e.g., intra-class vs. inter-class nearest-neighbor error). These additions will demonstrate that the limited paired data imparts task-specific correlations beyond generic alignment. revision: yes

  2. Referee: [Experiments] Experiments section: the abstract asserts 'extensive experiments' establishing SOTA under full supervision and gains in few-shot/cross-domain settings, but without reported dataset splits, training protocols, full quantitative tables, or ablations isolating the CDIS initialization effect, it is impossible to verify that performance differences arise from the proposed mechanism rather than pseudo-label noise or evaluation choices.

    Authors: We acknowledge that the current experimental section lacks sufficient detail for independent verification. The revised manuscript will include: (i) explicit train/validation/test splits for all datasets, (ii) complete hyper-parameter and training-protocol descriptions, (iii) full quantitative tables with all baselines and our method under identical settings, and (iv) dedicated ablation tables that isolate the CDIS initialization (e.g., CDTS-only vs. CDIS+CDTS) while controlling for pseudo-label generation. These changes will allow readers to attribute performance differences to the proposed mechanism. revision: yes

Circularity Check

0 steps flagged

No significant circularity; claims rest on proposed architecture and empirical validation

full rationale

The paper introduces CDIKTNet with CDIS (learning invariance from limited paired data as prior knowledge to initialize unpaired features) and CDTS (dual-path contrastive learning). Performance superiority is asserted via experiments against supervised and unsupervised baselines in full, few-shot, and cross-domain settings. No equations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text that would reduce the central claims to tautological inputs by construction. The initialization mechanism is a design hypothesis whose effect is externally tested rather than presupposed.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 2 invented entities

Review performed on abstract only; the method introduces two new sub-networks whose effectiveness rests on domain assumptions about feature initialization and contrastive consistency, with no explicit free parameters or invented physical entities named.

axioms (2)
  • domain assumption A small amount of paired data supplies sufficient prior knowledge to initialize cross-view correlations in unpaired data and thereby reduce feature confusion
    Explicitly invoked in the description of the CDIS sub-network
  • domain assumption Dual-path contrastive learning can simultaneously optimize each subspace while preserving consistency in the shared feature space
    Stated as the mechanism of the CDTS sub-network
invented entities (2)
  • Cross-domain invariance sub-network (CDIS) no independent evidence
    purpose: Learn structural and spatial invariance from limited paired data to initialize unpaired feature space
    New architectural component introduced by the paper
  • Cross-domain transfer sub-network (CDTS) no independent evidence
    purpose: Apply dual-path contrastive learning to refine subspaces while keeping shared-space consistency
    New architectural component introduced by the paper

pith-pipeline@v0.9.0 · 5825 in / 1616 out tokens · 66353 ms · 2026-05-23T00:28:16.033958+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Satellite-Free Training for Drone-View Geo-Localization

    cs.CV 2026-04 conditional novelty 7.0

    A satellite-free training framework reconstructs 3D drone scenes via Gaussian splatting, generates geometry-normalized pseudo-orthophotos, and aggregates DINOv3 features with a Fisher vector model trained only on dron...

Reference graph

Works this paper leans on

59 extracted references · 59 canonical work pages · cited by 1 Pith paper · 1 internal anchor

  1. [1]

    Sample4Geo: Hard negative sampling for cross-view geo-localisation,

    F. Deuser, K. Habel, and N. Oswald, “Sample4Geo: Hard negative sampling for cross-view geo-localisation,” inProc. IEEE Conf. Comput. Vis. Pattern Recognit., 2023, pp. 16 847–16 856

  2. [2]

    University-1652: A multi-view multi- source benchmark for drone-based geo-localization,

    Z. Zheng, Y . Wei, and Y . Yang, “University-1652: A multi-view multi- source benchmark for drone-based geo-localization,” inProc. ACM Int. Conf. Multimedia, 2020, p. 1395–1403

  3. [3]

    SUES-200: A multi-height multi-scene cross-view image benchmark across drone and satellite,

    R. Zhu, L. Yin, M. Yang, F. Wu, Y . Yang, and W. Hu, “SUES-200: A multi-height multi-scene cross-view image benchmark across drone and satellite,”IEEE Trans. Circuits Syst. Video Technol., vol. 33, no. 9, pp. 4825–4839, 2023

  4. [4]

    BEV-CV: Birds-eye-view transform for cross-view geo-localisation,

    T. Shore, S. Hadfield, and O. Mendez, “BEV-CV: Birds-eye-view transform for cross-view geo-localisation,” inProc. IEEE/RSJ Int. Conf. Intell. Robots Syst., 2024, pp. 11 048–11 055

  5. [5]

    Ground-to-aerial image geo-localization with a hard exemplar reweighting triplet loss,

    S. Cai, Y . Guo, S. Khan, J. Hu, and G. Wen, “Ground-to-aerial image geo-localization with a hard exemplar reweighting triplet loss,” inProc. IEEE Int. Conf. Comput. Vis., 2019, pp. 8391–8400

  6. [6]

    Multi-object tracking meets moving uav,

    S. Liu, X. Li, H. Lu, and Y . He, “Multi-object tracking meets moving uav,” inProc. IEEE Conf. Comput. Vis. Pattern Recognit., 2022, pp. 8876–8885

  7. [7]

    Pvt++: A simple end-to-end latency-aware visual tracking framework,

    B. Li, Z. Huang, J. Ye, Y . Li, S. Scherer, H. Zhao, and C. Fu, “Pvt++: A simple end-to-end latency-aware visual tracking framework,” inProc. IEEE Int. Conf. Comput. Vis., 2023, pp. 10 006–10 016

  8. [8]

    Multiple- environment self-adaptive network for aerial-view geo-localization,

    T. Wang, Z. Zheng, Y . Sun, C. Yan, Y . Yang, and T.-S. Chua, “Multiple- environment self-adaptive network for aerial-view geo-localization,” Pattern Recognit., vol. 152, p. 110363, 2024

  9. [9]

    CAMP: Across-view geo-localization method using contrastive attributes mining and position-aware partitioning,

    Q. Wu, Y . Wan, Z. Zheng, Y . Zhang, G. Wang, and Z. Zhao, “CAMP: Across-view geo-localization method using contrastive attributes mining and position-aware partitioning,”IEEE Trans. Geosci. Remote Sens., vol. 62, pp. 1–14, 2024

  10. [10]

    Multilevel embedding and alignment network with consistency and invariance learning for cross- view geo-localization,

    Z. Chen, Z.-X. Yang, and H.-J. Rong, “Multilevel embedding and alignment network with consistency and invariance learning for cross- view geo-localization,”IEEE Trans. Geosci. Remote Sens., vol. 63, pp. 1–15, 2025

  11. [11]

    MCCG: A convnext- based multiple-classifier method for cross-view geo-localization,

    T. Shen, Y . Wei, L. Kang, S. Wan, and Y .-H. Yang, “MCCG: A convnext- based multiple-classifier method for cross-view geo-localization,”IEEE Trans. Circuits Syst. Video Technol., vol. 34, no. 3, pp. 1456–1468, 2023

  12. [12]

    CCR: A counterfactual causal reasoning- based method for cross-view geo-localization,

    H. Du, J. He, and Y . Zhao, “CCR: A counterfactual causal reasoning- based method for cross-view geo-localization,”IEEE Trans. Circuits Syst. Video Technol., vol. 34, no. 11, pp. 11 630–11 643, 2024

  13. [13]

    Enhancing cross-view geo-localization with domain alignment and scene consistency,

    P. Xia, Y . Wan, Z. Zheng, Y . Zhang, and J. Deng, “Enhancing cross-view geo-localization with domain alignment and scene consistency,”IEEE Trans. Circuits Syst. Video Technol., pp. 1–12, 2024. 12

  14. [14]

    Unsu- pervised multiview uav image geolocalization via iterative rendering,

    H. Li, C. Xu, W. Yang, L. Mi, H. Yu, H. Zhang, and G.-S. Xia, “Unsu- pervised multiview uav image geolocalization via iterative rendering,” IEEE Trans. Geosci. Remote Sens., vol. 63, pp. 1–15, 2025

  15. [15]

    Learning cross- view visual geo-localization without ground truth,

    H. Li, C. Xu, W. Yang, H. Yu, and G.-S. Xia, “Learning cross- view visual geo-localization without ground truth,”IEEE Trans. Geosci. Remote Sens., vol. 62, pp. 1–17, 2024

  16. [16]

    Lending orientation to neural networks for cross-view geo-localization,

    L. Liu and H. Li, “Lending orientation to neural networks for cross-view geo-localization,” inProc. IEEE Conf. Comput. Vis. Pattern Recognit., 2019, p. 5624–5633

  17. [17]

    Beyond geo-localization: Fine-grained orientation of street-view images by cross-view matching with satellite imagery,

    W. Hu, Y . Zhang, Y . Liang, Y . Yin, A. Georgescu, A. Tran, H. Kruppa, S.-K. Ng, and R. Zimmermann, “Beyond geo-localization: Fine-grained orientation of street-view images by cross-view matching with satellite imagery,” inProc. ACM Int. Conf. Multimedia, 2022, p. 6155–6164

  18. [18]

    Optimal feature transport for cross-view image geo-localization,

    Y . Shi, X. Yu, L. Liu, T. Zhang, and H. Li, “Optimal feature transport for cross-view image geo-localization,” inProc. AAAI Conf. Artif. Intell., vol. 34, 2020, pp. 11 990–11 997

  19. [19]

    Spatial-aware feature aggregation for image based cross-view geo-localization,

    Y . Shi, L. Liu, X. Yu, and H. Li, “Spatial-aware feature aggregation for image based cross-view geo-localization,”in Proc. Adv. Neural Inf. Process. Syst., vol. 32, p. 10090–10100, 2019

  20. [20]

    Cross-view geo- localization via learning disentangled geometric layout correspondence,

    X. Zhang, X. Li, W. Sultani, Y . Zhou, and S. Wshah, “Cross-view geo- localization via learning disentangled geometric layout correspondence,” inProc. AAAI Conf. Artif. Intell., vol. 37, no. 3, 2023, pp. 3480–3488

  21. [21]

    Bridging viewpoints in cross-view geo-localization with siamese vision trans- former,

    W.-J. Ahn, S.-Y . Park, D.-S. Pae, H.-D. Choi, and M.-T. Lim, “Bridging viewpoints in cross-view geo-localization with siamese vision trans- former,”IEEE Trans. Geosci. Remote Sens., vol. 62, pp. 1–12, 2024

  22. [22]

    Bridging the domain gap for ground-to-aerial image matching,

    K. Regmi and M. Shah, “Bridging the domain gap for ground-to-aerial image matching,” inProc. IEEE Int. Conf. Comput. Vis., 2019, pp. 470– 479

  23. [23]

    Coming down to earth: Satellite-to-street view synthesis for geo-localization,

    A. Toker, Q. Zhou, M. Maximov, and L. Leal-Taixe, “Coming down to earth: Satellite-to-street view synthesis for geo-localization,” inProc. IEEE Conf. Comput. Vis. Pattern Recognit., 2021, pp. 6488–6497

  24. [24]

    Generative adversarial nets,

    I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y . Bengio, “Generative adversarial nets,”in Proc. Adv. Neural Inf. Process. Syst., vol. 27, p. 2672–2680, 2014

  25. [25]

    UA V-satellite view synthesis for cross-view geo-localization,

    X. Tian, J. Shao, D. Ouyang, and H. T. Shen, “UA V-satellite view synthesis for cross-view geo-localization,”IEEE Trans. Circuits Syst. Video Technol., vol. 32, no. 7, pp. 4804–4815, 2021

  26. [26]

    Cross-modality image matching network with modality-invariant feature representation for airborne-ground thermal infrared and visible datasets,

    S. Cui, A. Ma, Y . Wan, Y . Zhong, B. Luo, and M. Xu, “Cross-modality image matching network with modality-invariant feature representation for airborne-ground thermal infrared and visible datasets,”IEEE Trans. Geosci. Remote Sens., vol. 60, pp. 1–14, 2021

  27. [27]

    Multibranch joint representation learning based on information fusion strategy for cross-view geo-localization,

    F. Ge, Y . Zhang, Y . Liu, G. Wang, S. Coleman, D. Kerr, and L. Wang, “Multibranch joint representation learning based on information fusion strategy for cross-view geo-localization,”IEEE Trans. Geosci. Remote Sens., vol. 62, pp. 1–16, 2024

  28. [28]

    Direction-guided multi-scale feature fusion network for geo- localization,

    H. Lv, H. Zhu, R. Zhu, F. Wu, C. Wang, M. Cai, and K. Zhang, “Direction-guided multi-scale feature fusion network for geo- localization,”IEEE Trans. Geosci. Remote Sens., vol. 62, pp. 1–13, 2024

  29. [29]

    Visible-infrared person re-identification via cross-modality interaction transformer,

    Y . Feng, J. Yu, F. Chen, Y . Ji, F. Wu, S. Liu, and X.-Y . Jing, “Visible-infrared person re-identification via cross-modality interaction transformer,”IEEE Trans. Multimedia, vol. 25, pp. 7647–7659, 2023

  30. [30]

    Dsaf: Dual space alignment framework for visible-infrared person re-identification,

    Y . Jiang, X. Cheng, H. Yu, X. Liu, H. Chen, and G. Zhao, “Dsaf: Dual space alignment framework for visible-infrared person re-identification,” IEEE Trans. Multimedia, vol. 27, pp. 5591–5603, 2025

  31. [31]

    Cross-modality semantic consistency learning for visible-infrared per- son re-identification,

    M. Liu, Z. Zhang, Y . Bian, X. Wang, Y . Sun, B. Zhang, and Y . Wang, “Cross-modality semantic consistency learning for visible-infrared per- son re-identification,”IEEE Trans. Multimedia, vol. 27, pp. 568–580, 2025

  32. [32]

    Clip-driven semantic discovery network for visible-infrared person re-identification,

    X. Yu, N. Dong, L. Zhu, H. Peng, and D. Tao, “Clip-driven semantic discovery network for visible-infrared person re-identification,”IEEE Trans. Multimedia, vol. 27, pp. 4137–4150, 2025

  33. [33]

    Occlusion-aware feature recover model for occluded person re-identification,

    Y . Bian, M. Liu, X. Wang, Y . Tang, and Y . Wang, “Occlusion-aware feature recover model for occluded person re-identification,”IEEE Trans. Multimedia, vol. 26, pp. 5284–5295, 2024

  34. [34]

    Deep learning for person re-identification: A survey and outlook,

    M. Ye, J. Shen, G. Lin, T. Xiang, L. Shao, and S. C. Hoi, “Deep learning for person re-identification: A survey and outlook,”IEEE Trans. Pattern Anal. Mach. Intell., vol. 44, no. 6, pp. 2872–2893, 2021

  35. [35]

    Exploring the essence of relationships for scene graph generation via causal features enhancement network,

    H. Zhou, T. Luo, J. Zhang, and L. Liu, “Exploring the essence of relationships for scene graph generation via causal features enhancement network,”IEEE Trans. Pattern Anal. Mach. Intell., vol. 47, no. 8, pp. 6616–6630, 2025

  36. [36]

    Nonconvex and discriminative transfer subspace learning for unsupervised domain adaptation,

    Y . Liu and T. Luo, “Nonconvex and discriminative transfer subspace learning for unsupervised domain adaptation,”Front. Comput. Sci., vol. 19, no. 2, p. 192307, 2025

  37. [37]

    Cross-domain adaptive teacher for object detection,

    Y .-J. Li, X. Dai, C.-Y . Ma, Y .-C. Liu, K. Chen, B. Wu, Z. He, K. Kitani, and P. Vajda, “Cross-domain adaptive teacher for object detection,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2022, pp. 7581–7590

  38. [38]

    Semi-supervised learning with label proportion,

    N. Sun, T. Luo, W. Zhuge, H. Tao, C. Hou, and D. Hu, “Semi-supervised learning with label proportion,”IEEE Trans. Knowl. Data Eng., vol. 35, no. 1, pp. 877–890, 2021

  39. [39]

    Semi-supervised multiview feature selection with adaptive graph learning,

    B. Jiang, X. Wu, X. Zhou, Y . Liu, A. G. Cohn, W. Sheng, and H. Chen, “Semi-supervised multiview feature selection with adaptive graph learning,”IEEE Trans. Neural Networks Learn. Syst., vol. 35, no. 3, pp. 3615–3629, 2022

  40. [40]

    Cross-domain gradient discrepancy minimization for unsupervised domain adaptation,

    Z. Du, J. Li, H. Su, L. Zhu, and K. Lu, “Cross-domain gradient discrepancy minimization for unsupervised domain adaptation,” inProc. IEEE Conf. Comput. Vis. Pattern Recognit., 2021, pp. 3937–3946

  41. [41]

    Relation- preserving feature embedding for unsupervised person re-identification,

    X. Wang, M. Liu, F. Wang, J. Dai, A.-A. Liu, and Y . Wang, “Relation- preserving feature embedding for unsupervised person re-identification,” IEEE Trans. Multimedia, vol. 26, pp. 714–723, 2024

  42. [42]

    Dreamt: Diversity enlarged mu- tual teaching for unsupervised domain adaptive person re-identification,

    Y . Tao, J. Zhang, J. Hong, and Y . Zhu, “Dreamt: Diversity enlarged mu- tual teaching for unsupervised domain adaptive person re-identification,” IEEE Trans. Multimedia, vol. 25, pp. 4586–4597, 2023

  43. [43]

    Camera invariant feature learn- ing for unsupervised person re-identification,

    Z. Pang, L. Zhao, Q. Liu, and C. Wang, “Camera invariant feature learn- ing for unsupervised person re-identification,”IEEE Trans. Multimedia, vol. 25, pp. 6171–6182, 2023

  44. [44]

    Ice: Inter-instance contrastive encoding for unsupervised person re-identification,

    H. Chen, B. Lagadec, and F. Bremond, “Ice: Inter-instance contrastive encoding for unsupervised person re-identification,” inProc. IEEE Int. Conf. Comput. Vis., 2021, pp. 14 960–14 969

  45. [45]

    Intra-inter camera similarity for unsupervised person re-identification,

    S. Xuan and S. Zhang, “Intra-inter camera similarity for unsupervised person re-identification,” inProc. IEEE Conf. Comput. Vis. Pattern Recognit., 2021, pp. 11 926–11 935

  46. [46]

    Cluster contrast for unsupervised person re-identification,

    Z. Dai, G. Wang, W. Yuan, S. Zhu, and P. Tan, “Cluster contrast for unsupervised person re-identification,” inProc. Asian Conf. Comput. Vis., 2022, pp. 1142–1160

  47. [47]

    A density-based algorithm for discovering clusters in large spatial databases with noise,

    M. Ester, H.-P. Kriegel, J. Sander, X. Xuet al., “A density-based algorithm for discovering clusters in large spatial databases with noise,” inProc. ACM SIGKDD Int. Conf. Knowl. Discov. Data Min., vol. 96, no. 34, 1996, pp. 226–231

  48. [48]

    Augmented dual-contrastive aggregation learning for unsupervised visible-infrared person re- identification,

    B. Yang, M. Ye, J. Chen, and Z. Wu, “Augmented dual-contrastive aggregation learning for unsupervised visible-infrared person re- identification,” inProc. ACM Int. Conf. Multimedia, 2022, pp. 2843– 2851

  49. [49]

    Shallow-deep collaborative learning for unsupervised visible-infrared person re-identification,

    B. Yang, J. Chen, and M. Ye, “Shallow-deep collaborative learning for unsupervised visible-infrared person re-identification,” inProc. IEEE Conf. Comput. Vis. Pattern Recognit., 2024, pp. 16 870–16 879

  50. [50]

    A convnet for the 2020s,

    Z. Liu, H. Mao, C.-Y . Wu, C. Feichtenhofer, T. Darrell, and S. Xie, “A convnet for the 2020s,” inProc. IEEE Conf. Comput. Vis. Pattern Recognit., 2022, pp. 11 976–11 986

  51. [51]

    Squeeze-and-excitation networks,

    J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” inProc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 7132–7141

  52. [52]

    Representation Learning with Contrastive Predictive Coding

    A. v. d. Oord, Y . Li, and O. Vinyals, “Representation learning with contrastive predictive coding,”arXiv preprint arXiv:1807.03748, 2018

  53. [53]

    Augmented dual-contrastive aggregation learning for unsupervised visible-infrared person re- identification,

    B. Yang, M. Ye, J. Chen, and Z. Wu, “Augmented dual-contrastive aggregation learning for unsupervised visible-infrared person re- identification,” inACM MM, 2022, p. 2843–2851

  54. [54]

    The global k-means clustering algorithm,

    A. Likas, N. Vlassis, and J. J. Verbeek, “The global k-means clustering algorithm,”Pattern Recognit., vol. 36, no. 2, pp. 451–461, 2003

  55. [55]

    TransFG: A cross- view geo-localization of satellite and UA Vs imagery pipeline using transformer-based feature aggregation and gradient guidance,

    H. Zhao, K. Ren, T. Yue, C. Zhang, and S. Yuan, “TransFG: A cross- view geo-localization of satellite and UA Vs imagery pipeline using transformer-based feature aggregation and gradient guidance,”IEEE Trans. Geosci. Remote Sens., vol. 62, pp. 1–12, 2024

  56. [56]

    Revisit anything: Visual place recognition via image segment retrieval,

    K. Garg, S. S. Puligilla, S. Kolathaya, M. Krishna, and S. Garg, “Revisit anything: Visual place recognition via image segment retrieval,” inProc. Eur . Conf. Comput. Vis., 2024, pp. 326–343

  57. [57]

    Anyloc: Towards universal visual place recognition,

    N. Keetha, A. Mishra, J. Karhade, K. M. Jatavallabhula, S. Scherer, M. Krishna, and S. Garg, “Anyloc: Towards universal visual place recognition,”IEEE Rob. Autom. Lett, vol. 9, no. 2, pp. 1286–1293, 2023

  58. [58]

    Vision- based uav self-positioning in low-altitude urban environments,

    M. Dai, E. Zheng, Z. Feng, L. Qi, J. Zhuang, and W. Yang, “Vision- based uav self-positioning in low-altitude urban environments,”IEEE Trans. Image Process., vol. 33, pp. 493–508, 2024

  59. [59]

    Visualizing data using t-SNE

    L. Van der Maaten and G. Hinton, “Visualizing data using t-SNE.”J. Mach. Learn. Res, vol. 9, no. 11, pp. 2579–2605, 2008