pith. machine review for the scientific record.

arxiv: 2604.12159 · v1 · submitted 2026-04-14 · 💻 cs.CV

Recognition: unknown

VidTAG: Temporally Aligned Video to GPS Geolocalization with Denoising Sequence Prediction at a Global Scale

Mubarak Shah, Parth Parag Kulkarni, Prakash Chandra Chhipa, Rohit Gupta

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 15:32 UTC · model grok-4.3

classification 💻 cs.CV
keywords video geolocalization · GPS retrieval · temporal alignment · denoising sequence prediction · dual encoder · self-supervised features · language-aligned features · global localization

The pith

VidTAG aligns video frames to GPS coordinates using temporal alignment and denoising for consistent global trajectories.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents VidTAG, a dual-encoder framework for localizing videos to precise GPS points and trajectories on a global scale. It matches video frames to GPS coordinates using self-supervised visual features alongside language-aligned features. The TempGeo module aligns the frame embeddings over time to reduce inconsistencies, while the GeoRefiner module refines the GPS predictions through an encoder-decoder denoising process. This matters because it addresses both the limitations of coarse classification methods and the impracticality of building global image databases for retrieval, since a gallery of GPS coordinates is far easier to assemble. The model demonstrates better results than existing approaches on the relevant datasets.

Core claim

VidTAG performs frame-to-GPS retrieval with a dual-encoder architecture that leverages both self-supervised and language-aligned features. To handle temporal inconsistencies in video, it introduces the TempGeo module to align frame embeddings and the GeoRefiner module, an encoder-decoder that refines the GPS features using those aligned embeddings, enabling the generation of temporally consistent trajectories at global scale.

What carries the argument

Dual-encoder retrieval system with the TempGeo module for aligning frame embeddings temporally and the GeoRefiner module as an encoder-decoder for denoising and refining the GPS sequence predictions.
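The retrieval machinery can be traced in a few lines. Everything below is a hypothetical stand-in: the real VidTAG uses learned dual encoders, a learned location encoder, and learned TempGeo/GeoRefiner modules, none of whose internals appear in this excerpt. Random projections and a moving average serve only to illustrate the data flow from frames to a GPS-gallery lookup.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_frames(frames, dim=64):
    """Placeholder for the dual encoder: map each frame to an embedding."""
    return frames @ rng.normal(size=(frames.shape[1], dim))

def encode_gps(coords, dim=64):
    """Placeholder for the location encoder: map (lat, lon) pairs to embeddings."""
    return coords @ rng.normal(size=(2, dim))

def temporal_align(emb, window=3):
    """Crude stand-in for TempGeo: smooth embeddings over a temporal window."""
    kernel = np.ones(window) / window
    return np.stack(
        [np.convolve(emb[:, d], kernel, mode="same") for d in range(emb.shape[1])],
        axis=1,
    )

def retrieve(frame_emb, gps_emb):
    """Frame-to-GPS retrieval: nearest GPS embedding by cosine similarity."""
    f = frame_emb / np.linalg.norm(frame_emb, axis=1, keepdims=True)
    g = gps_emb / np.linalg.norm(gps_emb, axis=1, keepdims=True)
    return (f @ g.T).argmax(axis=1)  # gallery index per frame

frames = rng.normal(size=(8, 128))               # 8 frames of toy features
gallery = rng.uniform(-90, 90, size=(1000, 2))   # toy GPS coordinate gallery
idx = retrieve(temporal_align(encode_frames(frames)), encode_gps(gallery))
trajectory = gallery[idx]                        # one (lat, lon) per frame
```

The point of the sketch is structural: because the gallery is just coordinates, scaling it up means adding rows to a small array, not collecting geo-referenced images.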

If this is right

  • Supports fine-grained GPS localization and trajectory mapping for videos rather than only coarse city-level classification.
  • Enables practical global-scale operation because constructing a GPS coordinate gallery is straightforward and low-cost compared to image galleries.
  • Generates temporally consistent predictions across video frames.
  • Improves accuracy by 20 percent at the 1 km threshold compared to GeoCLIP on Mapillary and GAMa datasets.
  • Outperforms current state-of-the-art by 25 percent on the CityGuessr68k benchmark for coarse-grained video geolocalization.
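To make "temporally consistent predictions" concrete, here is a toy refinement pass. The paper's GeoRefiner is a learned encoder-decoder; the median filter below is not that model, only an illustration of what denoising a noisy per-frame GPS sequence means, using an outlier-frame failure mode of the kind Figure 4 depicts.

```python
import numpy as np

def median_refine(traj, window=5):
    """Refine a per-frame trajectory by replacing each point with the
    coordinate-wise median of its temporal neighbourhood.
    traj: (T, 2) array of per-frame (lat, lon) predictions."""
    half = window // 2
    out = np.empty_like(traj)
    for t in range(len(traj)):
        lo, hi = max(0, t - half), min(len(traj), t + half + 1)
        out[t] = np.median(traj[lo:hi], axis=0)
    return out

# A smooth trajectory with one wildly wrong frame (a Phase I-style error).
traj = np.array([
    [48.85, 2.35],
    [48.86, 2.36],
    [10.00, 10.00],   # outlier frame
    [48.88, 2.38],
    [48.89, 2.39],
])
refined = median_refine(traj)
# The outlier at frame 2 is pulled back toward its neighbours.
```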

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Applications in forensics and social media verification could benefit from such precise video location data.
  • The approach of combining alignment and denoising might apply to other video analysis tasks requiring sequence consistency.
  • Testing on more diverse video sources could reveal how well it handles extreme variability in conditions.

Load-bearing premise

That self-supervised and language-aligned frame features combined with TempGeo alignment and GeoRefiner denoising will generalize across the full range of global video variability without overfitting to specific datasets.

What would settle it

Running the model on a new, unseen global-scale video dataset with varied conditions and checking whether the claimed improvements in accuracy at the 1 km threshold and in trajectory consistency hold up.
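The 1 km threshold metric itself is standard and easy to pin down. The paper's exact evaluation code is not shown here, but the conventional reading is the fraction of predictions whose great-circle distance to ground truth falls under the threshold:

```python
import math

def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(a))  # mean Earth radius 6371 km

def accuracy_at(preds, gts, threshold_km=1.0):
    """Fraction of predictions within threshold_km of their ground truth."""
    hits = sum(haversine_km(p, g) <= threshold_km for p, g in zip(preds, gts))
    return hits / len(preds)

# Toy check: one prediction ~0.5 km off, one ~5 km off.
gts = [(48.8566, 2.3522), (48.8566, 2.3522)]
preds = [(48.8611, 2.3522), (48.9016, 2.3522)]
print(accuracy_at(preds, gts, 1.0))  # 0.5
```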

Figures

Figures reproduced from arXiv: 2604.12159 by Mubarak Shah, Parth Parag Kulkarni, Prakash Chandra Chhipa, Rohit Gupta.

Figure 1
Figure 1: Frame-wise Video to GPS Sequence Prediction. Our model geolocalizes every frame in a video on a global scale to give a temporally consistent trajectory. view at source ↗
Figure 2
Figure 2: Temporal Inconsistency Issue. Geolocalizing each frame separately leads to incoherent trajectories. The problem becomes worse as the scale increases. Yellow: GT, Red: Predicted. view at source ↗
Figure 3
Figure 3 [PITH_FULL_IMAGE:figures/full_fig_p003_3.png] view at source ↗
Figure 4
Figure 4: Phase I failure modes. Most commonly observed noise patterns in Phase I predictions. Phase II training refines GPS predictions to address these issues. Yellow: GT, Red: Predicted. view at source ↗
Figure 5
Figure 5: Inference Procedure. Frames are passed through the dual encoders followed by TempGeo to generate frame-wise embeddings, while GPS embeddings are obtained from a gallery of coordinates via the location encoder. The frame-wise embeddings are compared with GPS embeddings to make initial predictions. These predictions, along with corresponding frame embeddings, are refined by the GeoRefiner model, which output… view at source ↗
Figure 6
Figure 6: Qualitative Results. Mapped near-perfect geolocalized trajectories predicted by VidTAG. Yellow: GT, Red: Predicted. view at source ↗
Figure 7
Figure 7: Performance vs Throughput Tradeoff. Our model is significantly better with only a minimal throughput decrease. view at source ↗
Figure 8
Figure 8: Distance Error vs Number of Frames. Median error is consistent, showing robustness of our model. [PITH_FULL_IMAGE:figures/full_fig_p014_8.png] view at source ↗
Figure 9
Figure 9: Feature Space Visualization comparison for outputs of GeoCLIP and VidTAG, shown as PCA plots. The first row shows the GPS spread of sequences in Melbourne and Helsinki; the second row shows the corresponding sequence-wise PCA plots for GeoCLIP, and the third row shows those for VidTAG. view at source ↗
Figure 10
Figure 10: Qualitative examples for trajectories of sequences in GAMa. [PITH_FULL_IMAGE:figures/full_fig_p019_10.png] view at source ↗
Figure 11
Figure 11: Confusion matrix heatmaps for the 24 most difficult cities in CityGuessr68k for the models VidTAG and CityGuessr. [PITH_FULL_IMAGE:figures/full_fig_p020_11.png] view at source ↗
Figure 12
Figure 12: Supplemental error analysis of VidTAG. (a)–(c): Breakdown of common model error types, highlighting dominant failure modes. (d) and (e): Different scales of error; predictions are within the locality in the vast majority of cases, in a different part of the city approximately 20% of the time, and off at region level in under 2% of cases. view at source ↗
Figure 13
Figure 13: Visualization of samples for which our model correctly predicts location within 500 m (0.5 km). [PITH_FULL_IMAGE:figures/full_fig_p022_13.png] view at source ↗
Figure 14
Figure 14: Visualization of samples for which our model correctly predicts location within 1 km but not within 500 m (0.5 km). [PITH_FULL_IMAGE:figures/full_fig_p022_14.png] view at source ↗
Figure 15
Figure 15: Visualization of samples for which our model correctly predicts location within 5 km but not within 1 km. [PITH_FULL_IMAGE:figures/full_fig_p022_15.png] view at source ↗
Figure 16
Figure 16: Visualization of samples for which our model correctly predicts location within 25 km but not within 5 km. [PITH_FULL_IMAGE:figures/full_fig_p022_16.png] view at source ↗
Figure 17
Figure 17: Visualization of samples for which our model cannot predict the location within 25 km. [PITH_FULL_IMAGE:figures/full_fig_p022_17.png] view at source ↗
Original abstract

The task of video geolocalization aims to determine the precise GPS coordinates of a video's origin and map its trajectory; with applications in forensics, social media, and exploration. Existing classification-based approaches operate at a coarse city-level granularity and fail to capture fine-grained details, while image retrieval methods are impractical on a global scale due to the need for extensive image galleries which are infeasible to compile. Comparatively, constructing a gallery of GPS coordinates is straightforward and inexpensive. We propose VidTAG, a dual-encoder framework that performs frame-to-GPS retrieval using both self-supervised and language-aligned features. To address temporal inconsistencies in video predictions, we introduce the TempGeo module, which aligns frame embeddings, and the GeoRefiner module, an encoder-decoder architecture that refines GPS features using the aligned frame embeddings. Evaluations on Mapillary (MSLS) and GAMa datasets demonstrate our model's ability to generate temporally consistent trajectories and outperform baselines, achieving a 20% improvement at the 1 km threshold over GeoCLIP. We also beat current State-of-the-Art by 25% on global coarse grained video geolocalization (CityGuessr68k). Our approach enables fine-grained video geolocalization and lays a strong foundation for future research. More details on the project webpage: https://parthpk.github.io/vidtag_webpage/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript presents VidTAG, a dual-encoder framework for fine-grained video-to-GPS geolocalization at global scale. It extracts self-supervised and language-aligned frame features, employs the TempGeo module to enforce temporal alignment of embeddings across video frames, and uses the GeoRefiner encoder-decoder (with denoising sequence prediction) to refine GPS coordinate outputs. The central claims are that this produces temporally consistent trajectories and yields a 20% improvement at the 1 km threshold over GeoCLIP on the Mapillary (MSLS) and GAMa datasets, plus a 25% gain over prior SOTA on the CityGuessr68k benchmark for coarse-grained global video geolocalization.

Significance. If the performance claims prove robust under proper controls, the work would be significant: it sidesteps the infeasibility of maintaining global image galleries by operating directly on compact GPS coordinate sets, while the temporal alignment and denoising components directly tackle inconsistency issues that plague frame-independent retrieval. This could advance practical applications in forensics, social media verification, and exploration. The project webpage is noted as a positive step toward reproducibility.

major comments (2)
  1. [Abstract] Abstract: The headline performance numbers (20% gain at 1 km over GeoCLIP on MSLS/GAMa; 25% on CityGuessr68k) are presented without error bars, ablation tables isolating TempGeo/GeoRefiner contributions versus base features, train/test split definitions, GPS gallery construction/indexing details at global scale, or geographic diversity controls. These omissions are load-bearing because the central claim is that the proposed modules generalize across global video variability; without them the reported outperformance cannot be verified or attributed.
  2. [Abstract] Abstract (and §4 Experiments, if present): No information is supplied on data exclusion rules, potential leakage between video sequences and GPS points, or cross-dataset controls. This directly affects the weakest assumption that self-supervised/language-aligned features plus TempGeo/GeoRefiner will transfer without dataset-specific overfitting.
minor comments (1)
  1. [Abstract] Abstract: The claim that 'constructing a gallery of GPS coordinates is straightforward and inexpensive' would benefit from a short quantitative statement on gallery size and query latency to support the scalability argument.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important aspects of reproducibility and experimental rigor that we will address in the revision. Below we respond point-by-point to the major comments.

Point-by-point responses
  1. Referee: [Abstract] Abstract: The headline performance numbers (20% gain at 1 km over GeoCLIP on MSLS/GAMa; 25% on CityGuessr68k) are presented without error bars, ablation tables isolating TempGeo/GeoRefiner contributions versus base features, train/test split definitions, GPS gallery construction/indexing details at global scale, or geographic diversity controls. These omissions are load-bearing because the central claim is that the proposed modules generalize across global video variability; without them the reported outperformance cannot be verified or attributed.

    Authors: We agree that the abstract, constrained by length, does not include these details. The full manuscript in Section 4 describes the train/test splits for MSLS, GAMa, and CityGuessr68k, provides ablation studies isolating TempGeo and GeoRefiner, and explains the GPS coordinate gallery construction (a compact set of points rather than images) with indexing via coordinate hashing for global scale. Geographic coverage is achieved through the multi-continental scope of the evaluation datasets. We will add error bars to all reported metrics, include a concise reference to the ablations and splits in the abstract, and expand the GPS gallery description in the revised version to make attribution of gains explicit. revision: yes

  2. Referee: [Abstract] Abstract (and §4 Experiments, if present): No information is supplied on data exclusion rules, potential leakage between video sequences and GPS points, or cross-dataset controls. This directly affects the weakest assumption that self-supervised/language-aligned features plus TempGeo/GeoRefiner will transfer without dataset-specific overfitting.

    Authors: We acknowledge the absence of explicit discussion on these points. In the revised manuscript we will add a new subsection under Experiments detailing: (i) data exclusion rules (removal of low-quality, duplicate, or temporally incomplete sequences), (ii) leakage prevention (ensuring no shared video sequences or GPS coordinates between train and test partitions, verified via sequence ID and coordinate hashing), and (iii) cross-dataset controls (training on one dataset and evaluating zero-shot on others). These additions will directly support the generalization claims. revision: yes
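The leakage checks promised in (ii) above are mechanical to implement. The sketch below is a reviewer's guess at what such a verification could look like; the record format, the sequence-ID scheme, and the rounding precision of the coordinate hash are all assumptions, not details from the manuscript.

```python
def coord_hash(lat, lon, precision=4):
    """Bucket a coordinate by rounding; precision=4 gives ~11 m cells.
    (A stand-in for whatever coordinate-hashing scheme the authors use.)"""
    return (round(lat, precision), round(lon, precision))

def check_no_leakage(train, test):
    """Each split is a list of (sequence_id, lat, lon) records.
    Returns test-set sequence IDs that leak by ID or by coordinate cell."""
    train_ids = {sid for sid, _, _ in train}
    train_cells = {coord_hash(lat, lon) for _, lat, lon in train}
    id_leaks = [sid for sid, _, _ in test if sid in train_ids]
    cell_leaks = [sid for sid, lat, lon in test
                  if coord_hash(lat, lon) in train_cells]
    return id_leaks, cell_leaks

# Toy splits: "seq_c" shares a coordinate cell with a training record.
train = [("seq_a", 40.7128, -74.0060), ("seq_b", 51.5074, -0.1278)]
test = [("seq_c", 40.7128, -74.0060), ("seq_d", 35.6762, 139.6503)]
ids, cells = check_no_leakage(train, test)  # ids == [], cells == ["seq_c"]
```

A check of this shape, run once per dataset pair, would substantiate the no-leakage claim rather than merely asserting it.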

Circularity Check

0 steps flagged

No significant circularity; empirical claims rest on external dataset evaluations.

full rationale

The paper introduces a dual-encoder retrieval framework augmented by TempGeo alignment and GeoRefiner denoising modules, then reports performance as measured improvements (20% at 1 km over GeoCLIP on MSLS/GAMa; 25% on CityGuessr68k) against independent baselines. No equations, self-definitional reductions, fitted-parameter predictions, or load-bearing self-citations appear in the provided text that would make any claimed result equivalent to its inputs by construction. The derivation chain consists of architectural choices evaluated on held-out external datasets and is therefore self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only view reveals no explicit free parameters, axioms, or invented entities beyond standard neural network components; relies on conventional self-supervised and language-aligned feature learning.


discussion (0)


Reference graph

Works this paper leans on

79 extracted references · 14 canonical work pages · 6 internal anchors

  1. [1] Guillaume Astruc, Nicolas Dufour, Ioannis Siglidis, Constantin Aronssohn, Nacim Bouia, Stephanie Fu, Romain Loiseau, Van Nguyen Nguyen, Charles Raude, Elliot Vincent, et al. OpenStreetView-5M: The many roads to global visual geolocation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21967–21977, 2024.

  2. [2] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923, 2025.

  3. [3] Gedas Bertasius, Heng Wang, and Lorenzo Torresani. Is space-time attention all you need for video understanding? In ICML, 2021.

  4. [4] Ron Campos, Ashmal Vayani, Parth Parag Kulkarni, Rohit Gupta, Aritra Dutta, and Mubarak Shah. Gaea: A geolocation aware conversational model. arXiv preprint arXiv:2503.16423, 2025.

  5. [5] Brandon Clark, Alec Kerrigan, Parth Parag Kulkarni, Vicente Vivanco Cepeda, and Mubarak Shah. Where we are and what we're looking at: Query based worldwide image geo-localization using hierarchies and scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23182–23190, 2023.

  6. [6] Fabian Deuser, Konrad Habel, and Norbert Oswald. Sample4Geo: Hard negative sampling for cross-view geo-localisation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 16847–16856, 2023.

  7. [7] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.

  8. [8] Thomas Eiter and Heikki Mannila. Computing discrete Fréchet distance. Technical Report CD-TR 94/64, Christian Doppler Laboratory for Expert Systems, 1994.

  9. [9] J. M. Facil, D. Olid, L. Montesano, and J. Civera. Condition-invariant multi-view place recognition. arXiv preprint arXiv:1902.09516, 2019.

  10. [10] Anshuman Gaharwar, Parth Parag Kulkarni, Joshua Dickey, and Mubarak Shah. Xi-Net: Transformer based seismic waveform reconstructor. In 2023 IEEE International Conference on Image Processing (ICIP), pages 2725–2729. IEEE, 2023.

  11. [11] S. Garg and M. Milford. SeqNet: Learning descriptors for sequence-based hierarchical place recognition. IEEE Robotics and Automation Letters, 6(3):4305–4312, 2021.

  12. [12] Lukas Haas, Silas Alberti, and Michal Skreta. Learning generalized zero-shot learners for open-domain image geolocalization. arXiv preprint arXiv:2302.00275, 2023.

  13. [13] Lukas Haas, Michal Skreta, Silas Alberti, and Chelsea Finn. PIGEON: Predicting image geolocations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12893–12902, 2024.

  14. [14] James Hays and Alexei A Efros. IM2GPS: Estimating geographic information from a single image. In 2008 IEEE Conference on Computer Vision and Pattern Recognition, pages 1–8. IEEE, 2008.

  15. [15] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

  16. [16] Sixing Hu, Mengdan Feng, Rang MH Nguyen, and Gim Hee Lee. CVM-Net: Cross-view matching network for image-based ground-to-aerial geo-localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7258–7267, 2018.

  17. [17] Shuiwang Ji, Wei Xu, Ming Yang, and Kai Yu. 3D convolutional neural networks for human action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(1):221–231, 2013.

  18. [18] Pengyue Jia, Yiding Liu, Xiaopeng Li, Xiangyu Zhao, Yuhao Wang, Yantong Du, Xiao Han, Xuetao Wei, Shuaiqiang Wang, and Dawei Yin. G3: An effective and adaptive framework for worldwide geolocalization using large multi-modality models. Advances in Neural Information Processing Systems, 37:53198–53221, 2024.

  19. [19] Pengyue Jia, Seongheon Park, Song Gao, Xiangyu Zhao, and Yixuan Li. GeoRanker: Distance-aware ranking for worldwide image geolocalization. arXiv preprint arXiv:2505.13731, 2025.

  20. [20] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

  21. [21] Parth Parag Kulkarni, Gaurav Kumar Nayak, and Mubarak Shah. CityGuessr: City-level video geo-localization on a global scale. In European Conference on Computer Vision, pages 293–311. Springer, 2024.

  22. [22] Martha Larson, Mohammad Soleymani, Guillaume Gravier, Bogdan Ionescu, and Gareth JF Jones. The benchmarking initiative for multimedia evaluation: MediaEval 2016. IEEE MultiMedia, 24(1):93–96, 2017.

  23. [23] Yann LeCun, Bernhard Boser, John Denker, Donnie Henderson, Richard Howard, Wayne Hubbard, and Lawrence Jackel. Handwritten digit recognition with a back-propagation network. Advances in Neural Information Processing Systems, 2, 1989.

  24. [24] M. Lewis. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461, 2019.

  25. [25] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems, 33:9459–9474, 2020.

  26. [26] Ling Li, Yu Ye, Bingchuan Jiang, and Wei Zeng. GeoReasoner: Geo-localization with reasoning in street views using a large vision-language model. In Forty-first International Conference on Machine Learning, 2024.

  27. [27] Liu Liu and Hongdong Li. Lending orientation to neural networks for cross-view geo-localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5624–5633, 2019.

  28. [28] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin Transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10012–10022, 2021.

  29. [29] R. Mereu et al. Learning sequential descriptors for sequence-based visual place recognition. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), 2022.

  30. [30] Li Mi, Chang Xu, Javiera Castillo-Navarro, Syrielle Montariol, Wen Yang, Antoine Bosselut, and Devis Tuia. ConGeo: Robust cross-view geo-localization across ground view variations. arXiv preprint arXiv:2403.13965, 2024.

  31. [31] Diganta Misra. Mish: A self regularized non-monotonic activation function. arXiv preprint arXiv:1908.08681, 2019.

  32. [32] Eric Muller-Budack, Kader Pustu-Iren, and Ralph Ewerth. Geolocation estimation of photos using a hierarchical model and scene classification. In Proceedings of the European Conference on Computer Vision (ECCV), pages 563–579, 2018.

  33. [33] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. DINOv2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023.

  34. [34] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. PyTorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems, 32, 2019.

  35. [35] Karl Pearson. LIII. On lines and planes of closest fit to systems of points in space. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 2(11):559–572, 1901.

  36. [36] Manu S Pillai, Mamshad Nayeem Rizve, and Mubarak Shah. GAReT: Cross-view video geolocalization with adapters and auto-regressive transformers. arXiv preprint arXiv:2408.02840, 2024.

  37. [37] Shraman Pramanick, Ewa M Nowara, Joshua Gleason, Carlos D Castillo, and Rama Chellappa. Where in the world is this image? Transformer-based geo-localization in the wild. In European Conference on Computer Vision, pages 196–. Springer, 2022.

  39. [39] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.

  40. [40] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.

  41. [41] Krishna Regmi and Ali Borji. Cross-view image synthesis using geometry-guided conditional GANs. Computer Vision and Image Understanding, 187:102788, 2019.

  42. [42] Krishna Regmi and Mubarak Shah. Bridging the domain gap for ground-to-aerial image matching. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 470–479, 2019.

  43. [43] Krishna Regmi and Mubarak Shah. Video geo-localization employing geo-temporal feature learning and GPS trajectory smoothing. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 12126–12135, 2021.

  44. [44] Hamid Rezatofighi, Nathan Tsoi, JunYoung Gwak, Amir Sadeghian, Ian Reid, and Silvio Savarese. Generalized intersection over union: A metric and a loss for bounding box regression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 658–666, 2019.

  45. [45] Bojan Šavrič, Tom Patterson, and Bernhard Jenny. The Equal Earth map projection. International Journal of Geographical Information Science, 33(3):454–465, 2019.

  46. [46] Paul Hongsuck Seo, Tobias Weyand, Jack Sim, and Bohyung Han. CPlaNet: Enhancing image geolocalization by combinatorial partitioning of maps. In Proceedings of the European Conference on Computer Vision (ECCV), pages 536–551, 2018.

  47. [47] David G Shatwell, Ishan Rajendrakumar Dave, Sirnam Swetha, and Mubarak Shah. GT-Loc: Unifying when and where in images through a joint embedding space. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1–11, 2025.

  48. [48] Yujiao Shi, Xin Yu, Dylan Campbell, and Hongdong Li. Where am I looking at? Joint location and orientation estimation by cross-view matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4064–4072, 2020.

  49. [49] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

  50. [50] Matthew Tancik, Pratul Srinivasan, Ben Mildenhall, Sara Fridovich-Keil, Nithin Raghavan, Utkarsh Singhal, Ravi Ramamoorthi, Jonathan Barron, and Ren Ng. Fourier features let networks learn high frequency functions in low dimensional domains. Advances in Neural Information Processing Systems, 33:7537–7547, 2020.

  51. [51] Aysim Toker, Qunjie Zhou, Maxim Maximov, and Laura Leal-Taixé. Coming down to earth: Satellite-to-street view synthesis for geo-localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6488–6497, 2021.

  52. [52] Zhan Tong, Yibing Song, Jue Wang, and Limin Wang. VideoMAE: Masked autoencoders are data-efficient learners for self-supervised video pre-training. Advances in Neural Information Processing Systems, 35:10078–10093, 2022.

  53. [53] Gonzalo Vaca-Castano, Amir Roshan Zamir, and Mubarak Shah. City scale geo-spatial trajectory estimation of a moving camera. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 1186–1193. IEEE, 2012.

  54. [54]

    Attention is all you need.Advances in neural information processing systems, 30, 2017

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszko- reit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need.Advances in neural information processing systems, 30, 2017. 2, 4, 1

  55. [55]

    Geoclip: Clip-inspired alignment be- tween locations and images for effective worldwide geo- localization.Advances in Neural Information Processing Systems, 36, 2024

    Vicente Vivanco Cepeda, Gaurav Kumar Nayak, and Mubarak Shah. Geoclip: Clip-inspired alignment be- tween locations and images for effective worldwide geo- localization.Advances in Neural Information Processing Systems, 36, 2024. 1, 2, 3, 6, 7, 8, 5

  56. [56]

    Revisiting im2gps in the deep learning era

    Nam V o, Nathan Jacobs, and James Hays. Revisiting im2gps in the deep learning era. InProceedings of the IEEE inter- national conference on computer vision, pages 2621–2630,

  57. [57]

    Gama: Cross- view video geo-localization

    Shruti Vyas, Chen Chen, and Mubarak Shah. Gama: Cross- view video geo-localization. InEuropean Conference on Computer Vision, pages 440–456. Springer, 2022. 2, 5, 6, 1, 3, 7, 8

  58. [58]

    Fairseq s2t: Fast speech-to-text modeling with fairseq.arXiv preprint arXiv:2010.05171, 2020

    Changhan Wang, Yun Tang, Xutai Ma, Anne Wu, Sravya Popuri, Dmytro Okhonko, and Juan Pino. Fairseq s2t: Fast speech-to-text modeling with fairseq.arXiv preprint arXiv:2010.05171, 2020. 4

  59. [59]

    Mapillary street-level sequences: A dataset for lifelong place recognition

    Frederik Warburg, Soren Hauberg, Manuel Lopez- Antequera, Pau Gargallo, Yubin Kuang, and Javier Civera. Mapillary street-level sequences: A dataset for lifelong place recognition. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2626–2635, 2020. 2, 5, 8, 1, 3, 7

  60. [60]

    Planet- photo geolocation with convolutional neural networks

    Tobias Weyand, Ilya Kostrikov, and James Philbin. Planet- photo geolocation with convolutional neural networks. In European Conference on Computer Vision, pages 37–55. Springer, 2016. 2, 6, 7, 5

  61. [61]

    Wide-area image geolocalization with aerial reference im- agery

    Scott Workman, Richard Souvenir, and Nathan Jacobs. Wide-area image geolocalization with aerial reference im- agery. InProceedings of the IEEE International Conference on Computer Vision, pages 3961–3969, 2015. 2

  62. [62]

    Cross-view geo-localization with layer-to-layer transformer.Advances in Neural Information Processing Systems, 34:29009–29020,

    Hongji Yang, Xiufan Lu, and Yingying Zhu. Cross-view geo-localization with layer-to-layer transformer.Advances in Neural Information Processing Systems, 34:29009–29020,

  63. [63]

    Bdd100k: A diverse driving dataset for heterogeneous multitask learning

    Fisher Yu, Haofeng Chen, Xin Wang, Wenqi Xian, Yingying Chen, Fangchen Liu, Vashisht Madhavan, and Trevor Dar- rell. Bdd100k: A diverse driving dataset for heterogeneous multitask learning. InProceedings of the IEEE/CVF con- ference on computer vision and pattern recognition, pages 2636–2645, 2020. 5, 6, 1, 7

  64. [64]

    Sigmoid loss for language image pre-training

    Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. InProceedings of the IEEE/CVF international conference on computer vision, pages 11975–11986, 2023. 8

  65. [65]

    Places: A 10 million image database for scene recognition.IEEE transactions on pattern analysis and machine intelligence, 40(6):1452–1464, 2017

    Bolei Zhou, Agata Lapedriza, Aditya Khosla, Aude Oliva, and Antonio Torralba. Places: A 10 million image database for scene recognition.IEEE transactions on pattern analysis and machine intelligence, 40(6):1452–1464, 2017. 5

  66. [66]

    Vigor: Cross- view image geo-localization beyond one-to-one retrieval

    Sijie Zhu, Taojiannan Yang, and Chen Chen. Vigor: Cross- view image geo-localization beyond one-to-one retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vi- sion and Pattern Recognition, pages 3640–3649, 2021. 1, 2

  67. [67]

    Transgeo: Trans- former is all you need for cross-view image geo-localization

    Sijie Zhu, Mubarak Shah, and Chen Chen. Transgeo: Trans- former is all you need for cross-view image geo-localization. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1162–1171, 2022. 1, 2 11 VidTAG: Temporally Aligned Video to GPS Geolocalization with Denoising Sequence Prediction at a Global Scale Supplementary...

  68. [68]

    (If there are a few outliers they can be skipped)

    For each region that contains a sizable amount of train- ing data, get the maximum and minimum of both lat- itude and longitude(LAT M IN,LON M IN,LAT M AX, LONM AX). (If there are a few outliers they can be skipped). 3 Table 11. Comparison of Uniform Grid Gallery evaluation and Validation-set Gallery evaluation on Mapillary(MSLS) dataset Model Uniform Gri...

  69. [69]

    Also determine the resolution of the gallery (the finer the resolution the larger the gallery

    To keep an error margin, determine a constant padding which is to be applied to (LAT M IN,LON M IN, LATM AX,LON M AX). Also determine the resolution of the gallery (the finer the resolution the larger the gallery

  70. [70]

    Determine new LATM IN =LAT M IN −padding, LATM AX =LAT M AX +padding, LONM IN =LON M IN −padding, LONM AX =LON M AX +padding

  71. [71]

    Compute distance(dis LAT ) between (LAT M IN, LONM IN) and (LAT M AX,LON M IN), and distance(disLON ) between (LAT M IN,LON M IN) and (LATM IN,LON M AX)

  72. [72]

    Generate a grid with each point at a uniform distance (resolution), dividing the area into equidistant points in both directions Npoints = (disLAT ∗dis LON )//resolution2 D.2. Evaluation with Val-set Gallery Retrieval A real world scenario entails not knowing the actual ground truth values pertaining to the queries, which is the reason for constructing a ...

  73. [73]

    Collect all model predictions and ground truth labels for a particular sequence

  74. [74]

    This will serve as the ground truth label of the entire video sequence

    Compute the centroid of the ground truth labels. This will serve as the ground truth label of the entire video sequence

  75. [75]

    Obtain the prediction that is closest to this centroid (this reduces error due to out- liers)

    Compute the centroid of predictions pertaining to all frames in the sequence. Obtain the prediction that is closest to this centroid (this reduces error due to out- liers). This will serve as the prediction for the entire video sequence

  76. [76]

    Compute distance accuracy at all thresholds using this distance measure

    Finally compute the distance between the prediction and the ground truth label. Compute distance accuracy at all thresholds using this distance measure. This enables us to effectively measure performance of a model at video level. However, this set of metrics cannot be used to judge the quality of formed trajectories. For this exact purpose we use DFD and...

  77. [77]

    As Mapillary(MSLS) is a smaller dataset, we take the entire training split

  78. [78]

    (b) As CityGuessr68k is very large, we sample the data in such a way that the number of sequences is roughly equivalent to MSLS

    For CityGuessr68k, (a) We remove training data from cities which are also present in MSLS to avoid false negatives in the con- trastive loss. (b) As CityGuessr68k is very large, we sample the data in such a way that the number of sequences is roughly equivalent to MSLS. (c) This is achieved by taking all sequences from cities with less than 80 sequences, ...

  79. [79]

    We train a model on this unified data for 200 epochs at an learning rate decay rate of 0.97

    For GaMa, we randomly sample videos so that the cor- pus size is equivalent to the other two. We train a model on this unified data for 200 epochs at an learning rate decay rate of 0.97. Rest of the settings are unchanged. Tables 12 and 13 show evaluation performance of our model on individual datasets. Across all 3 datasets, we observe that our unified m...