pith. sign in

arxiv: 2508.01608 · v2 · pith:JUIKTUPAnew · submitted 2025-08-03 · 💻 cs.CV

From Pixels to Places: A Systematic Benchmark for Evaluating Image Geolocalization Ability in Large Language Models

Pith reviewed 2026-05-21 23:45 UTC · model grok-4.3

classification 💻 cs.CV
keywords image geolocalizationlarge language modelsgeospatial biasvisual reasoningbenchmarkspatial intelligenceAI evaluation
0
0 comments X

The pith

Large language models display geospatial biases in image geolocalization, performing better in high-resource regions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents IMAGEO-Bench as a new evaluation framework for testing how accurately large language models can determine the geographic location shown in an image. Experiments with ten current LLMs across global street scenes, U.S. points of interest, and private unseen images reveal that closed-source models generally produce stronger location reasoning than open-source ones. The results also show consistent performance gaps, with higher accuracy in areas such as North America, Western Europe, and California and lower accuracy in underrepresented parts of the world. Regression analysis ties successful predictions mainly to the presence of urban settings, outdoor conditions, street-level views, and recognizable landmarks. These findings matter because image-based location inference supports crisis response, digital forensics, and location intelligence, and unchecked biases could limit reliability in less-documented regions.

Core claim

Experiments on IMAGEO-Bench demonstrate that closed-source LLMs generally outperform open-source models in image geolocalization accuracy and reasoning quality. LLMs achieve stronger results in high-resource regions such as North America, Western Europe, and California while showing degraded performance in underrepresented areas. Regression diagnostics indicate that successful geolocalization depends primarily on recognizing urban settings, outdoor environments, street-level imagery, and identifiable landmarks.

What carries the argument

IMAGEO-Bench, a benchmark that measures LLM geolocalization through accuracy, distance error, geospatial bias, and reasoning traces across three datasets of global street scenes, U.S. points of interest, and unseen private images.

If this is right

  • Closed-source LLMs are currently more reliable for image-based location tasks than open-source alternatives.
  • Geolocation-aware AI systems will inherit uneven accuracy favoring well-documented regions.
  • Applications in crisis response and digital forensics may underperform when images come from underrepresented areas.
  • Training data imbalances are a plausible driver of the observed regional performance gaps.
  • The benchmark supplies a repeatable method for tracking progress toward more balanced spatial reasoning in future models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Diversifying training images from low-resource regions could reduce the performance gaps observed in the benchmark.
  • The bias pattern suggests that visual reasoning in current LLMs is shaped more by example frequency than by universal spatial understanding.
  • Explicit geospatial modules or region-balanced fine-tuning might be needed to make geolocalization equitable across all locations.
  • In digital forensics, location evidence extracted from images in underrepresented regions may carry higher uncertainty.

Load-bearing premise

The regression diagnostics accurately isolate the primary drivers of successful geolocalization without significant confounding from dataset composition or model-specific training data.

What would settle it

Finding no measurable performance difference between high-resource regions such as California and low-resource regions across all tested models on the same image sets would falsify the geospatial bias claim.

Figures

Figures reproduced from arXiv: 2508.01608 by Bowei Li, Lingyao Li, Min Deng, Qikai Hu, Runlong Yu, Xiaowei Jia, Yang Zhou.

Figure 1
Figure 1. Figure 1: The illustrative framework to implement this study. (a) Data distribution for each benchmark dataset. (b) Sampled images from each [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Benchmark performance based on latitude prediction. (a) Dataset-GSS; (b) Dataset-UPC. Perfect predictions lie on the red dashed [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗
Figure 4
Figure 4. Figure 4: U.S. state-level averaged accuracy across models on Dataset [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Image elements are categorized into four semantic groups: [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Mixed unigram–bigram word cloud derived from the rea [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗
Figure 5
Figure 5. Figure 5: Log-odds coefficients from diagnostic logistic regression [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗
Figure 7
Figure 7. Figure 7: Sample images from the benchmark datasets: (a) Dataset-GSS; (b) Dataset-CUS; (c) Dataset-PCW. [PITH_FULL_IMAGE:figures/full_fig_p011_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Prompt used to elicit structured geolocation reasoning and prediction from LLMs. [PITH_FULL_IMAGE:figures/full_fig_p011_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Benchmark performance based on longitude prediction. (a) Dataset-GSS, and (b) Dataset-UPC. Perfect predictions lie on the red [PITH_FULL_IMAGE:figures/full_fig_p012_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Ridge regression coefficients on Dataset-GSS, using log [PITH_FULL_IMAGE:figures/full_fig_p013_10.png] view at source ↗
Figure 12
Figure 12. Figure 12: Ridge regression coefficients on Dataset-UPC, using log [PITH_FULL_IMAGE:figures/full_fig_p014_12.png] view at source ↗
Figure 14
Figure 14. Figure 14: Ridge regression coefficients on Dataset-PCW, using log [PITH_FULL_IMAGE:figures/full_fig_p015_14.png] view at source ↗
Figure 15
Figure 15. Figure 15: Mixed unigram–bigram word cloud derived from the reasoning fields: (a) gpt-4.1-mini, (b) gpt-4.1, (c) gemini-1.5-pro, (d) gemini-2.5- [PITH_FULL_IMAGE:figures/full_fig_p016_15.png] view at source ↗
read the original abstract

Image geolocalization, the task of identifying the geographic location depicted in an image, is important for applications in crisis response, digital forensics, and location-based intelligence. While recent advances in large language models (LLMs) offer new opportunities for visual reasoning, their ability to perform image geolocalization remains underexplored. In this study, we introduce a benchmark called IMAGEO-Bench that systematically evaluates accuracy, distance error, geospatial bias, and reasoning process. Our benchmark includes three diverse datasets covering global street scenes, points of interest (POIs) in the United States, and a private collection of unseen images. Through experiments on 10 state-of-the-art LLMs, including both open- and closed-source models, we reveal clear performance disparities, with closed-source models generally showing stronger reasoning. Importantly, we uncover geospatial biases as LLMs tend to perform better in high-resource regions (e.g., North America, Western Europe, and California) while exhibiting degraded performance in underrepresented areas. Regression diagnostics demonstrate that successful geolocalization is primarily dependent on recognizing urban settings, outdoor environments, street-level imagery, and identifiable landmarks. Overall, IMAGEO-Bench provides a rigorous lens into the spatial reasoning capabilities of LLMs and offers implications for building geolocation-aware AI systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper introduces IMAGEO-Bench, a new benchmark for assessing image geolocalization capabilities in LLMs. It evaluates 10 state-of-the-art models (open- and closed-source) across three datasets—global street scenes, US points of interest, and a private unseen collection—reporting accuracy, distance error, geospatial biases, and reasoning traces. Key findings include performance advantages for closed-source models and systematic geospatial biases favoring high-resource regions (North America, Western Europe, California) over underrepresented areas. Regression diagnostics are used to attribute successful geolocalization primarily to urban settings, outdoor environments, street-level imagery, and identifiable landmarks.

Significance. If the empirical results and regression hold after addressing potential confounds, this benchmark offers a timely, systematic lens on LLM spatial reasoning with direct relevance to applications in crisis response, forensics, and location intelligence. The multi-dataset design and inclusion of both quantitative metrics and qualitative reasoning analysis are strengths; the work also provides reproducible experimental protocols and falsifiable predictions about regional performance gaps.

major comments (2)
  1. [§4.3] §4.3 Regression Diagnostics: The claim that successful geolocalization depends primarily on urban settings, outdoor environments, street-level imagery, and identifiable landmarks rests on regression coefficients. However, the analysis does not report region fixed effects, dataset-specific controls, or explicit checks for feature prevalence imbalances across the three datasets (e.g., higher fraction of street-level/landmark images in North America/Western Europe subsets). This risks attributing performance gaps to the listed features when they may instead reflect training-data exposure or sampling composition, directly affecting the geospatial-bias conclusions.
  2. [§3.2] §3.2 Private Unseen Collection: The description of the private dataset lacks quantitative details on sample size, geographic distribution, collection protocol, and exclusion criteria. Without these, it is difficult to verify that the reported performance disparities and regression results generalize beyond the public datasets or to rule out inherited collection biases that could confound the primary-driver interpretation.
minor comments (3)
  1. [Table 2] Table 2: The distance-error metric is reported without units or normalization details, complicating direct comparison of absolute performance across models and regions.
  2. [Figure 3] Figure 3: The geospatial bias heatmaps would benefit from explicit legend values and a statement on how 'high-resource' vs. 'underrepresented' regions were thresholded.
  3. [§5] §5 Discussion: The implications for geolocation-aware AI systems are stated at a high level; adding concrete recommendations (e.g., data-augmentation strategies for underrepresented regions) would strengthen the applied contribution.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We are grateful to the referee for their insightful comments, which have helped us identify areas for improvement in our manuscript on IMAGEO-Bench. Below, we provide point-by-point responses to the major comments and describe the revisions we plan to implement.

read point-by-point responses
  1. Referee: [§4.3] §4.3 Regression Diagnostics: The claim that successful geolocalization depends primarily on urban settings, outdoor environments, street-level imagery, and identifiable landmarks rests on regression coefficients. However, the analysis does not report region fixed effects, dataset-specific controls, or explicit checks for feature prevalence imbalances across the three datasets (e.g., higher fraction of street-level/landmark images in North America/Western Europe subsets). This risks attributing performance gaps to the listed features when they may instead reflect training-data exposure or sampling composition, directly affecting the geospatial-bias conclusions.

    Authors: The referee correctly identifies a limitation in our regression diagnostics in §4.3. We did not include region fixed effects or perform explicit checks for feature prevalence imbalances, which could indeed affect the interpretation of whether the performance differences stem from the image characteristics or from differential exposure in training data. In the revised version, we will augment the regression analysis with region fixed effects and dataset-specific controls. We will also include an analysis of feature distributions across regions to address potential confounds. These changes will provide a more rigorous basis for our claims about the primary drivers of geolocalization success and the geospatial biases. revision: yes

  2. Referee: [§3.2] §3.2 Private Unseen Collection: The description of the private dataset lacks quantitative details on sample size, geographic distribution, collection protocol, and exclusion criteria. Without these, it is difficult to verify that the reported performance disparities and regression results generalize beyond the public datasets or to rule out inherited collection biases that could confound the primary-driver interpretation.

    Authors: We agree that additional details on the private unseen collection are necessary for full transparency. In the revised manuscript, we will provide quantitative information regarding the sample size, geographic distribution (in aggregated form to maintain privacy), collection protocol, and exclusion criteria. This will help readers evaluate the generalizability of the results and assess any potential collection biases that might influence the regression findings. revision: yes

Circularity Check

0 steps flagged

Empirical benchmark study with no derivations or self-referential reductions

full rationale

This is an empirical benchmark paper that introduces IMAGEO-Bench, evaluates 10 LLMs across three datasets (global street scenes, US POIs, private unseen images), and reports performance metrics, geospatial biases, and regression diagnostics on factors such as urban settings and landmarks. All central claims rest on experimental outcomes and statistical associations rather than any mathematical derivation chain, fitted parameters renamed as predictions, self-definitional constructs, or load-bearing self-citations. No equations, ansatzes, or uniqueness theorems appear in the provided text; the regression diagnostics are presented as post-hoc analysis of observed results, not as a closed loop that reduces the reported biases or drivers to inputs defined within the paper itself. The work is therefore self-contained against external benchmarks and falsifiable through replication on new images or models.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

This is an empirical benchmark paper with no mathematical derivations or theoretical constructs. The central claims rest on standard assumptions in AI evaluation and dataset representativeness rather than new free parameters or invented entities.

axioms (1)
  • domain assumption Standard machine learning evaluation metrics (accuracy, distance error) and regression analysis can reliably measure and explain LLM geolocalization performance.
    Invoked implicitly when reporting performance disparities and regression diagnostics in the abstract.

pith-pipeline@v0.9.0 · 5780 in / 1365 out tokens · 45970 ms · 2026-05-21T23:45:09.323776+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Unlocking Zero-Shot Geospatial Reasoning via Indirect Rewards

    cs.CV 2025-09 unverdicted novelty 6.0

    Geo-R1 uses indirect proxy rewards from cross-view alignment with geolocation metadata to drive reinforcement learning, enabling zero-shot geospatial reasoning that transfers across 25+ tasks and sometimes exceeds sup...

  2. A Guide to Using Social Media as a Geospatial Lens for Studying Public Opinion and Behavior

    cs.SI 2026-04 unverdicted novelty 3.0

    Social media data functions as passive geospatial sensing for public opinion and behavior via a structured workflow and case studies on topics like COVID-19 vaccines and urban accessibility.

Reference graph

Works this paper leans on

53 extracted references · 53 canonical work pages · cited by 2 Pith papers · 1 internal anchor

  1. [1]

    Relja Arandjelović, Petr Gronat, Akihiko Torii, Tomáš Pajdla, and Josef Sivic

  2. [2]

    NetVLAD: CNN Architecture for Weakly Supervised Place Recognition. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR). 5297–5307. https://doi.org/10.1109/CVPR.2016.572

  3. [3]

    Jan Brejcha and Martin Čadík. 2017. State-of-the-art in visual geo-localization. Pattern Analysis and Applications20, 3 (2017), 613–637

  4. [4]

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners.Advances in neural information processing systems33 (2020), 1877–1901

  5. [5]

    Boyuan Chen, Zhuo Xu, Sean Kirmani, Brain Ichter, Dorsa Sadigh, Leonidas Guibas, and Fei Xia. 2024. Spatialvlm: Endowing vision-language models with spatial reasoning capabilities. InProceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition. 14455–14465

  6. [6]

    Brandon Clark, Alec Kerrigan, Parth Parag Kulkarni, Vicente Vivanco Cepeda, and Mubarak Shah. 2023. Where we are and what we’re looking at: Query based worldwide image geo-localization using hierarchies and scenes. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 23182– 23190

  7. [7]

    Lizhou Fan, Wenyue Hua, Xiang Li, Kaijie Zhu, Mingyu Jin, Lingyao Li, Haoyang Ling, Jinkui Chi, Jindong Wang, Xin Ma, et al. 2024. Nphardeval4v: A dynamic reasoning benchmark of multimodal large language models.arXiv preprint arXiv:2403.01777(2024)

  8. [8]

    Jiaxian Guo, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Boyang Li, Dacheng Tao, and Steven Hoi. 2023. From images to textual prompts: Zero-shot visual question answering with frozen large language models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition. 10867–10877

  9. [9]

    James Hays and Alexei A Efros. 2008. IM2GPS: estimating geographic information from a single image. In2008 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 1–8. https://doi.org/10.1109/CVPR.2008.4587784

  10. [10]

    James Hays and Alexei A Efros. 2014. Large-scale image geolocalization. In Multimodal location estimation of videos and images. Springer, 41–62

  11. [11]

    Yujun Hou, Matias Quintana, Maxim Khomiakov, Winston Yap, Jiani Ouyang, Koichi Ito, Zeyu Wang, Tianhong Zhao, and Filip Biljecki. 2024. Global Streetscapes—A comprehensive dataset of 10 million street-level images across 688 cities for urban science and analytics.ISPRS Journal of Photogrammetry and Remote Sensing215 (2024), 216–238

  12. [12]

    Anwen Hu, Yaya Shi, Haiyang Xu, Jiabo Ye, Qinghao Ye, Ming Yan, Chenliang Li, Qi Qian, Ji Zhang, and Fei Huang. 2024. mplug-paperowl: Scientific diagram analysis with the multimodal large language model. InProceedings of the 32nd ACM International Conference on Multimedia. 6929–6938

  13. [13]

    Wenyue Hua, Kaijie Zhu, Lingyao Li, Lizhou Fan, Shuhang Lin, Mingyu Jin, Haochen Xue, Zelong Li, JinDong Wang, and Yongfeng Zhang. 2024. Disentan- gling logic: The role of context in large language model reasoning capabilities. arXiv preprint arXiv:2406.02787(2024)

  14. [14]

    Neel Jay, Hieu Minh Nguyen, Trung Dung Hoang, and Jacob Haimes. 2025. Evaluating precise geolocation inference capabilities of vision language models. arXiv preprint arXiv:2502.14412(2025)

  15. [15]

    Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yunhsuan Sung, Zhen Li, and Tom Duerig. 2021. Scaling Up Visual and Vision-Language Representation Learning with Noisy Text Supervision. InProc. Int. Conf. on Machine Learning (ICML). 4904–4916

  16. [16]

    Hyo Jin Kim, Enrique Dunn, and Jan-Michael Frahm. 2017. Learned contex- tual feature reweighting for image geo-localization. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2136–2145

  17. [17]

    VijayaKumar Kadha, Sambit Bakshi, and Santos Kumar Das. 2025. Unravelling Digital Forgeries: A Systematic Survey on Image Manipulation Detection and Localization.Comput. Surveys57, 12 (2025), 1–36

  18. [18]

    Jiayi Kuang, Ying Shen, Jingyou Xie, Haohao Luo, Zhe Xu, Ronghao Li, Yinghui Li, Xianfeng Cheng, Xika Lin, and Yu Han. 2025. Natural language understanding and inference with mllm in visual question answering: A survey.Comput. Surveys 57, 8 (2025), 1–36

  19. [19]

    Alexandre Lacoste, Nils Lehmann, Pau Rodriguez, Evan Sherwin, Hannah Kerner, Björn Lütjens, Jeremy Irvin, David Dao, Hamed Alemohammad, Alexandre Drouin, et al. 2023. Geo-bench: Toward foundation models for earth monitoring. Advances in Neural Information Processing Systems36 (2023), 51080–51093

  20. [20]

    Hao Li, Fabian Deuser, Wenping Yin, Xuanshu Luo, Paul Walther, Gengchen Mai, Wei Huang, and Martin Werner. 2025. Cross-view geolocalization and disaster mapping with street-view and VHR satellite imagery: A case study of Hurricane IAN.ISPRS Journal of Photogrammetry and Remote Sensing220 (2025), 841–854

  21. [21]

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. 2023. BLIP-2: Boot- strapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models. InProc. 40th Int. Conf. on Machine Learning (ICML), Vol. 202. PMLR, 19730–19742

  22. [22]

    Lingyao Li, Dawei Li, Zhenhui Ou, Xiaoran Xu, Jingxiao Liu, Zihui Ma, Runlong Yu, and Min Deng. 2025. LLMs as World Models: Data-Driven and Human- Centered Pre-Event Simulation for Disaster Impact Assessment.arXiv preprint arXiv:2506.06355(2025)

  23. [23]

    Ling Li, Yao Zhou, Yuxuan Liang, Fugee Tsung, and Jiaheng Wei. 2025. Recogni- tion through Reasoning: Reinforcing Image Geo-localization with Large Vision- Language Models.arXiv preprint arXiv:2506.14674(2025)

  24. [24]

    Xin Li, Yunfei Wu, Xinghua Jiang, Zhihao Guo, Mingming Gong, Haoyu Cao, Yinsong Liu, Deqiang Jiang, and Xing Sun. 2024. Enhancing visual document understanding with contrastive learning in large visual-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 15546–15555

  25. [25]

    Tsung-Yi Lin, Serge Belongie, and James Hays. 2013. Cross-view image geolo- calization. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 891–898

  26. [26]

    Eric Muller-Budack, Kader Pustu-Iren, and Ralph Ewerth. 2018. Geolocation estimation of photos using a hierarchical model and scene classification. In Proceedings of the European conference on computer vision (ECCV). 563–579

  27. [27]

    Shraman Pramanick, Ewa M Nowara, Joshua Gleason, Carlos D Castillo, and Rama Chellappa. 2022. Where in the world is this image? transformer-based geo-localization in the wild. InEuropean Conference on Computer Vision. Springer, 196–215

  28. [28]

    Alec Radford, Jong Wook Kim, and Christopheret al.Hallacy. 2021. Learning Transferable Visual Models from Natural Language Supervision. InProc. Int. Conf. on Machine Learning (ICML). 8748–8763

  29. [29]

    Noe Samano, Mengjie Zhou, and Andrew Calway. 2020. You are here: Geolocation by embedding maps and images. InEuropean Conference on Computer Vision. Springer, 502–518

  30. [30]

    Paul Hongsuck Seo, Tobias Weyand, Jack Sim, and Bohyung Han. 2018. Cplanet: Enhancing image geolocalization by combinatorial partitioning of maps. InPro- ceedings of the European Conference on Computer Vision (ECCV). 536–551

  31. [31]

    David G Shatwell, Ishan Rajendrakumar Dave, Sirnam Swetha, and Mubarak Shah. 2025. GT-Loc: Unifying When and Where in Images Through a Joint Embedding Space.arXiv preprint arXiv:2507.10473(2025)

  32. [32]

    Zhicheng Shi, Yang Li, Siyu Li, and Jiebo Luo. 2020. SAFA: Structure-Aware Feature Aggregation for Cross-View Image-Based Geo-Localization. InProc. ACM Int. Conf. on Multimedia (MM). 1633–1641. https://doi.org/10.1145/3394171. 3413569

  33. [33]

    Kexin Tian, Jingrui Mao, Yunlong Zhang, Jiwan Jiang, Yang Zhou, and Zhengzhong Tu. 2025. Nuscenes-spatialqa: A spatial understanding and rea- soning benchmark for vision-language models in autonomous driving.arXiv preprint arXiv:2504.03164(2025)

  34. [34]

    Yicong Tian, Chen Chen, and Mubarak Shah. 2017. Cross-view image matching for geo-localization in urban environments. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3608–3616

  35. [35]

    Vicente Vivanco Cepeda, Gaurav Kumar Nayak, and Mubarak Shah. 2023. Geo- CLIP: Clip-Inspired Alignment between Locations and Images for Effective World- wide Geo-localization. InAdvances in Neural Information Processing Systems (NeurIPS)

  36. [36]

    Nam Vo and David W. Jacobs. 2017. Revisiting IM2GPS in the Deep Learning Era. arXiv preprint arXiv:1705.04838(2017)

  37. [37]

    Zhiqiang Wang, Dejia Xu, Rana Muhammad Shahroz Khan, Yanbin Lin, Zhiwen Fan, and Xingquan Zhu. 2024. Llmgeo: Benchmarking large language models on image geolocation in-the-wild.arXiv preprint arXiv:2405.20363(2024)

  38. [38]

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022. Chain-of-thought prompting elicits reasoning in large language models.Advances in neural information processing systems35 (2022), 24824–24837

  39. [39]

    Tobias Weyand, Ilya Kostrikov, and James Philbin. 2016. PlaNet - Photo Geoloca- tion with Convolutional Neural Networks. InEuropean Conference on Computer Vision (ECCV). Springer, 37–55. https://doi.org/10.1007/978-3-319-46484-8_3

  40. [40]

    Scott Workman, Richard Souvenir, and Nathan Jacobs. 2015. Wide-area image geolocalization with aerial reference imagery. InProceedings of the IEEE Interna- tional Conference on Computer Vision. 3961–3969

  41. [41]

    Meiliu Wu and Qunying Huang. 2022. Im2city: image geo-localization via multi- modal learning. InProceedings of the 5th ACM SIGSPATIAL International Workshop on AI for Geographic Knowledge Discovery. 50–61

  42. [42]

    Wenshan Wu, Shaoguang Mao, Yadong Zhang, Yan Xia, Li Dong, Lei Cui, and Furu Wei. 2024. Mind’s eye of LLMs: visualization-of-thought elicits spatial reasoning in large language models.Advances in Neural Information Processing Systems37 (2024), 90277–90317

  43. [43]

    Shixiong Xu, Chenghao Zhang, Lubin Fan, Gaofeng Meng, Shiming Xiang, and Jieping Ye. 2024. Addressclip: Empowering vision-language models for city-wide image address localization. InEuropean Conference on Computer Vision. Springer, 76–92

  44. [44]

    An Yan, Zhankui He, Jiacheng Li, Tianyang Zhang, and Julian McAuley. 2023. Personalized showcases: Generating multi-modal explanations for recommenda- tions. InProceedings of the 46th International ACM SIGIR Conference on Research Conference acronym ’XX, , Lingyao Li, Runlong Yu, Qikai Hu, Bowei Li, Min Deng, Yang Zhou, and Xiaowei Jia and Development in ...

  45. [45]

    Jihan Yang, Shusheng Yang, Anjali W Gupta, Rilyn Han, Li Fei-Fei, and Saining Xie

  46. [46]

    InProceedings of the Computer Vision and Pattern Recognition Conference

    Thinking in space: How multimodal large language models see, remember, and recall spaces. InProceedings of the Computer Vision and Pattern Recognition Conference. 10632–10643

  47. [47]

    Qiang Yi and Lianlei Shan. 2025. Geolocsft: Efficient visual geolocation via supervised fine-tuning of multimodal foundation models.arXiv preprint arXiv:2506.01277(2025)

  48. [48]

    Wenping Yin, Yong Xue, Ziqi Liu, Hao Li, and Martin Werner. 2025. LLM- enhanced disaster geolocalization using implicit geoinformation from multimodal data: A case study of Hurricane Harvey.International Journal of Applied Earth Observation and Geoinformation137 (2025), 104423

  49. [49]

    Amir Roshan Zamir and Mubarak Shah. 2014. Image geo-localization based on multiplenearest neighbor feature matching usinggeneralized graphs.IEEE transactions on pattern analysis and machine intelligence36, 8 (2014), 1546–1558

  50. [50]

    Yanhua Zhong, Yuqiang Wu, Sheng Zheng, Yi Yang, and Zhiwu Ma. 2021. VIGOR: Cross-View Image Geo-Localization Beyond One-to-One Retrieval. In Proc. IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR). 8636–

  51. [51]

    https://doi.org/10.1109/CVPR46437.2021.00853

  52. [52]

    Zhongliang Zhou, Jielu Zhang, Zihan Guan, Mengxuan Hu, Ni Lao, Lan Mu, Sheng Li, and Gengchen Mai. 2024. Img2Loc: Revisiting image geolocalization using multi-modality foundation models and image-based retrieval-augmented generation. InProceedings of the 47th international acm sigir conference on research and development in information retrieval. 2749–2754

  53. [53]

    You a r e an AI a s s i s t a n t s p e c i a l i z e d i n 2g e o c o d i n g a n a l y s i s from images

    Sijie Zhu, Mubarak Shah, and Chen Chen. 2022. Transgeo: Transformer is all you need for cross-view image geo-localization. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 1162–1171. A Sample Images from Benchmark Datasets Sample images from our three datasets in the benchmark are pre- sented in Figure 7. B Data Distrib...