From Pixels to Places: A Systematic Benchmark for Evaluating Image Geolocalization Ability in Large Language Models

Bowei Li; Lingyao Li; Min Deng; Qikai Hu; Runlong Yu; Xiaowei Jia; Yang Zhou

arxiv: 2508.01608 · v2 · pith:JUIKTUPAnew · submitted 2025-08-03 · 💻 cs.CV

Lingyao Li, Runlong Yu, Qikai Hu, Bowei Li, Min Deng, Yang Zhou, Xiaowei Jia

Pith reviewed 2026-05-21 23:45 UTC · model grok-4.3

classification 💻 cs.CV

keywords image geolocalization · large language models · geospatial bias · visual reasoning · benchmark · spatial intelligence · AI evaluation

The pith

Large language models display geospatial biases in image geolocalization, performing better in high-resource regions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents IMAGEO-Bench as a new evaluation framework for testing how accurately large language models can determine the geographic location shown in an image. Experiments with ten current LLMs across global street scenes, U.S. points of interest, and private unseen images reveal that closed-source models generally produce stronger location reasoning than open-source ones. The results also show consistent performance gaps, with higher accuracy in areas such as North America, Western Europe, and California and lower accuracy in underrepresented parts of the world. Regression analysis ties successful predictions mainly to the presence of urban settings, outdoor conditions, street-level views, and recognizable landmarks. These findings matter because image-based location inference supports crisis response, digital forensics, and location intelligence, and unchecked biases could limit reliability in less-documented regions.

Core claim

Experiments on IMAGEO-Bench demonstrate that closed-source LLMs generally outperform open-source models in image geolocalization accuracy and reasoning quality. LLMs achieve stronger results in high-resource regions such as North America, Western Europe, and California while showing degraded performance in underrepresented areas. Regression diagnostics indicate that successful geolocalization depends primarily on recognizing urban settings, outdoor environments, street-level imagery, and identifiable landmarks.

What carries the argument

IMAGEO-Bench, a benchmark that measures LLM geolocalization through accuracy, distance error, geospatial bias, and reasoning traces across three datasets of global street scenes, U.S. points of interest, and unseen private images.
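
As a rough illustration of how the accuracy and distance-error metrics named above are commonly computed, the following is a minimal sketch, not the paper's implementation. It assumes predictions and ground truth are (latitude, longitude) pairs in degrees, uses the haversine great-circle formula, and treats a prediction as correct when it falls within a chosen kilometer threshold. The function names, the Earth-radius constant, the threshold, and the sample coordinates are all illustrative assumptions.

```python
import math

# Mean Earth radius in km; an assumed constant, not taken from the paper.
EARTH_RADIUS_KM = 6371.0088


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot before asin.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def accuracy_within(predictions, truths, threshold_km):
    """Fraction of predictions whose distance error is at most threshold_km."""
    errors = [haversine_km(p[0], p[1], t[0], t[1])
              for p, t in zip(predictions, truths)]
    return sum(e <= threshold_km for e in errors) / len(errors)


# Hypothetical coordinates for illustration only (not benchmark data).
predicted = [(40.71, -74.00), (48.85, 2.35)]
actual = [(40.73, -73.99), (51.51, -0.13)]
print(accuracy_within(predicted, actual, threshold_km=25.0))
```

Reporting the threshold and the unit alongside each number makes results comparable across models and regions.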

If this is right

Closed-source LLMs are currently more reliable for image-based location tasks than open-source alternatives.
Geolocation-aware AI systems will inherit uneven accuracy favoring well-documented regions.
Applications in crisis response and digital forensics may underperform when images come from underrepresented areas.
Training data imbalances are a plausible driver of the observed regional performance gaps.
The benchmark supplies a repeatable method for tracking progress toward more balanced spatial reasoning in future models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Diversifying training images from low-resource regions could reduce the performance gaps observed in the benchmark.
The bias pattern suggests that visual reasoning in current LLMs is shaped more by example frequency than by universal spatial understanding.
Explicit geospatial modules or region-balanced fine-tuning might be needed to make geolocalization equitable across all locations.
In digital forensics, location evidence extracted from images in underrepresented regions may carry higher uncertainty.

Load-bearing premise

The regression diagnostics accurately isolate the primary drivers of successful geolocalization without significant confounding from dataset composition or model-specific training data.

What would settle it

Finding no measurable performance difference between high-resource regions such as California and low-resource regions across all tested models on the same image sets would falsify the geospatial bias claim.
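
One way such a check could be run is a permutation test on per-image correctness for two regions. The sketch below is an assumption about how a reader might test the gap, not a procedure described in the paper. The function name, the 0/1 outcome encoding, and the sample outcomes are hypothetical.

```python
import random


def accuracy_gap_permutation_test(hits_a, hits_b, n_permutations=10_000, seed=0):
    """Two-sided permutation test for a difference in mean accuracy.

    hits_a, hits_b: lists of 0/1 outcomes (1 = correct) for two regions.
    Returns (observed_gap, p_value).
    """
    rng = random.Random(seed)
    n_a = len(hits_a)
    observed = sum(hits_a) / n_a - sum(hits_b) / len(hits_b)
    pooled = list(hits_a) + list(hits_b)
    n_b = len(pooled) - n_a
    extreme = 0
    for _ in range(n_permutations):
        # Reassign region labels at random and recompute the gap.
        rng.shuffle(pooled)
        gap = sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / n_b
        if abs(gap) >= abs(observed) - 1e-12:
            extreme += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return observed, (extreme + 1) / (n_permutations + 1)


# Hypothetical outcomes for illustration only (not results from the paper).
region_high = [1, 1, 1, 0, 1, 1, 0, 1]
region_low = [0, 1, 0, 0, 1, 0, 0, 1]
gap, p_value = accuracy_gap_permutation_test(region_high, region_low)
print(gap, p_value)
```

A large p-value across all models on matched image sets would be the kind of evidence that counts against the bias claim; a small one only shows a gap exists, not why.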

Figures

Figures reproduced from arXiv: 2508.01608 by Bowei Li, Lingyao Li, Min Deng, Qikai Hu, Runlong Yu, Xiaowei Jia, Yang Zhou.

**Figure 1.** The illustrative framework to implement this study. (a) Data distribution for each benchmark dataset. (b) Sampled images from each ...

**Figure 2.** Benchmark performance based on latitude prediction. (a) Dataset-GSS; (b) Dataset-UPC. Perfect predictions lie on the red dashed ...

**Figure 4.** U.S. state-level averaged accuracy across models on Dataset ...

**Figure 5.** Image elements are categorized into four semantic groups: ...

**Figure 6.** Mixed unigram–bigram word cloud derived from the rea...

**Figure 5.** Log-odds coefficients from diagnostic logistic regression ...

**Figure 7.** Sample images from the benchmark datasets: (a) Dataset-GSS; (b) Dataset-CUS; (c) Dataset-PCW.

**Figure 8.** Prompt used to elicit structured geolocation reasoning and prediction from LLMs.

**Figure 9.** Benchmark performance based on longitude prediction. (a) Dataset-GSS, and (b) Dataset-UPC. Perfect predictions lie on the red ...

**Figure 10.** Ridge regression coefficients on Dataset-GSS, using log ...

**Figure 12.** Ridge regression coefficients on Dataset-UPC, using log ...

**Figure 14.** Ridge regression coefficients on Dataset-PCW, using log ...

**Figure 15.** Mixed unigram–bigram word cloud derived from the reasoning fields: (a) gpt-4.1-mini, (b) gpt-4.1, (c) gemini-1.5-pro, (d) gemini-2.5-...

Original abstract

Image geolocalization, the task of identifying the geographic location depicted in an image, is important for applications in crisis response, digital forensics, and location-based intelligence. While recent advances in large language models (LLMs) offer new opportunities for visual reasoning, their ability to perform image geolocalization remains underexplored. In this study, we introduce a benchmark called IMAGEO-Bench that systematically evaluates accuracy, distance error, geospatial bias, and reasoning process. Our benchmark includes three diverse datasets covering global street scenes, points of interest (POIs) in the United States, and a private collection of unseen images. Through experiments on 10 state-of-the-art LLMs, including both open- and closed-source models, we reveal clear performance disparities, with closed-source models generally showing stronger reasoning. Importantly, we uncover geospatial biases as LLMs tend to perform better in high-resource regions (e.g., North America, Western Europe, and California) while exhibiting degraded performance in underrepresented areas. Regression diagnostics demonstrate that successful geolocalization is primarily dependent on recognizing urban settings, outdoor environments, street-level imagery, and identifiable landmarks. Overall, IMAGEO-Bench provides a rigorous lens into the spatial reasoning capabilities of LLMs and offers implications for building geolocation-aware AI systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

This paper builds IMAGEO-Bench to test LLM image geolocalization and reports clear performance gaps favoring high-resource regions, though the regression on driving features may be picking up dataset imbalances.

Hey, about arXiv:2508.01608. The main things here are a new benchmark called IMAGEO-Bench and the finding that LLMs geolocate images better in places like North America and Western Europe than in underrepresented areas. They test ten models across three datasets and note that closed-source ones show stronger reasoning overall. That bias result is the part with the most direct implications for applications like forensics or crisis mapping.

The setup covers accuracy, distance error, bias measurement, and some look at reasoning steps, which gives a more structured view than scattered prior tests. The three datasets add some variety: global street scenes, US points of interest, and a private unseen collection. That combination lets them compare performance patterns without relying on a single source.

The regression diagnostics tie success to urban settings, outdoor scenes, street-level views, and landmarks. This is a reasonable starting claim, but the stress-test concern lands. If those features appear more often in the high-resource subsets by how the data was gathered, the coefficients could reflect sampling rather than what the models actually use. The abstract does not lay out the exact controls or feature counts by region, so the full paper needs to show that the regression isolates the intended effects. Dataset sizes, exact metrics, and error bars are also light in the summary, which makes it harder to judge how stable the numbers are.

This work is aimed at people evaluating or improving multimodal LLMs for spatial tasks. A reader who needs a concrete way to measure geolocalization ability across models and geographies will find the benchmark and the reported patterns useful to reference or extend. The paper shows straightforward engagement with the evaluation problem and produces new empirical results rather than restating old ones. It deserves peer review so the methods and statistical details can be checked directly.

Referee Report

2 major / 3 minor

Summary. The paper introduces IMAGEO-Bench, a new benchmark for assessing image geolocalization capabilities in LLMs. It evaluates 10 state-of-the-art models (open- and closed-source) across three datasets—global street scenes, US points of interest, and a private unseen collection—reporting accuracy, distance error, geospatial biases, and reasoning traces. Key findings include performance advantages for closed-source models and systematic geospatial biases favoring high-resource regions (North America, Western Europe, California) over underrepresented areas. Regression diagnostics are used to attribute successful geolocalization primarily to urban settings, outdoor environments, street-level imagery, and identifiable landmarks.

Significance. If the empirical results and regression hold after addressing potential confounds, this benchmark offers a timely, systematic lens on LLM spatial reasoning with direct relevance to applications in crisis response, forensics, and location intelligence. The multi-dataset design and inclusion of both quantitative metrics and qualitative reasoning analysis are strengths; the work also provides reproducible experimental protocols and falsifiable predictions about regional performance gaps.

major comments (2)

[§4.3] §4.3 Regression Diagnostics: The claim that successful geolocalization depends primarily on urban settings, outdoor environments, street-level imagery, and identifiable landmarks rests on regression coefficients. However, the analysis does not report region fixed effects, dataset-specific controls, or explicit checks for feature prevalence imbalances across the three datasets (e.g., higher fraction of street-level/landmark images in North America/Western Europe subsets). This risks attributing performance gaps to the listed features when they may instead reflect training-data exposure or sampling composition, directly affecting the geospatial-bias conclusions.
[§3.2] §3.2 Private Unseen Collection: The description of the private dataset lacks quantitative details on sample size, geographic distribution, collection protocol, and exclusion criteria. Without these, it is difficult to verify that the reported performance disparities and regression results generalize beyond the public datasets or to rule out inherited collection biases that could confound the primary-driver interpretation.
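
The feature-prevalence check requested in the first major comment could be sketched as follows. This is a minimal illustration under assumptions, not the paper's analysis: it supposes each image is a record with a region label, boolean feature flags, and a correctness flag, then reports how often a feature appears per region and the accuracy within each (region, feature value) cell. The record layout, the field names `region`, `street_level`, and `correct`, and the toy data are hypothetical.

```python
from collections import defaultdict


def prevalence_by_region(records, feature):
    """Share of images in each region for which `feature` is True."""
    counts = defaultdict(lambda: [0, 0])  # region -> [feature_count, total]
    for r in records:
        counts[r["region"]][0] += int(bool(r[feature]))
        counts[r["region"]][1] += 1
    return {region: c[0] / c[1] for region, c in counts.items()}


def stratified_accuracy(records, feature):
    """Accuracy within each (region, feature value) cell."""
    cells = defaultdict(lambda: [0, 0])  # (region, value) -> [correct, total]
    for r in records:
        key = (r["region"], bool(r[feature]))
        cells[key][0] += int(bool(r["correct"]))
        cells[key][1] += 1
    return {key: c[0] / c[1] for key, c in cells.items()}


# Toy records for illustration only (not data from the benchmark).
records = [
    {"region": "A", "street_level": True, "correct": True},
    {"region": "A", "street_level": True, "correct": True},
    {"region": "A", "street_level": True, "correct": False},
    {"region": "A", "street_level": False, "correct": False},
    {"region": "B", "street_level": True, "correct": True},
    {"region": "B", "street_level": False, "correct": False},
    {"region": "B", "street_level": False, "correct": False},
    {"region": "B", "street_level": False, "correct": True},
]
print(prevalence_by_region(records, "street_level"))
print(stratified_accuracy(records, "street_level"))
```

If a feature is much more common in one region, a pooled coefficient for that feature can absorb part of the regional gap; comparing accuracy within cells is a simple first look at whether that is happening.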

minor comments (3)

[Table 2] Table 2: The distance-error metric is reported without units or normalization details, complicating direct comparison of absolute performance across models and regions.
[Figure 3] Figure 3: The geospatial bias heatmaps would benefit from explicit legend values and a statement on how 'high-resource' vs. 'underrepresented' regions were thresholded.
[§5] §5 Discussion: The implications for geolocation-aware AI systems are stated at a high level; adding concrete recommendations (e.g., data-augmentation strategies for underrepresented regions) would strengthen the applied contribution.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We are grateful to the referee for their insightful comments, which have helped us identify areas for improvement in our manuscript on IMAGEO-Bench. Below, we provide point-by-point responses to the major comments and describe the revisions we plan to implement.

Referee: [§4.3] §4.3 Regression Diagnostics: The claim that successful geolocalization depends primarily on urban settings, outdoor environments, street-level imagery, and identifiable landmarks rests on regression coefficients. However, the analysis does not report region fixed effects, dataset-specific controls, or explicit checks for feature prevalence imbalances across the three datasets (e.g., higher fraction of street-level/landmark images in North America/Western Europe subsets). This risks attributing performance gaps to the listed features when they may instead reflect training-data exposure or sampling composition, directly affecting the geospatial-bias conclusions.

Authors: The referee correctly identifies a limitation in our regression diagnostics in §4.3. We did not include region fixed effects or perform explicit checks for feature prevalence imbalances, which could indeed affect the interpretation of whether the performance differences stem from the image characteristics or from differential exposure in training data. In the revised version, we will augment the regression analysis with region fixed effects and dataset-specific controls. We will also include an analysis of feature distributions across regions to address potential confounds. These changes will provide a more rigorous basis for our claims about the primary drivers of geolocalization success and the geospatial biases.
Revision: yes
Referee: [§3.2] §3.2 Private Unseen Collection: The description of the private dataset lacks quantitative details on sample size, geographic distribution, collection protocol, and exclusion criteria. Without these, it is difficult to verify that the reported performance disparities and regression results generalize beyond the public datasets or to rule out inherited collection biases that could confound the primary-driver interpretation.

Authors: We agree that additional details on the private unseen collection are necessary for full transparency. In the revised manuscript, we will provide quantitative information regarding the sample size, geographic distribution (in aggregated form to maintain privacy), collection protocol, and exclusion criteria. This will help readers evaluate the generalizability of the results and assess any potential collection biases that might influence the regression findings.
Revision: yes
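
For readers who want the regression change from the first response written out, one possible specification is sketched below. The notation is an assumption introduced here, not taken from the paper: $y_i$ marks whether image $i$ was geolocated correctly, $x_{ik}$ are the image-feature indicators (urban, outdoor, street-level, landmark), $\gamma_{r(i)}$ is a fixed effect for the region of image $i$, and $\delta_{d(i)}$ is a control for its dataset.

```latex
\operatorname{logit}\,\Pr(y_i = 1)
  = \beta_0 + \sum_{k} \beta_k\, x_{ik} + \gamma_{r(i)} + \delta_{d(i)},
\qquad
\operatorname{logit}(p) = \log\frac{p}{1 - p}
```

Under this form, $\exp(\beta_k)$ is the odds ratio for feature $k$ among images from the same region and dataset, which is the comparison the referee's confounding concern calls for.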

Circularity Check

0 steps flagged

Empirical benchmark study with no derivations or self-referential reductions

full rationale

This is an empirical benchmark paper that introduces IMAGEO-Bench, evaluates 10 LLMs across three datasets (global street scenes, US POIs, private unseen images), and reports performance metrics, geospatial biases, and regression diagnostics on factors such as urban settings and landmarks. All central claims rest on experimental outcomes and statistical associations rather than any mathematical derivation chain, fitted parameters renamed as predictions, self-definitional constructs, or load-bearing self-citations. No equations, ansatzes, or uniqueness theorems appear in the provided text; the regression diagnostics are presented as post-hoc analysis of observed results, not as a closed loop that reduces the reported biases or drivers to inputs defined within the paper itself. The work is therefore self-contained against external benchmarks and falsifiable through replication on new images or models.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

This is an empirical benchmark paper with no mathematical derivations or theoretical constructs. The central claims rest on standard assumptions in AI evaluation and dataset representativeness rather than new free parameters or invented entities.

axioms (1)

Domain assumption: Standard machine learning evaluation metrics (accuracy, distance error) and regression analysis can reliably measure and explain LLM geolocalization performance.
Invoked implicitly when reporting performance disparities and regression diagnostics in the abstract.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "Regression diagnostics demonstrate that successful geolocalization is primarily dependent on recognizing urban settings, outdoor environments, street-level imagery, and identifiable landmarks."

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "We uncover geospatial biases as LLMs tend to perform better in high-resource regions"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Unlocking Zero-Shot Geospatial Reasoning via Indirect Rewards
cs.CV 2025-09 unverdicted novelty 6.0

Geo-R1 uses indirect proxy rewards from cross-view alignment with geolocation metadata to drive reinforcement learning, enabling zero-shot geospatial reasoning that transfers across 25+ tasks and sometimes exceeds sup...

A Guide to Using Social Media as a Geospatial Lens for Studying Public Opinion and Behavior
cs.SI 2026-04 unverdicted novelty 3.0

Social media data functions as passive geospatial sensing for public opinion and behavior via a structured workflow and case studies on topics like COVID-19 vaccines and urban accessibility.

Reference graph

Works this paper leans on

53 extracted references · 53 canonical work pages · cited by 2 Pith papers · 1 internal anchor

[1]

Relja Arandjelović, Petr Gronat, Akihiko Torii, Tomáš Pajdla, and Josef Sivic

work page
[2]

NetVLAD: CNN Architecture for Weakly Supervised Place Recognition. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR). 5297–5307. https://doi.org/10.1109/CVPR.2016.572

work page doi:10.1109/cvpr.2016.572 2016
[3]

Jan Brejcha and Martin Čadík. 2017. State-of-the-art in visual geo-localization. Pattern Analysis and Applications20, 3 (2017), 613–637

work page 2017
[4]

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners.Advances in neural information processing systems33 (2020), 1877–1901

work page 2020
[5]

Boyuan Chen, Zhuo Xu, Sean Kirmani, Brain Ichter, Dorsa Sadigh, Leonidas Guibas, and Fei Xia. 2024. Spatialvlm: Endowing vision-language models with spatial reasoning capabilities. InProceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition. 14455–14465

work page 2024
[6]

Brandon Clark, Alec Kerrigan, Parth Parag Kulkarni, Vicente Vivanco Cepeda, and Mubarak Shah. 2023. Where we are and what we’re looking at: Query based worldwide image geo-localization using hierarchies and scenes. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 23182– 23190

work page 2023
[7]

Lizhou Fan, Wenyue Hua, Xiang Li, Kaijie Zhu, Mingyu Jin, Lingyao Li, Haoyang Ling, Jinkui Chi, Jindong Wang, Xin Ma, et al. 2024. Nphardeval4v: A dynamic reasoning benchmark of multimodal large language models.arXiv preprint arXiv:2403.01777(2024)

work page arXiv 2024
[8]

Jiaxian Guo, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Boyang Li, Dacheng Tao, and Steven Hoi. 2023. From images to textual prompts: Zero-shot visual question answering with frozen large language models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition. 10867–10877

work page 2023
[9]

James Hays and Alexei A Efros. 2008. IM2GPS: estimating geographic information from a single image. In2008 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 1–8. https://doi.org/10.1109/CVPR.2008.4587784

work page doi:10.1109/cvpr.2008.4587784 2008
[10]

James Hays and Alexei A Efros. 2014. Large-scale image geolocalization. In Multimodal location estimation of videos and images. Springer, 41–62

work page 2014
[11]

Yujun Hou, Matias Quintana, Maxim Khomiakov, Winston Yap, Jiani Ouyang, Koichi Ito, Zeyu Wang, Tianhong Zhao, and Filip Biljecki. 2024. Global Streetscapes—A comprehensive dataset of 10 million street-level images across 688 cities for urban science and analytics.ISPRS Journal of Photogrammetry and Remote Sensing215 (2024), 216–238

work page 2024
[12]

Anwen Hu, Yaya Shi, Haiyang Xu, Jiabo Ye, Qinghao Ye, Ming Yan, Chenliang Li, Qi Qian, Ji Zhang, and Fei Huang. 2024. mplug-paperowl: Scientific diagram analysis with the multimodal large language model. InProceedings of the 32nd ACM International Conference on Multimedia. 6929–6938

work page 2024
[13]

Wenyue Hua, Kaijie Zhu, Lingyao Li, Lizhou Fan, Shuhang Lin, Mingyu Jin, Haochen Xue, Zelong Li, JinDong Wang, and Yongfeng Zhang. 2024. Disentan- gling logic: The role of context in large language model reasoning capabilities. arXiv preprint arXiv:2406.02787(2024)

work page arXiv 2024
[14]

Neel Jay, Hieu Minh Nguyen, Trung Dung Hoang, and Jacob Haimes. 2025. Evaluating precise geolocation inference capabilities of vision language models. arXiv preprint arXiv:2502.14412(2025)

work page arXiv 2025
[15]

Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yunhsuan Sung, Zhen Li, and Tom Duerig. 2021. Scaling Up Visual and Vision-Language Representation Learning with Noisy Text Supervision. InProc. Int. Conf. on Machine Learning (ICML). 4904–4916

work page 2021
[16]

Hyo Jin Kim, Enrique Dunn, and Jan-Michael Frahm. 2017. Learned contex- tual feature reweighting for image geo-localization. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2136–2145

work page 2017
[17]

VijayaKumar Kadha, Sambit Bakshi, and Santos Kumar Das. 2025. Unravelling Digital Forgeries: A Systematic Survey on Image Manipulation Detection and Localization.Comput. Surveys57, 12 (2025), 1–36

work page 2025
[18]

Jiayi Kuang, Ying Shen, Jingyou Xie, Haohao Luo, Zhe Xu, Ronghao Li, Yinghui Li, Xianfeng Cheng, Xika Lin, and Yu Han. 2025. Natural language understanding and inference with mllm in visual question answering: A survey.Comput. Surveys 57, 8 (2025), 1–36

work page 2025
[19]

Alexandre Lacoste, Nils Lehmann, Pau Rodriguez, Evan Sherwin, Hannah Kerner, Björn Lütjens, Jeremy Irvin, David Dao, Hamed Alemohammad, Alexandre Drouin, et al. 2023. Geo-bench: Toward foundation models for earth monitoring. Advances in Neural Information Processing Systems36 (2023), 51080–51093

work page 2023
[20]

Hao Li, Fabian Deuser, Wenping Yin, Xuanshu Luo, Paul Walther, Gengchen Mai, Wei Huang, and Martin Werner. 2025. Cross-view geolocalization and disaster mapping with street-view and VHR satellite imagery: A case study of Hurricane IAN.ISPRS Journal of Photogrammetry and Remote Sensing220 (2025), 841–854

work page 2025
[21]

Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. 2023. BLIP-2: Boot- strapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models. InProc. 40th Int. Conf. on Machine Learning (ICML), Vol. 202. PMLR, 19730–19742

work page 2023
[22]

Lingyao Li, Dawei Li, Zhenhui Ou, Xiaoran Xu, Jingxiao Liu, Zihui Ma, Runlong Yu, and Min Deng. 2025. LLMs as World Models: Data-Driven and Human- Centered Pre-Event Simulation for Disaster Impact Assessment.arXiv preprint arXiv:2506.06355(2025)

work page arXiv 2025
[23]

Ling Li, Yao Zhou, Yuxuan Liang, Fugee Tsung, and Jiaheng Wei. 2025. Recogni- tion through Reasoning: Reinforcing Image Geo-localization with Large Vision- Language Models.arXiv preprint arXiv:2506.14674(2025)

work page arXiv 2025
[24]

Xin Li, Yunfei Wu, Xinghua Jiang, Zhihao Guo, Mingming Gong, Haoyu Cao, Yinsong Liu, Deqiang Jiang, and Xing Sun. 2024. Enhancing visual document understanding with contrastive learning in large visual-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 15546–15555

work page 2024
[25]

Tsung-Yi Lin, Serge Belongie, and James Hays. 2013. Cross-view image geolo- calization. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 891–898

work page 2013
[26]

Eric Muller-Budack, Kader Pustu-Iren, and Ralph Ewerth. 2018. Geolocation estimation of photos using a hierarchical model and scene classification. In Proceedings of the European conference on computer vision (ECCV). 563–579

work page 2018
[27]

Shraman Pramanick, Ewa M Nowara, Joshua Gleason, Carlos D Castillo, and Rama Chellappa. 2022. Where in the world is this image? transformer-based geo-localization in the wild. InEuropean Conference on Computer Vision. Springer, 196–215

work page 2022
[28]

Alec Radford, Jong Wook Kim, and Christopheret al.Hallacy. 2021. Learning Transferable Visual Models from Natural Language Supervision. InProc. Int. Conf. on Machine Learning (ICML). 8748–8763

work page 2021
[29]

Noe Samano, Mengjie Zhou, and Andrew Calway. 2020. You are here: Geolocation by embedding maps and images. InEuropean Conference on Computer Vision. Springer, 502–518

work page 2020
[30]

Paul Hongsuck Seo, Tobias Weyand, Jack Sim, and Bohyung Han. 2018. Cplanet: Enhancing image geolocalization by combinatorial partitioning of maps. InPro- ceedings of the European Conference on Computer Vision (ECCV). 536–551

work page 2018
[31]

David G Shatwell, Ishan Rajendrakumar Dave, Sirnam Swetha, and Mubarak Shah. 2025. GT-Loc: Unifying When and Where in Images Through a Joint Embedding Space.arXiv preprint arXiv:2507.10473(2025)

work page arXiv 2025
[32]

Zhicheng Shi, Yang Li, Siyu Li, and Jiebo Luo. 2020. SAFA: Structure-Aware Feature Aggregation for Cross-View Image-Based Geo-Localization. InProc. ACM Int. Conf. on Multimedia (MM). 1633–1641. https://doi.org/10.1145/3394171. 3413569

work page doi:10.1145/3394171 2020
[33]

Kexin Tian, Jingrui Mao, Yunlong Zhang, Jiwan Jiang, Yang Zhou, and Zhengzhong Tu. 2025. Nuscenes-spatialqa: A spatial understanding and rea- soning benchmark for vision-language models in autonomous driving.arXiv preprint arXiv:2504.03164(2025)

work page arXiv 2025
[34]

Yicong Tian, Chen Chen, and Mubarak Shah. 2017. Cross-view image matching for geo-localization in urban environments. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3608–3616

work page 2017
[35]

Vicente Vivanco Cepeda, Gaurav Kumar Nayak, and Mubarak Shah. 2023. Geo- CLIP: Clip-Inspired Alignment between Locations and Images for Effective World- wide Geo-localization. InAdvances in Neural Information Processing Systems (NeurIPS)

work page 2023
[36]

Nam Vo and David W. Jacobs. 2017. Revisiting IM2GPS in the Deep Learning Era. arXiv preprint arXiv:1705.04838(2017)

work page internal anchor Pith review Pith/arXiv arXiv 2017
[37]

Zhiqiang Wang, Dejia Xu, Rana Muhammad Shahroz Khan, Yanbin Lin, Zhiwen Fan, and Xingquan Zhu. 2024. Llmgeo: Benchmarking large language models on image geolocation in-the-wild.arXiv preprint arXiv:2405.20363(2024)

work page arXiv 2024
[38]

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022. Chain-of-thought prompting elicits reasoning in large language models.Advances in neural information processing systems35 (2022), 24824–24837

work page 2022
[39]

Tobias Weyand, Ilya Kostrikov, and James Philbin. 2016. PlaNet - Photo Geoloca- tion with Convolutional Neural Networks. InEuropean Conference on Computer Vision (ECCV). Springer, 37–55. https://doi.org/10.1007/978-3-319-46484-8_3

work page doi:10.1007/978-3-319-46484-8_3 2016
[40]

Scott Workman, Richard Souvenir, and Nathan Jacobs. 2015. Wide-area image geolocalization with aerial reference imagery. InProceedings of the IEEE Interna- tional Conference on Computer Vision. 3961–3969

work page 2015
[41]

Meiliu Wu and Qunying Huang. 2022. Im2city: image geo-localization via multi- modal learning. InProceedings of the 5th ACM SIGSPATIAL International Workshop on AI for Geographic Knowledge Discovery. 50–61

work page 2022
[42]

Wenshan Wu, Shaoguang Mao, Yadong Zhang, Yan Xia, Li Dong, Lei Cui, and Furu Wei. 2024. Mind’s eye of LLMs: visualization-of-thought elicits spatial reasoning in large language models.Advances in Neural Information Processing Systems37 (2024), 90277–90317

work page 2024
[43]

Shixiong Xu, Chenghao Zhang, Lubin Fan, Gaofeng Meng, Shiming Xiang, and Jieping Ye. 2024. Addressclip: Empowering vision-language models for city-wide image address localization. InEuropean Conference on Computer Vision. Springer, 76–92

work page 2024
[44]

An Yan, Zhankui He, Jiacheng Li, Tianyang Zhang, and Julian McAuley. 2023. Personalized showcases: Generating multi-modal explanations for recommenda- tions. InProceedings of the 46th International ACM SIGIR Conference on Research Conference acronym ’XX, , Lingyao Li, Runlong Yu, Qikai Hu, Bowei Li, Min Deng, Yang Zhou, and Xiaowei Jia and Development in ...

[45]

Jihan Yang, Shusheng Yang, Anjali W Gupta, Rilyn Han, Li Fei-Fei, and Saining Xie. 2025. Thinking in space: How multimodal large language models see, remember, and recall spaces. In Proceedings of the Computer Vision and Pattern Recognition Conference. 10632–10643

[47]

Qiang Yi and Lianlei Shan. 2025. Geolocsft: Efficient visual geolocation via supervised fine-tuning of multimodal foundation models. arXiv preprint arXiv:2506.01277 (2025)

[48]

Wenping Yin, Yong Xue, Ziqi Liu, Hao Li, and Martin Werner. 2025. LLM-enhanced disaster geolocalization using implicit geoinformation from multimodal data: A case study of Hurricane Harvey. International Journal of Applied Earth Observation and Geoinformation 137 (2025), 104423

[49]

Amir Roshan Zamir and Mubarak Shah. 2014. Image geo-localization based on multiple nearest neighbor feature matching using generalized graphs. IEEE Transactions on Pattern Analysis and Machine Intelligence 36, 8 (2014), 1546–1558

[50]

Yanhua Zhong, Yuqiang Wu, Sheng Zheng, Yi Yang, and Zhiwu Ma. 2021. VIGOR: Cross-View Image Geo-Localization Beyond One-to-One Retrieval. In Proc. IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR). 8636– https://doi.org/10.1109/CVPR46437.2021.00853

[52]

Zhongliang Zhou, Jielu Zhang, Zihan Guan, Mengxuan Hu, Ni Lao, Lan Mu, Sheng Li, and Gengchen Mai. 2024. Img2Loc: Revisiting image geolocalization using multi-modality foundation models and image-based retrieval-augmented generation. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval. 2749–2754

[53]

Sijie Zhu, Mubarak Shah, and Chen Chen. 2022. Transgeo: Transformer is all you need for cross-view image geo-localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 1162–1171

A Sample Images from Benchmark Datasets

Sample images from our three datasets in the benchmark are presented in Figure 7.

B Data Distrib...

You are an AI assistant specialized in geocoding analysis from images

