Where Do Vision-Language Models Fail? World Scale Analysis for Image Geolocalization
Pith reviewed 2026-05-10 08:28 UTC · model grok-4.3
The pith
Vision-language models show wide variation in zero-shot country prediction from ground-view images.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Modern vision-language models exhibit substantial variation in performance on zero-shot, prompt-based country-level geolocalization of ground-view images across three diverse datasets. This variation demonstrates the potential of semantic reasoning for coarse geographic inference while revealing limitations in capturing fine-grained geographic cues.
What carries the argument
Zero-shot prompt-based country prediction on vision-language models evaluated across three geographically diverse ground-view image datasets.
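The protocol described here — a fixed prompt, no fine-tuning or retrieval, scored by exact country match — can be sketched in a few lines. The paper does not publish its exact prompt template or answer-parsing rules, so the prompt text and the `query_vlm` callable below are illustrative assumptions, not the authors' implementation:

```python
# Hypothetical stand-in prompt; the paper's exact template is not published.
PROMPT = "In which country was this photo taken? Answer with the country name only."

def predict_country(query_vlm, image):
    """Zero-shot country prediction: one prompt per image, no task-specific training.
    query_vlm is any callable (image, prompt) -> text reply."""
    reply = query_vlm(image, PROMPT)
    # Minimal normalization of a free-text reply to a country name.
    return reply.strip().rstrip(".")

def top1_accuracy(query_vlm, dataset):
    """dataset: list of (image, ground_truth_country) pairs."""
    correct = sum(
        predict_country(query_vlm, img).lower() == truth.lower()
        for img, truth in dataset
    )
    return correct / len(dataset)
```

In practice `query_vlm` would wrap a real VLM API; here a mock suffices to show the evaluation loop.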
If this is right
- Semantic reasoning in VLMs supports usable coarse country-level geolocalization from ground-view images.
- Current VLMs will continue to miss subtle geographic cues needed for finer distinctions.
- Performance differences across models will persist depending on architecture and training data.
- This zero-shot evaluation establishes a baseline for tracking progress in multimodal geographic reasoning.
Where Pith is reading between the lines
- Pairing VLMs with retrieval-based methods could compensate for their fine-grained weaknesses.
- Training data geographic imbalances likely contribute to the observed regional performance gaps.
- Extending the same prompt-based approach to city or landmark scale would test whether the coarse-to-fine pattern holds.
- Model choice will matter in any real-world system that uses VLMs for initial location filtering.
Load-bearing premise
The three chosen datasets and specific zero-shot prompts provide an unbiased test of VLM geographic inference without hidden selection effects or prompt sensitivity.
What would settle it
A new model achieving uniformly high accuracy across all three datasets or a different prompt producing reversed performance rankings on the same images would challenge the reported variation and limitations.
Original abstract
Image geolocalization has traditionally been addressed through retrieval-based place recognition or geometry-based visual localization pipelines. Recent advances in Vision-Language Models (VLMs) have demonstrated strong zero-shot reasoning capabilities across multimodal tasks, yet their performance in geographic inference remains underexplored. In this work, we present a systematic evaluation of multiple state-of-the-art VLMs for country-level image geolocalization using ground-view imagery only. Instead of relying on image matching, GPS metadata, or task-specific training, we evaluate prompt-based country prediction in a zero-shot setting. The selected models are tested on three geographically diverse datasets to assess their robustness and generalization ability. Our results reveal substantial variation across models, highlighting the potential of semantic reasoning for coarse geolocalization and the limitations of current VLMs in capturing fine-grained geographic cues. This study provides the first focused comparison of modern VLMs for country-level geolocalization and establishes a foundation for future research at the intersection of multimodal reasoning and geographic understanding.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents a systematic zero-shot evaluation of multiple state-of-the-art Vision-Language Models for country-level image geolocalization from ground-view imagery alone. It tests the models via prompt-based country prediction on three geographically diverse datasets, avoiding retrieval, GPS metadata, or task-specific training, and reports substantial performance variation across models while highlighting semantic reasoning strengths for coarse localization and limitations for fine-grained geographic cues.
Significance. If the empirical findings prove robust, this work supplies a useful baseline comparison for assessing geographic understanding in VLMs, an area relevant to visual navigation, cultural analysis, and multimodal AI safety. The zero-shot, multi-model, multi-dataset design provides concrete evidence of current capabilities without confounding factors from fine-tuning, and the focus on failure modes offers actionable directions for improving semantic geographic reasoning.
major comments (2)
- The evaluation protocol section does not include the exact prompt templates, model versions, or decoding parameters used for country prediction. This detail is load-bearing for the central claim of 'substantial variation' because prompt sensitivity could alter the observed differences between models, directly affecting the weakest assumption that the chosen prompts provide an unbiased test.
- Results section and associated tables lack per-dataset accuracy breakdowns, confidence intervals, or error analysis (e.g., which countries are systematically confused). Without these, the claim that VLMs capture coarse but not fine-grained cues cannot be quantitatively verified and risks post-hoc interpretation of the observed outputs.
minor comments (2)
- The abstract would benefit from one or two key quantitative results (e.g., best-model accuracy range) to give readers an immediate sense of effect size.
- Figure captions and axis labels in the performance comparison plots should explicitly state the metric (top-1 country accuracy) and whether results are averaged across datasets.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and will incorporate revisions to improve reproducibility and quantitative support for our claims.
read point-by-point responses
-
Referee: The evaluation protocol section does not include the exact prompt templates, model versions, or decoding parameters used for country prediction. This detail is load-bearing for the central claim of 'substantial variation' because prompt sensitivity could alter the observed differences between models, directly affecting the weakest assumption that the chosen prompts provide an unbiased test.
Authors: We agree that these implementation details are critical for reproducibility and for validating that the reported model variations are not artifacts of prompt choice. In the revised manuscript, we will add a new subsection to the evaluation protocol that lists the exact prompt templates (including any system prompts or few-shot examples if used), the precise model versions and checkpoints, and all decoding parameters such as temperature, top-p, and maximum token limits. This addition will directly address the concern and allow independent verification of the results. revision: yes
-
Referee: Results section and associated tables lack per-dataset accuracy breakdowns, confidence intervals, or error analysis (e.g., which countries are systematically confused). Without these, the claim that VLMs capture coarse but not fine-grained cues cannot be quantitatively verified and risks post-hoc interpretation of the observed outputs.
Authors: We acknowledge that the current results presentation would benefit from greater granularity to strengthen the evidence for our interpretations. We will revise the results section to include per-dataset accuracy tables, bootstrap-derived confidence intervals for all reported metrics, and an expanded error analysis subsection. This analysis will quantify common country-level confusions (e.g., neighboring countries or those sharing visual or semantic features) and explicitly link them to the distinction between coarse semantic reasoning and fine-grained geographic cues. Representative confusion examples and aggregate statistics will be provided to keep the presentation concise while improving verifiability. revision: yes
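The bootstrap-derived confidence intervals the authors commit to can be computed with a standard percentile bootstrap over per-image correctness indicators. A minimal sketch; the resample count and seed are illustrative choices, not values from the paper:

```python
import random

def bootstrap_ci(outcomes, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for accuracy.

    outcomes: list of 0/1 per-image correctness indicators.
    Returns (lower, upper) bounds of the (1 - alpha) interval.
    """
    rng = random.Random(seed)
    n = len(outcomes)
    # Resample with replacement, recompute accuracy each time, then sort.
    stats = sorted(sum(rng.choices(outcomes, k=n)) / n for _ in range(n_boot))
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi
```

For example, 80 correct out of 100 images yields a 95% interval of roughly (0.72, 0.88), which is exactly the kind of uncertainty the revised tables would report.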
Circularity Check
No circularity: empirical evaluation only
full rationale
The paper is a zero-shot empirical study that runs existing VLMs on three public datasets with fixed prompts and reports observed country-level prediction accuracies. No equations, derivations, fitted parameters, or self-citation chains are present. All claims rest on direct model outputs rather than any reduction to author-defined quantities or prior self-referential results.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: the three geographically diverse datasets provide a fair test of country-level inference without selection bias.
Reference graph
Works this paper leans on
- [1] Relja Arandjelovic, Petr Gronat, Akihiko Torii, Tomas Pajdla, and Josef Sivic. NetVLAD: CNN architecture for weakly supervised place recognition. CVPR, pages 5297–5307, 2016.
- [2] Guillaume Astruc, Nicolas Dufour, Ioannis Siglidis, Constantin Aronssohn, Nacim Bouia, Stephanie Fu, Romain Loiseau, Van Nguyen Nguyen, Charles Raude, Elliot Vincent, Lintao Xu, Hongyu Zhou, and Loic Landrieu. OpenStreetView-5M: The Many Roads to Global Visual Geolocation. CVPR, 2024.
- [3] Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-VL: A versatile vision-language model for understanding, localization, text reading, and beyond. arXiv:2308.12966, 2023.
- [4] Gabriele Berton, Riccardo Mereu, Gabriele Trivigno, Carlo Masone, Gabriela Csurka, Torsten Sattler, and Barbara Caputo. Deep visual geo-localization benchmark. CVPR, pages 5396–5407, 2022.
- [5] Gabriele Moreno Berton, Valerio Paolicelli, Carlo Masone, and Barbara Caputo. Adaptive-attentive geolocalization from few queries: A hybrid approach. WACV, pages 2918–2927, 2021.
- [6] Sudong Cai, Yulan Guo, Salman Khan, Jiwei Hu, and Gongjian Wen. Ground-to-aerial image geo-localization with a hard exemplar reweighting triplet loss. ICCV, pages 8391–8400, 2019.
- [7] Ron Campos, Ashmal Vayani, Parth Parag Kulkarni, Rohit Gupta, Aritra Dutta, and Mubarak Shah. GAEA: A geolocation aware conversational assistant, 2025.
- [8] Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, et al. Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling. arXiv:2412.05271, 2024.
- [9] Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. CVPR, pages 24185–24198, 2024.
- [10] Jiankang Deng, Jia Guo, Evangelos Ververas, Irene Kotsia, and Stefanos Zafeiriou. RetinaFace: Single-shot multi-level face localisation in the wild. CVPR, pages 5203–5212, 2020.
- [11] Yixiao Ge, Haibo Wang, Feng Zhu, Rui Zhao, and Hongsheng Li. Self-supervising fine-grained region similarities for large-scale image localization. ECCV, pages 369–386. Springer, 2020.
- [12] Albert Gordo, Jon Almazan, Jerome Revaud, and Diane Larlus. End-to-end learning of deep visual representations for image retrieval. International Journal of Computer Vision, 124(2):237–254, 2017.
- [13] Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. GPT-4o system card. arXiv:2410.21276, 2024.
- [14] Hervé Jégou, Florent Perronnin, Matthijs Douze, Jorge Sánchez, Patrick Pérez, and Cordelia Schmid. Aggregating local image descriptors into compact codes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(9):1704–1716, 2011.
- [15] Hyo Jin Kim, Enrique Dunn, and Jan-Michael Frahm. Learned contextual feature reweighting for image geo-localization. CVPR, pages 2136–2145, 2017.
- [16] Rohan K. GeoLocation – GeoGuessr Images (50K). Kaggle. Accessed: 2025.
- [18] Parth Parag Kulkarni, Gourav Kumar Nayak, and Mubarak Shah. CityGuessr: City-level video geo-localization on a global scale. ECCV, pages 289–306. Springer, 2024.
- [19] Guopeng Li, Ming Qian, and Gui-Song Xia. Unleashing unlabeled data: A paradigm for cross-view geo-localization. CVPR, pages 16719–16729, 2024.
- [20] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. ICML, pages 19730–19742. PMLR, 2023.
- [21] Dongfang Liu, Yiming Cui, Liqi Yan, Christos Mousas, Baijian Yang, and Yingjie Chen. DenserNet: Weakly supervised visual localization using multi-scale feature aggregation. AAAI, pages 6101–6109, 2021.
- [22] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. NeurIPS, 36:34892–34916, 2023.
- [23] Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. LLaVA-NeXT: Improved reasoning, OCR, and world knowledge, 2024.
- [24] Yang Mu, Zhitong Xiong, Yi Wang, Muhammad Shahzad, Franz Essl, Holger Kreft, Mark van Kleunen, and Xiao Xiang Zhu. GlobalGeoTree: A multi-granular vision-language dataset for global tree species classification. Earth System Science Data, 18(2):1379–1403, 2026.
- [25] Florent Perronnin and Christopher Dance. Fisher kernels on visual vocabularies for image categorization. CVPR, pages 1–8. IEEE, 2007.
- [26] Pouya Pezeshkpour and Estevam Hruschka. Large language models sensitivity to the order of options in multiple-choice questions. arXiv, 2023.
- [27] James Philbin, Ondrej Chum, Michael Isard, Josef Sivic, and Andrew Zisserman. Object retrieval with large vocabularies and fast spatial matching. CVPR, pages 1–8. IEEE, 2007.
- [28] Noé Pion, Martin Humenberger, Gabriela Csurka, Yohann Cabon, and Torsten Sattler. Benchmarking image retrieval for visual localization. 3DV, pages 483–494. IEEE, 2020.
- [29] Zhaofang Qian, Hardy Chen, Zeyu Wang, Li Zhang, Zijun Wang, Xiaoke Huang, Hui Liu, Xianfeng Tang, Zeyu Zheng, Haoqin Tu, et al. Where on earth? A vision-language benchmark for probing model geolocation skills across scales. arXiv:2510.10880, 2025.
- [30] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. ICML, pages 8748–8763. PMLR, 2021.
- [31] Jonathan Roberts, Timo Lüddecke, Rehan Sheikh, Kai Han, and Samuel Albanie. Charting new territories: Exploring the geographic and geospatial capabilities of multimodal LLMs. arXiv:2311.14656, 2024.
- [32] Royston Rodrigues and Masahiro Tani. SemGeo: Semantic keywords for cross-view image geo-localization. ICASSP, pages 1–5. IEEE, 2023.
- [33] Akshay Shetty and Grace Xingxin Gao. UAV pose estimation using cross-view geolocalization with satellite imagery. ICRA, pages 1827–1833. IEEE, 2019.
- [34] Zhijie Tan, Xu Chu, Weiping Li, and Tong Mo. Order matters: Exploring order sensitivity in multimodal large language models. arXiv, 2024.
- [35] Lijie Tao, Haokui Zhang, Haizhao Jing, Yu Liu, Dawei Yan, Guoting Wei, and Xizhe Xue. Advancements in vision-language models for remote sensing: Datasets, capabilities, and enhancement techniques. Remote Sensing, 17(1):162.
- [36] Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv:2403.05530, 2024.
- [37] Qwen Team. Qwen3 technical report, 2025.
- [38] Nam N Vo and James Hays. Localizing and orienting street views using overhead imagery. ECCV, pages 494–509. Springer, 2016.
- [39] Shruti Vyas, Chen Chen, and Mubarak Shah. GAMa: Cross-view video geo-localization. ECCV, pages 440–456. Springer, 2022.
- [40] Chun Wang, Xiaojun Ye, Xiaoran Pan, Zihao Pan, Haofan Wang, and Yiren Song. GRE Suite: Geo-localization inference via fine-tuned vision-language models and enhanced reasoning chains. arXiv:2505.18700, 2025.
- [41] Scott Workman and Nathan Jacobs. On the location dependence of convolutional neural network features. CVPR Workshops, pages 70–78, 2015.
- [42] Junyan Ye, Honglin Lin, Leyan Ou, Dairong Chen, Zihao Wang, Qi Zhu, Conghui He, and Weijia Li. Where am I? Cross-view geo-localization with natural language descriptions. ICCV, pages 5890–5900, 2025.
- [43] Xiaohan Zhang, Waqas Sultani, and Safwan Wshah. Cross-view image sequence geo-localization. WACV, pages 2914–2923, 2023.
- [44] Yushuo Zheng, Jiangyong Ying, Huiyu Duan, Chunyi Li, Zicheng Zhang, Jing Liu, Xiaohong Liu, and Guangtao Zhai. GeoX-Bench: Benchmarking cross-view geo-localization and pose estimation capabilities of large multimodal models. arXiv:2511.13259, 2025.

Supplementary material
Qualitative Analysis. Table 7 presents qualitative examples of neighboring-country confusion produced by Qwen3-VL-4B-Instruct under a blind two-turn prompting setup. Three image pairs are shown, each drawn from geographically close countries: (1) Pakistan vs. India, (2) Bangladesh vs. India, and (3) USA vs. Canada. For each pair, the model is given the same prompt: "I am showing you two images from two different locations in the world. For each image, predict the top 5 countries it is most likely from. Respond only in JSON format."
Recoverable rows from Table 7 (ground-truth country, visual cue, and model behavior):
- Canada: stone arch + arid landscape → defaulted to India
- Canada: ancient temple, stone fortification → India ranked #1
- Argentina: street sign + American-style lamppost misled the model
- Vietnam: dense tropical forest → South/Southeast Asia
- Mexico: suburban homes + fire hydrant → defaulted to USA
- Mexico: fire hydrant + utility poles → USA ranked #1
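An error analysis of this kind reduces to tallying (ground truth, prediction) pairs over misclassified images. A minimal sketch, with toy records loosely echoing the confusions above (the records themselves are illustrative, not data from the paper):

```python
from collections import Counter

def confusion_pairs(records):
    """Count (ground_truth -> predicted) pairs among misclassified images.

    records: list of (ground_truth_country, predicted_country) tuples.
    Correct predictions are excluded from the tally.
    """
    return Counter((gt, pred) for gt, pred in records if gt != pred)

# Toy records (hypothetical) echoing the neighboring-country confusions above.
records = [
    ("Canada", "India"), ("Canada", "India"),
    ("Mexico", "USA"), ("Mexico", "Mexico"),
    ("Vietnam", "India"),
]
top = confusion_pairs(records).most_common(1)[0]
# top is the most frequent confusion pair with its count.
```

Sorting such a tally surfaces systematic confusions (e.g., neighboring countries) at a glance.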
Biome Label Distribution. Table 8 summarizes the distribution of consensus biome labels across the three datasets, considering only samples on which all three annotators (CLIP, SigLIP, and Qwen3-VL-4B) agree. It reports the frequency of each biome category under this strict consensus criterion, providing an overview of the recognized environmental classes across datasets.
Biome Classification Prompts. Table 9 lists the text prompts used for zero-shot biome classification (e.g., "a tropical rainforest or jungle scene"). For CLIP and SigLIP, each prompt is encoded as a text embedding and the biome with the highest cosine similarity to the image embedding is assigned. For Qwen3-VL-4B, the six category names and descriptions are provided directly and the model is asked to select one.
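The CLIP/SigLIP assignment rule described above — pick the biome whose prompt embedding is most cosine-similar to the image embedding — can be sketched as follows. The 3-dimensional toy embeddings are hypothetical stand-ins for real model features:

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def zero_shot_classify(image_emb, prompt_embs):
    """prompt_embs: {biome_label: text embedding of its prompt}.
    Returns the biome whose prompt embedding is most similar to the image."""
    return max(prompt_embs, key=lambda lbl: cosine(image_emb, prompt_embs[lbl]))

# Toy embeddings (hypothetical); real CLIP features are high-dimensional.
prompt_embs = {
    "arid": [1.0, 0.0, 0.0],
    "tropical": [0.0, 1.0, 0.0],
    "tundra": [0.0, 0.0, 1.0],
}
zero_shot_classify([0.1, 0.9, 0.2], prompt_embs)  # → "tropical"
```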
Top-5 Biome Accuracy. Table 10 presents the Top-5 accuracy of nine Vision-Language Models across three datasets and six environmental biomes: Arid, Boreal, Mediterranean, Temperate, Tropical, and Tundra. The results show dominant performance by the Qwen3-VL family, particularly the 2B and 4B variants, which achieve the highest scores in most categories.
Preliminary LoRA Fine-Tuning. To assess parameter-efficient fine-tuning, we fine-tune InternVL2.5-1B using LoRA (r=8, α=32, all-linear modules) on a top-20 country subset of GeoGuessr-50k (12,116 train / 3,030 test images), training for 2 epochs at a learning rate of 5×10⁻⁵. Top-1 accuracy improves from 40.40% to 79.44% (+39.04 pp). The gain is larger for rural scenes (+45.9…).
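LoRA, as used in the preliminary fine-tuning above, freezes the base weight W and trains only a low-rank update B·A scaled by α/r. A minimal pure-Python sketch of the effective weight (not the authors' training code; the toy matrices below use r=1 for brevity, whereas the paper uses r=8, α=32):

```python
def matmul(A, B):
    """Plain nested-list matrix product."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def lora_weight(W, A, B, r=8, alpha=32):
    """Effective weight of a LoRA-adapted linear layer: W + (alpha / r) * B @ A.

    Only A (r x in_dim) and B (out_dim x r) are trained; W stays frozen.
    B is initialized to zeros, so training starts from the base model exactly.
    """
    scale = alpha / r
    delta = matmul(B, A)
    return [[w + scale * d for w, d in zip(wr, dr)] for wr, dr in zip(W, delta)]

# Toy example with r=1: zero-initialized B leaves W unchanged.
W = [[1.0, 2.0], [3.0, 4.0]]
A = [[0.5, 0.5]]          # r x in_dim
B = [[0.0], [0.0]]        # out_dim x r, zero-initialized
lora_weight(W, A, B, r=1)  # → identical to W at initialization
```

The appeal is the parameter count: per adapted layer, LoRA trains r·(in_dim + out_dim) values instead of in_dim·out_dim.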
Full Urban/Rural and GER Tables. Table 11 extends Table 4 (main paper) with per-cell standard deviations. Urban accuracy is highly stable across labellers (std < 1 pp for most cells), while rural variance reaches 5–6 pp, confirming that annotator disagreement concentrates on the rural boundary. Top-5 gaps are consistently narrower than Top-1 (e.g., for Qwen3-V…).