arxiv: 2606.07172 · v1 · pith:EYOAYNDDnew · submitted 2026-06-05 · 💻 cs.CV · cs.AI · cs.CL · cs.LG

Textual Supervision Enhances Geospatial Representations in Vision-Language Models

Pith reviewed 2026-06-27 22:04 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.CL · cs.LG
keywords geospatial representations · vision-language models · textual supervision · image geolocation · spatial reasoning · multimodal learning · localizability · CLIP

The pith

Textual supervision improves how vision models learn to represent locations in images.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper compares vision-only models such as ViT with vision-language models such as CLIP and larger multimodal systems such as LLaVA, Qwen, and Gemma on their ability to handle geospatial information. It groups test images into clusters of people, landmarks, and objects according to how easily their locations can be inferred, then measures performance differences across these groups. The results indicate that models trained with text alongside images acquire stronger spatial representations than vision-only models, with the largest multimodal models performing best. A sympathetic reader would care because geospatial understanding underpins practical tasks like image geolocation and spatial reasoning, and the work points to language as a practical way to strengthen that capability. If correct, the finding supports shifting more training effort toward multimodal data for location-sensitive applications.

Core claim

Vision-language models acquire stronger geospatial representations than vision-only architectures, and large-scale multimodal foundation models show further gains; this pattern demonstrates that textual supervision functions as an effective complementary modality for encoding spatial context during learning.

What carries the argument

Evaluation across image clusters grouped by degree of localizability, which isolates how well each model family infers location from visual content alone.

If this is right

  • Vision-language models consistently outperform vision-only models on spatial accuracy across all localizability groups.
  • Larger multimodal foundation models extend the gains from textual supervision.
  • Language serves as a complementary signal that helps encode spatial information not fully captured by pixels alone.
  • Multimodal training constitutes a productive direction for improving geospatial capabilities in AI systems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Training pipelines that pair images with descriptive text may prove especially useful for downstream tasks requiring location inference.
  • The same textual supervision mechanism could be tested on other spatial domains such as indoor navigation or satellite imagery analysis.
  • If language helps with geospatial encoding, similar benefits might appear in related multimodal problems that involve relational reasoning over scenes.

Load-bearing premise

The chosen image clusters and their localizability grouping measure genuine differences in geospatial understanding rather than artifacts of training data or evaluation setup.

What would settle it

A controlled comparison in which a vision-only model, trained on the same data and at the same scale as a vision-language model, matches or exceeds it on the localizability-grouped clusters would falsify the enhancement claim.

Figures

Figures reproduced from arXiv: 2606.07172 by Bryan Nathanael Wijaya, Cheng Yaw Low, Fernando Tonucci, Jea Kwon, Luiz Felipe Vecchietti, Marcelo Sartori Locatelli, Meeyoung Cha, Virgilio Almeida.

Figure 1
Schematic illustration of our linear probing analysis setup. We fit ridge regression models W on the [CLS] or last token residuals A(l) from each layer l to investigate whether geospatial information (i.e., latitude and longitude) can be extracted from these hidden dimensions based on the R² value.
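The probing setup described in this caption can be illustrated with a minimal sketch; it is not the authors' implementation. It fits a closed-form ridge regression from a feature matrix (standing in for the [CLS] or last-token residuals of one layer) to a single coordinate and reports R² on held-out rows. The names `ridge_fit`, `predict`, `r2_score`, the regularization strength `lam`, and the synthetic data are all assumptions made for illustration, and only the standard library is used to keep the sketch dependency-free.

```python
import random

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def ridge_fit(X, y, lam=1.0):
    """Closed-form ridge on centered data: w = (X^T X + lam I)^-1 X^T y."""
    n, d = len(X), len(X[0])
    mu = [sum(row[j] for row in X) / n for j in range(d)]
    y_mean = sum(y) / n
    Xc = [[row[j] - mu[j] for j in range(d)] for row in X]
    yc = [v - y_mean for v in y]
    gram = [[sum(Xc[i][a] * Xc[i][b] for i in range(n)) + (lam if a == b else 0.0)
             for b in range(d)] for a in range(d)]
    rhs = [sum(Xc[i][a] * yc[i] for i in range(n)) for a in range(d)]
    w = solve(gram, rhs)
    bias = y_mean - sum(w[j] * mu[j] for j in range(d))
    return w, bias

def predict(X, w, bias):
    return [sum(wj * xj for wj, xj in zip(w, row)) + bias for row in X]

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Synthetic stand-in for one layer's activations and a latitude target (assumption).
rng = random.Random(0)
X = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(200)]
lat = [30.0 * row[0] - 10.0 * row[2] + rng.gauss(0, 1) for row in X]

w, bias = ridge_fit(X[:150], lat[:150], lam=1.0)
probe_r2 = r2_score(lat[150:], predict(X[150:], w, bias))
print(round(probe_r2, 3))
```

Repeating the fit once per layer and once per coordinate (latitude, longitude) would give a layer-wise R² profile of the kind the caption describes.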
Figure 2
Model performance measured by coefficient of determination R² across all models. The x-axis shows image clusters based on the YFCC100M dataset and the Google Landmarks dataset, and the y-axis lists the models compared. Higher R² values (darker colors in the heatmap) indicate better geolocation-prediction accuracy across clusters.
Figure 3
Illustration of image cluster localizability for each model, where each image serves as a representative of a cluster (Appendix B). The images to the right represent clusters with better localizability (higher R²). For high-performing models, highly localizable landmarks usually achieve the highest performance.
Figure 4
Probe R² performance by layer of the models for different clusters and datasets with varying levels of localizability. (a) R² performance when no textual prompt is given. (b) R² performance when adding a textual prompt to the input, asking the model to predict the image geolocation. The R² is kept stable throughout the last layers when compared to the decaying performance observed in the non-prompting …
Figure 5
Probe predictive performance R² as a function of the retained feature proportion p, illustrating the capacity of the embedding subspace needed to reach the maximum R² for both vision-only models and VLMs. Higher values of p correspond to a larger subset of the latent representation.
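One way to read this caption is as a curve of probe R² against the fraction p of features kept. The sketch below is a hedged illustration under stated assumptions, not the paper's procedure: the source text does not say how features are ranked or retained, so ranking by absolute covariance with the target (`rank_features`) is a hypothetical choice, as are the names `ridge_r2` and `r2_vs_proportion`, the value of `lam`, and the synthetic data. R² here is computed in-sample to keep the example short.

```python
import math
import random

def ridge_r2(X, y, lam=1e-3):
    """Fit ridge on centered data and return the in-sample R^2."""
    n, d = len(X), len(X[0])
    mu = [sum(r[j] for r in X) / n for j in range(d)]
    ym = sum(y) / n
    Xc = [[r[j] - mu[j] for j in range(d)] for r in X]
    yc = [v - ym for v in y]
    # Augmented normal equations [X^T X + lam I | X^T y].
    M = [[sum(Xc[i][a] * Xc[i][b] for i in range(n)) + (lam if a == b else 0.0)
          for b in range(d)] + [sum(Xc[i][a] * yc[i] for i in range(n))]
         for a in range(d)]
    for c in range(d):  # forward elimination with partial pivoting
        piv = max(range(c, d), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, d):
            f = M[r][c] / M[c][c]
            M[r] = [mr - f * mc for mr, mc in zip(M[r], M[c])]
    w = [0.0] * d
    for i in range(d - 1, -1, -1):  # back substitution
        w[i] = (M[i][d] - sum(M[i][j] * w[j] for j in range(i + 1, d))) / M[i][i]
    pred = [sum(wj * xj for wj, xj in zip(w, row)) for row in Xc]
    ss_res = sum((t - q) ** 2 for t, q in zip(yc, pred))
    ss_tot = sum(t ** 2 for t in yc)
    return 1.0 - ss_res / ss_tot

def rank_features(X, y):
    """Order feature indices by absolute covariance with y (assumed criterion)."""
    n, d = len(X), len(X[0])
    ym = sum(y) / n
    scores = []
    for j in range(d):
        mj = sum(r[j] for r in X) / n
        scores.append(abs(sum((r[j] - mj) * (v - ym) for r, v in zip(X, y))))
    return sorted(range(d), key=lambda j: -scores[j])

def r2_vs_proportion(X, y, proportions):
    """Map each retained proportion p to the R^2 of a probe on the top-p features."""
    order = rank_features(X, y)
    curve = {}
    for p in proportions:
        keep = order[:max(1, math.ceil(p * len(order)))]
        curve[p] = ridge_r2([[row[j] for j in keep] for row in X], y)
    return curve

# Synthetic data: only the first two of eight features carry signal (assumption).
rng = random.Random(0)
X = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(200)]
y = [3.0 * r[0] + 3.0 * r[1] + rng.gauss(0, 0.5) for r in X]
curve = r2_vs_proportion(X, y, [0.25, 0.5, 1.0])
for p, score in curve.items():
    print(p, round(score, 3))
```

In this toy setting the curve saturates early because the signal lives in a small subspace; a representation that spreads the signal over many dimensions would need a larger p to reach its maximum R².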
Figure 6
Schematic illustration showing that editing the geospatial representations (through dimension swapping) changes the perceived geolocation during token generation of VLMs. We demonstrated this finding on Qwen2.5-VL-3B with the methodology described in Section 4.5.
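The idea of dimension swapping in this caption can be sketched on plain vectors. This is only an illustration of the operation itself: how the paper selects the geospatial dimensions, and at which layer and token the swap is applied, is described in its Section 4.5 and is not reproduced here. The function name `swap_dimensions`, the index list `geo_dims`, and the toy vectors are hypothetical.

```python
def swap_dimensions(source, target, dims):
    """Return copies of two hidden-state vectors with the listed dimensions exchanged."""
    new_source, new_target = list(source), list(target)
    for d in dims:
        new_source[d], new_target[d] = target[d], source[d]
    return new_source, new_target

# Toy hidden states for two images; geo_dims marks dimensions assumed to carry location.
hidden_a = [0.1, 9.0, 0.3, 8.0]
hidden_b = [0.5, -7.0, 0.6, -6.0]
geo_dims = [1, 3]

edited_a, edited_b = swap_dimensions(hidden_a, hidden_b, geo_dims)
print(edited_a)  # [0.1, -7.0, 0.3, -6.0]
print(edited_b)  # [0.5, 9.0, 0.6, 8.0]
```

The inputs are left unmodified, so the original and edited representations can be compared side by side.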
Figure 7
Random samples for each of the 40 clusters obtained for the YFCC100M dataset.
Figure 8
Two-dimensional visualization of the 40 clusters obtained using UMAP and colored according to macro-category. For better visualization of the clusters obtained, we projected a small subset of 500 points per cluster into a 2D space using Uniform Manifold Approximation and Projection (UMAP) (McInnes et al., 2018).
Figure 9
Spatial distribution of all samples from our subset of YFCC100M.
Figure 10
World coverage of random sampling (left) and geocells-based sampling (right).
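The contrast between random and geocell-based sampling can be illustrated with a small sketch. It assumes a regular latitude/longitude grid as the geocell definition and a fixed cap per cell; the paper's actual geocell construction is not given in this text, so `geocell_sample`, `cell_deg`, `per_cell`, and the toy coordinates are hypothetical.

```python
import math
import random
from collections import defaultdict

def geocell_sample(points, cell_deg=10.0, per_cell=2, seed=0):
    """Draw at most `per_cell` points from each grid cell of size `cell_deg` degrees."""
    rng = random.Random(seed)
    cells = defaultdict(list)
    for lat, lon in points:
        key = (math.floor(lat / cell_deg), math.floor(lon / cell_deg))
        cells[key].append((lat, lon))
    sample = []
    for key in sorted(cells):  # sorted for a reproducible order
        members = cells[key]
        sample.extend(rng.sample(members, min(per_cell, len(members))))
    return sample

# Fifty photos clustered in one dense cell plus two isolated locations.
dense = [(48.0 + 0.01 * i, 2.0 + 0.01 * i) for i in range(50)]
sparse = [(-23.5, -46.6), (35.7, 139.7)]
sample = geocell_sample(dense + sparse)
print(len(sample))  # 4
```

Uniform random sampling over the same 52 points would draw almost entirely from the dense cell, whereas the capped per-cell draw keeps both isolated locations.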
Figure 11
Spatial distribution of all samples from our subset of Google Landmarks V2.
Figure 12
Marked SSI geo-bias values for all models on our subset of Google Landmarks V2.
Figure 13
More in-depth look into Gemma-3-12B-PT predictions in our subset of Google Landmarks V2. Median geodesic error per 4° × 4° grid (left). All predictions are colored by the ground truth continent (right).
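The left panel's quantity, median geodesic error per 4° × 4° grid cell, can be sketched as follows. This is an illustration under assumptions: the haversine great-circle distance on a spherical Earth is used as a stand-in for whatever geodesic distance the authors computed, cells are keyed by the ground-truth location, and the names `haversine_km`, `grid_cell`, `median_error_per_cell`, and the sample coordinates are hypothetical.

```python
import math
from collections import defaultdict
from statistics import median

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; spherical approximation

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))

def grid_cell(lat, lon, size_deg=4.0):
    """South-west corner of the size_deg x size_deg cell containing the point."""
    return (math.floor(lat / size_deg) * size_deg,
            math.floor(lon / size_deg) * size_deg)

def median_error_per_cell(truth, preds, size_deg=4.0):
    """Group errors by the ground-truth cell and take the median within each cell."""
    errors = defaultdict(list)
    for (lat, lon), (plat, plon) in zip(truth, preds):
        errors[grid_cell(lat, lon, size_deg)].append(haversine_km(lat, lon, plat, plon))
    return {cell: median(vals) for cell, vals in errors.items()}

# Toy ground-truth and predicted coordinates (assumptions, not paper data).
truth = [(37.5, 127.0), (37.6, 126.9), (-19.9, -43.9)]
preds = [(37.5, 127.0), (35.7, 139.7), (-23.5, -46.6)]
cell_errors = median_error_per_cell(truth, preds)
for cell, err in sorted(cell_errors.items()):
    print(cell, round(err, 1))
```

The median is less sensitive than the mean to a few predictions that land on the wrong continent, which is one plausible reason to report it per cell.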
Figure 14
Performance (R²) of each model when using a non-linear probe. The x-axis shows different clusters of the YFCC100M dataset and the Landmarks dataset, while the y-axis shows the models evaluated.
Figure 15
Performance (R²) of each model when using a linear probe and the concatenation of max and min pooling across input tokens as features. The x-axis shows different clusters of the YFCC100M dataset and the Landmarks dataset, while the y-axis shows the models evaluated.
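The feature construction named in this caption, max and min pooling across input tokens followed by concatenation, can be sketched in a few lines. The function name `max_min_pool`, the ordering (max first, then min), and the toy token matrix are assumptions for illustration; the paper's exact tensor layout is not given in this text.

```python
def max_min_pool(tokens):
    """Pool a (num_tokens x hidden_dim) list of vectors into one 2*hidden_dim feature.

    The first half holds the per-dimension maximum over tokens,
    the second half the per-dimension minimum.
    """
    columns = list(zip(*tokens))  # transpose: one tuple per hidden dimension
    return [max(c) for c in columns] + [min(c) for c in columns]

# Three tokens with a hidden size of two (toy values).
tokens = [[1.0, -2.0], [3.0, 0.0], [-1.0, 5.0]]
features = max_min_pool(tokens)
print(features)  # [3.0, 5.0, -1.0, -2.0]
```

Unlike a single summary token, this pooled vector draws on every token position, at the cost of doubling the probe's input dimension.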
Figure 16
R² for different prompting strategies on the landmarks dataset. Note how the use of prompts requiring geospatial awareness leads to increased R² across studied models towards the latter layers. Prompting strategies include City/Country (“What country and city was this picture taken in? Answer only with the city and country names.”) and Photo Location (“Where is this photo?”).
Figure 17
Training and validation loss for model fine-tuning.
Figure 18
Model performance measured by coefficient of determination R² across all models after the removal of images containing country and coordinate-related captions. The x-axis shows image clusters based on the YFCC100M dataset, and the y-axis lists the models compared. Higher R² values (darker colors in the heatmap) indicate better geolocation-prediction accuracy across clusters.
Original abstract

Geospatial understanding is a critical yet underexplored dimension in the development of machine learning systems for tasks such as image geolocation and spatial reasoning. In this work, we analyze the geospatial representations acquired by three model families: vision-only architectures (e.g., ViT), vision-language models (e.g., CLIP), and large-scale multimodal foundation models (e.g., LLaVA, Qwen, and Gemma). By evaluating across image clusters, including people, landmarks, and everyday objects, grouped based on the degree of localizability, we reveal systematic gaps in spatial accuracy and show that textual supervision enhances the learning of geospatial representations. Our findings suggest the role of language as an effective complementary modality for encoding spatial context and multimodal learning as a key direction for advancing geospatial AI.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper claims that textual supervision enhances geospatial representations in vision-language models. It supports this by comparing three model families—vision-only (e.g., ViT), vision-language (e.g., CLIP), and large-scale multimodal (e.g., LLaVA, Qwen, Gemma)—on image clusters (people, landmarks, everyday objects) grouped by localizability, revealing systematic gaps in spatial accuracy that are attributed to the presence of language supervision.

Significance. If substantiated, the finding would highlight language as a complementary modality for encoding spatial context and position multimodal learning as important for geospatial AI tasks such as image geolocation. The work identifies an underexplored dimension and suggests a concrete direction for model development.

major comments (2)
  1. [Abstract] The central claim attributes performance differences to textual supervision, yet the model families compared (ViT vs. CLIP vs. LLaVA/Qwen/Gemma) differ in parameter count, pretraining data volume/diversity, and objectives beyond language; no controlled ablations that hold vision backbone, data, and scale fixed while varying only textual supervision are described. This confound is load-bearing for the attribution.
  2. [Abstract] The localizability grouping of clusters (people, landmarks, objects) is presented as a measure of geospatial understanding, but no analysis addresses whether this grouping itself reflects training-data biases rather than intrinsic spatial capability; the abstract provides no quantitative results, metrics, error bars, or statistical tests to support the reported systematic gaps.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the opportunity to address these important points. We provide point-by-point responses below and outline planned revisions to strengthen the manuscript.

  1. Referee: [Abstract] The central claim attributes performance differences to textual supervision, yet the model families compared (ViT vs. CLIP vs. LLaVA/Qwen/Gemma) differ in parameter count, pretraining data volume/diversity, and objectives beyond language; no controlled ablations that hold vision backbone, data, and scale fixed while varying only textual supervision are described. This confound is load-bearing for the attribution.

    Authors: We acknowledge that the manuscript compares representative models from established families rather than conducting controlled ablations that fix the vision backbone, dataset, and scale while varying only the presence of textual supervision. The comparisons reflect real-world model families to illustrate observed trends in geospatial accuracy. We will add an explicit discussion of these confounds, including a limitations paragraph noting that factors such as scale and data diversity may contribute to the differences, while emphasizing that the consistent pattern across families still supports the role of language supervision as a complementary signal. revision: partial

  2. Referee: [Abstract] The localizability grouping of clusters (people, landmarks, objects) is presented as a measure of geospatial understanding, but no analysis addresses whether this grouping itself reflects training-data biases rather than intrinsic spatial capability; the abstract provides no quantitative results, metrics, error bars, or statistical tests to support the reported systematic gaps.

    Authors: The grouping is derived from the intrinsic properties of the depicted content (e.g., landmarks permit precise localization while people and generic objects do not). We agree that potential alignment with training-data biases was not explicitly analyzed and will add a short subsection examining this possibility, for instance by checking overlap with common pretraining corpora. We will also revise the abstract to include key quantitative metrics, error bars, and references to the statistical tests used to establish the systematic gaps. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical model comparison with independent evaluation

The paper is a purely empirical study comparing vision-only, vision-language, and multimodal models on geospatial tasks using image clusters grouped by localizability. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. The central claim rests on observed performance gaps rather than reducing to inputs by construction. External benchmarks (model families) are treated as given, with no self-definitional loops or ansatz smuggling. This matches the default expectation for non-circular empirical work.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Empirical analysis with no mathematical derivations, free parameters, axioms, or invented entities described in the abstract.

pith-pipeline@v0.9.1-grok · 5695 in / 827 out tokens · 15810 ms · 2026-06-27T22:04:21.261277+00:00 · methodology

Reference graph

Works this paper leans on

41 extracted references · 3 linked inside Pith

  1. [1] An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. International Conference on Learning Representations.
  2. [2] Masked autoencoders are scalable vision learners. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
  3. [3] Improved baselines with visual instruction tuning. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
  4. [4] Xu, Hu and Xie, Saining and Tan, Xiaoqing Ellen and Huang, Po-Yao and Howes, Russell and Sharma, Vasu and Li, Shang-Wen and Ghosh, Gargi and Zettlemoyer, Luke and Feichtenhofer, Christoph. Demystifying
  5. [5] Fan, David and Tong, Shengbang and Zhu, Jiachen and Sinha, Koustuv and Liu, Zhuang and Chen, Xinlei and Rabbat, Michael and Ballas, Nicolas and LeCun, Yann and Bar, Amir and Xie, Saining. Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025.
  6. [6] Deep residual learning for image recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
  7. [7] A mathematical framework for transformer circuits. Transformer Circuits Thread.
  8. [8] Attention is all you need. Advances in Neural Information Processing Systems.
  9. [9] Devlin, Jacob and Chang, Ming-Wei and Lee, Kenton and Toutanova, Kristina.
  10. [10] Emerging properties in self-supervised vision transformers. Proceedings of the IEEE/CVF International Conference on Computer Vision.
  11. [11] Sigmoid loss for language image pre-training. Proceedings of the IEEE/CVF International Conference on Computer Vision.
  12. [12] Gemma 3 technical report. arXiv preprint arXiv:2503.19786.
  13. [13] What do vision transformers learn? A visual exploration. arXiv preprint arXiv:2212.06727.
  14. [14] Learning transferable visual models from natural language supervision. International Conference on Machine Learning, 2021.
  15. [15] Maxime Oquab and Timoth. Transactions on Machine Learning Research.
  16. [16] Oriane Sim. Transactions on Machine Learning Research.
  17. [17] Zhou, Jinghao and Wei, Chen and Wang, Huiyu and Shen, Wei and Xie, Cihang and Yuille, Alan and Kong, Tao. Image
  18. [18] Visual instruction tuning. Advances in Neural Information Processing Systems.
  19. [19] Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923.
  20. [20] Language Models Represent Space and Time. 2024.
  21. [21] Roberts, Jonathan and L. Charting new territories: Exploring the geographic and geospatial capabilities of multimodal. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
  22. [22] Vivanco Cepeda, Vicente and Nayak, Gaurav Kumar and Shah, Mubarak.
  23. [23] Linear Representations of Political Perspective Emerge in Large Language Models. The Thirteenth International Conference on Learning Representations.
  24. [24] Understanding intermediate layers using linear classifier probes. 2017.
  25. [25] Thomee, Bart and Shamma, David A and Friedland, Gerald and Elizalde, Benjamin and Ni, Karl and Poland, Douglas and Borth, Damian and Li, Li-Jia. 2016.
  26. [26] Deng, Jia and Dong, Wei and Socher, Richard and Li, Li-Jia and Li, Kai and Fei-Fei, Li. 2009.
  27. [27] Who belongs in the family? Psychometrika, 1953.
  28. [28] Weyand, Tobias and Kostrikov, Ilya and Philbin, James. 2016.
  29. [29] Haas, Lukas and Skreta, Michal and Alberti, Silas and Finn, Chelsea.
  30. [30] Hays, James and Efros, Alexei A. 2008.
  31. [31] Google Landmarks Dataset v2 - a large-scale benchmark for instance-level recognition and retrieval. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
  32. [32] Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
  33. [33] Generalized Cross-Validation as a Method for Choosing a Good Ridge Parameter. Technometrics, 1979.
  34. [34] Han, Jiawei and Kamber, Micheline and Pei, Jian. 2011.
  35. [35] Moayeri, Mazda and Tabassi, Elham and Feizi, Soheil.
  36. [36] On the Scaling Laws of Geographical Representation in Language Models. Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024).
  37. [37] Geolocation Representation from Large Language Models are Generic Enhancers for Spatio-Temporal Learning. Proceedings of the AAAI Conference on Artificial Intelligence.
  38. [38] The Placing Task at MediaEval 2016. MediaEval Benchmarking Initiative for Multimedia Evaluation.
  39. [39] McInnes, Leland and Healy, John and Melville, James.
  40. [40] Wu, Nemin and Cao, Qian and Wang, Zhangyu and Liu, Zeping and Qi, Yanlin and Zhang, Jielu and Ni, Joshua and Yao, Xiaobai and Ma, Hongxu and Mu, Lan and others.
  41. [41] Position: The Platonic Representation Hypothesis. International Conference on Machine Learning.