pith. sign in

arxiv: 2606.24997 · v1 · pith:LFCIGRLRnew · submitted 2026-06-23 · 💻 cs.LG

What's in an Earth Embedding? An Explainability Analysis of Location Encoders

Pith reviewed 2026-06-26 00:13 UTC · model grok-4.3

classification 💻 cs.LG
keywords location embeddingsgeographic INRsexplainabilitysparse autoencodersSpLiCECLIP Surgerygeospatial representationsinterpretability
0
0 comments X

The pith

Location embeddings from geographic neural networks decompose into human-interpretable features like forests, deserts, and urban structures while keeping high reconstruction accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether location embeddings produced by geographic implicit neural representations contain recoverable semantic and visual information about the Earth. It applies three decomposition techniques to these embeddings: sparse autoencoders to extract latent concepts, SpLiCE over a geospatial dictionary to recover natural language concepts, and CLIP Surgery to generate visual saliency maps. The decompositions succeed in isolating recognizable geographic patterns while preserving the ability to reconstruct the original embeddings. Different methods surface complementary scales of information, from fine urban details to broad biome and climate signals. The results supply auditing tools for what off-the-shelf geospatial embeddings actually encode.

Core claim

Geographic INRs learn location embeddings that encode geospatial data and can be decomposed into sparse latent concepts via sparse autoencoders, natural language concepts via SpLiCE over a predefined geospatial dictionary, and visual features via CLIP Surgery saliency maps; these decompositions retain high reconstruction capability and expose interpretable structures such as forests, deserts, and urban features, with systematic differences across methods that range from urban structures to broader biome and climate signals.

What carries the argument

Decomposition of location embeddings into sparse latent concepts, natural language concepts, and visual saliency features using sparse autoencoders, SpLiCE, and CLIP Surgery.

If this is right

  • Sparse decompositions reveal systematic differences in encoded information, from local urban structures to wider biome and climate signals.
  • Saliency maps from pretraining space highlight complementary visual features such as roads and landmarks.
  • Location embeddings can be audited for the geographic or semantic information they capture.
  • Human-interpretable representations can be obtained from the embeddings without major loss of reconstruction fidelity.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same decomposition approach could be tested on embeddings from other coordinate-based models to check whether interpretability generalizes beyond geographic INRs.
  • Practitioners could select among embedding methods by inspecting which geographic scales each one emphasizes in its decompositions.
  • If the recovered concepts align with real satellite or map data, the embeddings may be carrying implicit supervision from the pretraining distribution.
  • Downstream tasks that rely on these embeddings might gain from an intermediate step that filters or weights the extracted concepts.

Load-bearing premise

The chosen decomposition techniques recover the actual semantic and visual content present in the embeddings without introducing substantial artifacts or omissions.

What would settle it

Reconstruction error rises sharply or known geographic features such as forests disappear from the extracted concepts when the same decomposition methods are applied to the embeddings.

Figures

Figures reproduced from arXiv: 2606.24997 by Adam J. Stewart, Esther Rolf, Ivica Obadic, Livia Betti, Sebastian Ricke.

Figure 1
Figure 1. Figure 1: Overview of our decomposition framework for geographic INRs. A geographic INR maps coordinate inputs to a location embedding (top). We aim to interpret pretrained location encoders through three complementary explainability methods (bottom): sparse latent concepts, text decomposition, and feature attribution. single input image, but rather to a specific latitude/longitude coordinate, and currently, there i… view at source ↗
Figure 2
Figure 2. Figure 2: Qualitative analysis of selected neurons from the top-10 ranked by visual (top and middle row) and geospatial (bottom row) monosemanticity for SAE reconstruction on the Climplicit encoder. The left panels display images from the top-10 activating locations from the S2-100k and Geo-YFCC datasets for selected neurons, with the MS values indicating their visual monosemanticity. The corresponding spatial distr… view at source ↗
Figure 3
Figure 3. Figure 3: SpLiCE decompositions highlight different signals captured by location embeddings. Bar plots show the concept weights of the top 4 concepts in SpLiCE decompositions for GeoCLIP (top) and SatCLIP (bottom) across four locations representing different land cover types: Paris, Sahara, Siberia, and Bali. SpLiCE decomposition weights are averaged over a 0.5 ◦ grid within each region’s boundary. Results for the o… view at source ↗
Figure 4
Figure 4. Figure 4: Natural image saliency maps generated using GeoCLIP and CLIP Surgery, grouped into four categories based on the visual cues the model attends to for geolocation prediction, which include a) landmarks: recognizable architectural or cultural structures, b) landscapes: natural features such as terrain, vegetation patterns, and sky, c) text: textual cues embedded in the scene, such as signage or place names, a… view at source ↗
Figure 5
Figure 5. Figure 5: Overview of the visual element extraction pipeline for satellite location embeddings Starting from a raw image, we compute CLIP-Surgery saliency maps, threshold them into binary masks, normalize and split regions by area, fit bounding boxes around salient regions, and finally cluster all boxes across images within a fixed spatial radius. (a) Paris (b) Amsterdam (c) New York City (d) St. Louis [PITH_FULL_I… view at source ↗
Figure 6
Figure 6. Figure 6: Extracted visual elements from satellite saliency maps for four cities: Paris, Amster￾dam, New York City and St. Louis. Although clusters are derived independently per city, similar clusters were manually aligned across cities for comparison. Original saliency map examples for each city can be found at Appendix Section 7.5.3. embeddings and applying dimensionality reduction and clustering via t-SNE [52] an… view at source ↗
Figure 7
Figure 7. Figure 7: UMAP visualizations of GeoCLIP and SatCLIP embeddings (with respective concept sets). Location embeddings are generated from a dense grid of 100, 000 points sampled uniformly at random over the landmasses. Evaluation Regions for SpLiCE Concepts. We define four geographic regions using publicly available shapefiles. The Russia and Indonesia shapefiles are sourced from GADM (https:// gadm.org/download_countr… view at source ↗
Figure 8
Figure 8. Figure 8: Visual artifacts from the S2-100k dataset around the Arctic region identified by ana￾lyzing the images for the strongest activating samples for neurons ranked among the top-10 visual monosemanticity values. 7.5.2 SpLiCE Natural Language Explanations Paris Sahara Siberia Bali Climplicit CSP-fMoW [PITH_FULL_IMAGE:figures/full_fig_p020_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: SpLiCE decompositions for Climplicit and CSP-fMoW. Top-4 concepts in SpLiCE decompositions for Climplicit (top) and CSP-fMoW (bottom) across four regions (chosen to represent different land cover types). All decompositions use a shared concept basis derived from Git-10M. We first provide SpLiCE decomposition results for additional location encoders Climplicit and CSP-fMoW in [PITH_FULL_IMAGE:figures/full_… view at source ↗
Figure 10
Figure 10. Figure 10: SpLiCE results on a smaller geospatial concept set, containing roughly 200 concepts, on GeoCLIP, SatCLIP, Climplicit, and CSP-fMoW Using a smaller geospatial concept set biases the natural language decompositions to contain solely the geographic information, with tradeoffs for how much fine-grained information can be extracted. 6 [PITH_FULL_IMAGE:figures/full_fig_p021_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: CLIP Surgery saliency maps generated from Sentinel-2 (S2) imagery using the Sat￾CLIP location and image encoders. The first row shows desert-like land cover from geographically distinct yet visually similar regions — the Atacama and African Deserts — while the second row shows tropical rainforest-like land cover from the Congo Basin and Amazon. Though attention to water edges and rivers emerges in rainfor… view at source ↗
Figure 12
Figure 12. Figure 12: Satellite image saliency maps for four cities: Paris, Amsterdam , St. Louis and New York City. While the model’s attention tends to highlight structural boundaries such as roads and waterways, the maps are difficult to interpret directly due to their scale and resolution. We therefore introduce an additional step of cropping and clustering to extract more interpretable visual patterns. (a) Florence (b) Pa… view at source ↗
Figure 13
Figure 13. Figure 13: Natural image saliency maps taken from Im2GPS dataset from four locations: Florence, Paris, Seoul, Peru, shown without (top) and with (bottom) CLIP Surgery applied. Confusing highlights on background elements, a known issue in CLIP-based models saliency maps, are noticeably reduced when CLIP Surgery is applied. 8 [PITH_FULL_IMAGE:figures/full_fig_p023_13.png] view at source ↗
Figure 14
Figure 14. Figure 14: Spatial clustering of salient region crops within the Paris area. Crop embeddings were computed using a ResNet-50 pretrained on ImageNet weights, reduced to 2D via t-SNE, and clustered using DBSCAN. Colors denote distinct clusters. 9 [PITH_FULL_IMAGE:figures/full_fig_p024_14.png] view at source ↗
read the original abstract

Geographic implicit neural representations (INRs) learn to map any coordinate on Earth to a location embedding, implicitly encoding geospatial data into the weights of a neural network. Location embeddings are widely used off the shelf as general-purpose geospatial representations, yet users lack principled tools to audit what geographic or semantic information these embeddings capture. In this work, we analyze the information content of geographic INRs through their location embeddings. We decompose these embeddings into human-interpretable features$\unicode{x2014}$namely, (i) sparse latent concepts, (ii) natural language concepts, and (iii) visual features. The latent concept embeddings are learned using sparse autoencoders. To recover natural language concepts, we apply sparse linear concept embeddings (SpLiCE) over a predefined geospatial dictionary. Finally, visual features are extracted using saliency maps derived from CLIP Surgery. We show that location embeddings can be decomposed into human-interpretable representations while retaining high reconstruction capability, revealing interpretable geographic structures such as forests, deserts, and urban features. Across methods, sparse decompositions expose systematic differences in encoded information, ranging from urban structures to broader biome and climate signals, and pretraining-space saliency maps further highlight complementary features such as roads and landmarks. We hope this work provides a first step toward interpretable geospatial representations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper claims that location embeddings from geographic implicit neural representations (INRs) can be decomposed into human-interpretable features using (i) sparse autoencoders for latent concepts, (ii) SpLiCE over a predefined geospatial dictionary for natural language concepts, and (iii) CLIP Surgery saliency maps for visual features. It asserts that these decompositions retain high reconstruction capability while revealing interpretable geographic structures such as forests, deserts, and urban features, and that sparse decompositions expose systematic differences in encoded information ranging from urban structures to biome and climate signals.

Significance. If the central claims hold with supporting evidence, the work provides a useful first step toward auditing and interpreting widely used location embeddings as general-purpose geospatial representations. The multi-method decomposition approach (latent, linguistic, and visual) offers complementary views on encoded information and could support more trustworthy geospatial AI applications.

major comments (1)
  1. [Abstract] Abstract: the assertion that 'decompositions retain high reconstruction capability' and 'reveal interpretable geographic structures such as forests, deserts, and urban features' supplies no quantitative metrics, ablation results, dataset details, or reconstruction error values. This is load-bearing for the central claim that the chosen techniques faithfully recover semantic and visual content without substantial artifacts.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their positive assessment of the work's significance and for highlighting this important point about the abstract. We agree that the abstract's claims would be strengthened by including quantitative support and will revise it accordingly.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the assertion that 'decompositions retain high reconstruction capability' and 'reveal interpretable geographic structures such as forests, deserts, and urban features' supplies no quantitative metrics, ablation results, dataset details, or reconstruction error values. This is load-bearing for the central claim that the chosen techniques faithfully recover semantic and visual content without substantial artifacts.

    Authors: We agree that the abstract would benefit from quantitative grounding. The body of the manuscript reports reconstruction metrics, ablation studies, and dataset details supporting the decompositions' fidelity (e.g., low reconstruction error while recovering interpretable structures). However, these specifics are not summarized in the abstract. In the revised version we will update the abstract to include key quantitative results, such as reconstruction error values and dataset references, to make the central claims more self-contained. revision: yes

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper applies three previously published external decomposition techniques (sparse autoencoders, SpLiCE over a geospatial dictionary, and CLIP Surgery saliency) to location embeddings produced by geographic INRs. No load-bearing step reduces a reported finding to a quantity defined by the authors' own fitted parameters, self-citation chain, or ansatz smuggled via prior work by the same authors. The central claim that the decompositions retain reconstruction capability and expose structures such as forests and deserts rests on the fidelity of these independent methods rather than on any self-referential definition or renaming of results.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The work rests on the domain assumption that INRs encode geospatial information and on the applicability of standard explainability tools; no new free parameters, axioms beyond standard ML practice, or invented entities are introduced.

axioms (1)
  • domain assumption Geographic implicit neural representations encode geospatial data into the weights of a neural network via location embeddings.
    Stated as the premise of the entire analysis in the abstract.

pith-pipeline@v0.9.1-grok · 5776 in / 1234 out tokens · 26179 ms · 2026-06-26T00:13:40.000950+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

67 extracted references · 18 canonical work pages

  1. [1]

    GPT-4 technical report.arXiv preprint arXiv:2303.08774, 2023

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report.arXiv preprint arXiv:2303.08774, 2023

  2. [2]

    Fairness and representation in satellite- based poverty maps: evidence of urban-rural disparities and their impacts on downstream policy

    Emily Aiken, Esther Rolf, and Joshua Blumenstock. Fairness and representation in satellite- based poverty maps: evidence of urban-rural disparities and their impacts on downstream policy. InProceedings of the Thirty-Second International Joint Conference on Artificial Intelligence, IJCAI ’23, 2023. doi: 10.24963/ijcai.2023/653. URL https://doi.org/10.24963...

  3. [3]

    Neural machine translation by jointly learning to align and translate

    Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. InProceedings of the 3rd International Conference on Learning Representations (ICLR), 2015. URLhttp://arxiv.org/abs/1409.0473

  4. [4]

    Allyn, and Auroop R

    Ivan Felipe Benavides-Martinez, Justin Guthrie, Jhon Edwin Arias, Yeison Alberto Garces- Gomez, Angela Ines Guzman-Alvis, Cristiam Victoriano Portilla-Cabrera, Somnath Mondal, Andrew J. Allyn, and Auroop R. Ganguly. What on Earth is AlphaEarth? Hierarchical structure and functional interpretability for global land cover.arXiv preprint arXiv:2603.16911, 2026

  5. [5]

    Calmon, and Himabindu Lakkaraju

    Usha Bhalla, Alex Oesterling, Suraj Srinivas, Flavio P. Calmon, and Himabindu Lakkaraju. Interpreting clip with sparse linear concept embeddings (splice). 37:84298–84328, 2024. doi: 10.52202/079017-2678. URL https://proceedings.neurips.cc/paper_files/paper/ 2024/file/996bef37d8a638f37bdfcac2789e835d-Paper-Conference.pdf

  6. [6]

    Brown, Michal Kazmierski, Valerie J

    Christopher F. Brown, Michal Kazmierski, Valerie J. Pasquarella, William J. Rucklidge, Masha Samsikova, Chenhui Zhang, Evan Shelhamer, Estefania Lahera, Olivia Wiles, Simon Ilyushchenko, Noel Gorelick, Lihui Zhang, Sophia Alj, Emily Schechter, Sean Askay, Oliver Guinan, Rebecca Moore, Alexis Boukouvalas, and Pushmeet Kohli. Alphaearth foundations: An embe...

  7. [7]

    Batchtopk sparse autoencoders

    Bart Bussmann, Patrick Leask, and Neel Nanda. Batchtopk sparse autoencoders. InNeurIPS 2024 Workshop on Scientific Methods for Understanding Deep Learning, 2024. URL https: //openreview.net/forum?id=d4dpOCqybL

  8. [8]

    Adapting Grad-CAM for embedding networks

    Lei Chen, Jianhui Chen, Hossein Hajimirsadeghi, and Greg Mori. Adapting Grad-CAM for embedding networks. InProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 2794–2803, 2020

  9. [9]

    Functional map of the world

    Gordon Christie, Neil Fendley, James Wilson, and Ryan Mukherjee. Functional map of the world. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6172–6180, 2018

  10. [10]

    Spatial implicit neural representations for global-scale species mapping

    Elijah Cole, Grant Van Horn, Christian Lange, Alexander Shepard, Patrick Leary, Pietro Perona, Scott Loarie, and Oisin Mac Aodha. Spatial implicit neural representations for global-scale species mapping. InProceedings of the 40th International Conference on Machine Learning, volume 202 ofProceedings of Machine Learning Research, pages 6320–6342. PMLR, 23–...

  11. [11]

    In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large- scale hierarchical image database. In2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, 2009. doi: 10.1109/CVPR.2009.5206848. 11

  12. [12]

    Carl Doersch, Saurabh Singh, Abhinav Gupta, Josef Sivic, and Alexei A. Efros. What makes Paris look like Paris?Commun. ACM, 58(12):103–110, November 2015. doi: 10.1145/2830541. URLhttps://doi.org/10.1145/2830541

  13. [13]

    Climplicit: Climatic implicit embeddings for global ecological tasks

    Johannes Dollinger, Damien Robert, Elena Plekhanova, Lukas Drees, and Jan Dirk Wegner. Climplicit: Climatic implicit embeddings for global ecological tasks. InICLR 2025 Workshop on Tackling Climate Change with Machine Learning, 2025. URLhttps://www.climatechange. ai/papers/iclr2025/44

  14. [14]

    Adaptive methods for real-world domain generalization

    Abhimanyu Dubey, Vignesh Ramanathan, Alex Pentland, and Dhruv Mahajan. Adaptive methods for real-world domain generalization. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 14340–14349, June 2021

  15. [15]

    A density-based algorithm for discovering clusters in large spatial databases with noise

    Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. InProceedings of the Second International Conference on Knowledge Discovery and Data Mining, KDD’96, page 226–231. AAAI Press, 1996

  16. [16]

    Combi-CAM: A novel multi-layer approach for explainable image geolocalization

    David Faget, José Luis Lisani, and Miguel Colom. Combi-CAM: A novel multi-layer approach for explainable image geolocalization. In21st International Conference on Computer Vision Theory and Applications, volume 1, pages 275–281. SCITEPRESS-Science and Technology Publications, 2026

  17. [17]

    Deep residual learning for image recognition

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016

  18. [18]

    Natural language descriptions of deep features

    Evan Hernandez, Sarah Schwettmann, David Bau, Teona Bagashvili, Antonio Torralba, and Jacob Andreas. Natural language descriptions of deep features. InInternational Confer- ence on Learning Representations, 2022. URL https://openreview.net/forum?id= NudBMY-tzDr

  19. [19]

    LoRA: Low-rank adaptation of large language models

    Edward J Hu, yelong shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022. URL https://openreview. net/forum?id=nZeVKeeFYf9

  20. [20]

    Sparse autoencoders find highly interpretable features in language models

    Robert Huben, Hoagy Cunningham, Logan Riggs Smith, Aidan Ewart, and Lee Sharkey. Sparse autoencoders find highly interpretable features in language models. InThe Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum? id=F76bwRSLeK

  21. [21]

    Opening the black box: A systematic review on explainable artificial intelligence in remote sensing.IEEE Geoscience and Remote Sensing Magazine, 12(4):261–304, 2024

    Adrian Höhl, Ivica Obadic, Miguel-Ángel Fernández-Torres, Hiba Najjar, Dario Augusto Borges Oliveira, Zeynep Akata, Andreas Dengel, and Xiao Xiang Zhu. Opening the black box: A systematic review on explainable artificial intelligence in remote sensing.IEEE Geoscience and Remote Sensing Magazine, 12(4):261–304, 2024. doi: 10.1109/MGRS.2024.3467001

  22. [22]

    Majda Ivi´c. Artificial intelligence and geospatial analysis in disaster management.The In- ternational Archives of the Photogrammetry, Remote Sensing and Spatial Information Sci- ences, XLII-3/W8:161–166, 2019. doi: 10.5194/isprs-archives-XLII-3-W8-161-2019. URL https://isprs-archives.copernicus.org/articles/XLII-3-W8/161/2019/

  23. [23]

    Towards interpretable geo-localization: a concept-aware global image-GPS alignment framework.arXiv preprint arXiv:2509.01910, 2025

    Furong Jia, Lanxin Liu, Ce Hou, Fan Zhang, Xinyan Liu, and Yu Liu. Towards interpretable geo-localization: a concept-aware global image-GPS alignment framework.arXiv preprint arXiv:2509.01910, 2025

  24. [24]

    Peter Linder, and Michael Kessler

    Dirk Nikolaus Karger, Olaf Conrad, Jürgen Böhner, Tobias Kawohl, Holger Kreft, Ro- drigo Wilber Soria-Auza, Niklaus E Zimmermann, H. Peter Linder, and Michael Kessler. Climatologies at high resolution for the earth’s land surface areas.Scientific data, 4(1):1–20,

  25. [25]

    URLhttps://doi.org/10.1038/sdata.2017.122

    doi: 10.1038/sdata.2017.122. URLhttps://doi.org/10.1038/sdata.2017.122

  26. [26]

    SatCLIP: Global, general-purpose location embeddings with satellite imagery

    Konstantin Klemmer, Esther Rolf, Caleb Robinson, Lester Mackey, and Marc Rußwurm. SatCLIP: Global, general-purpose location embeddings with satellite imagery. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 4347–4355, 2025. 12

  27. [27]

    Klemmer et al., ”Earth Embeddings: Towards ai-centric represen- tations of our planet,” EarthArXiv, Dec

    Konstantin Klemmer, Esther Rolf, Marc Russwurm, Gustau Camps-Valls, Mikolaj Czerkawski, Stefano Ermon, Alistair Francis, Nathan Jacobs, Hannah Rae Kerner, Lester Mackey, Gengchen Mai, Oisin Mac Aodha, Markus Reichstein, Caleb Robinson, David Rolnick, Evan Shelhamer, Vincent Sitzmann, Devis Tuia, and Xiaoxiang Zhu. Earth embeddings: Towards AI-centric repr...

  28. [28]

    Science, 382 (6677), 1416--1421, doi:10.1126/science.adi2336

    Remi Lam, Alvaro Sanchez-Gonzalez, Matthew Willson, Peter Wirnsberger, Meire Fortunato, Ferran Alet, Suman Ravuri, Timo Ewalds, Zach Eaton-Rosen, Weihua Hu, Alexander Merose, Stephan Hoyer, George Holland, Oriol Vinyals, Jacklynn Stott, Alexander Pritzel, Shakir Mohamed, and Peter Battaglia. Learning skillful medium-range global weather forecasting. Scien...

  29. [29]

    Martha Larson, Mohammad Soleymani, Guillaume Gravier, Bogdan Ionescu, and Gareth J.F. Jones. The benchmarking initiative for multimedia evaluation: Mediaeval 2016.IEEE MultiMe- dia, 24(1):93–96, 2017. doi: 10.1109/MMUL.2017.9

  30. [30]

    Exploring visual interpretabil- ity for contrastive language-image pre-training.arXiv preprint arXiv:2209.07046, 2022

    Yi Li, Hualiang Wang, Yiqun Duan, Hang Xu, and Xiaomeng Li. Exploring visual interpretabil- ity for contrastive language-image pre-training.arXiv preprint arXiv:2209.07046, 2022

  31. [31]

    A closer look at the explainability of contrastive language-image pre-training.Pattern Recognition, 162:111409,

    Yi Li, Hualiang Wang, Yiqun Duan, Jiheng Zhang, and Xiaomeng Li. A closer look at the explainability of contrastive language-image pre-training.Pattern Recognition, 162:111409,

  32. [32]

    doi: https://doi.org/10.1016/j.patcog.2025.111409

    ISSN 0031-3203. doi: https://doi.org/10.1016/j.patcog.2025.111409. URL https: //www.sciencedirect.com/science/article/pii/S003132032500069X

  33. [33]

    Mind the gap: Understanding the modality gap in multi-modal contrastive representation learn- ing

    Victor Weixin Liang, Yuhui Zhang, Yongchan Kwon, Serena Yeung, and James Zou. Mind the gap: Understanding the modality gap in multi-modal contrastive representation learn- ing. InAdvances in Neural Information Processing Systems, volume 35, pages 17612– 17625, 2022. URL https://proceedings.neurips.cc/paper_files/paper/2022/ file/702f4db7543a7432431df588d5...

  34. [34]

    Text2earth: Unlocking text-driven remote sensing image generation with a global-scale dataset and a foundation model

    Chenyang Liu, Keyan Chen, Rui Zhao, Zhengxia Zou, and Zhenwei Shi. Text2earth: Unlocking text-driven remote sensing image generation with a global-scale dataset and a foundation model. IEEE Geoscience and Remote Sensing Magazine, 13(3):238–259, 2025. doi: 10.1109/MGRS. 2025.3560455

  35. [35]

    Lundberg and Su-In Lee

    Scott M. Lundberg and Su-In Lee. A unified approach to interpreting model predictions. InAdvances in Neural Information Processing Systems, volume 30,

  36. [36]

    URL https://proceedings.neurips.cc/paper_files/paper/2017/file/ 8a20a8621978632d76c43dfd28b67767-Paper.pdf

  37. [37]

    Presence-only geographical priors for fine-grained image classification

    Oisin Mac Aodha, Elijah Cole, and Pietro Perona. Presence-only geographical priors for fine-grained image classification. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 9596–9606, 2019

  38. [38]

    Multi-scale repre- sentation learning for spatial feature distributions using grid cells

    Gengchen Mai, Krzysztof Janowicz, Bo Yan, Rui Zhu, Ling Cai, and Ni Lao. Multi-scale repre- sentation learning for spatial feature distributions using grid cells. InInternational Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=rJljdh4KDH

  39. [39]

    A review of location encoding for geoai: methods and applications.International Journal of Geographical Information Science, 36(4):639–673, 2022

    Gengchen Mai, Krzysztof Janowicz, Yingjie Hu, Song Gao, Bo Yan, Rui Zhu, Ling Cai, and Ni Lao. A review of location encoding for geoai: methods and applications.International Journal of Geographical Information Science, 36(4):639–673, 2022. doi: 10.1080/13658816. 2021.2004602. URLhttps://doi.org/10.1080/13658816.2021.2004602

  40. [40]

    UMAP: Uniform manifold approximation and projection.The Journal of Open Source Software, 3(29):861, 2018

    Leland McInnes, John Healy, Nathaniel Saul, and Lukas Großberger. UMAP: Uniform manifold approximation and projection.The Journal of Open Source Software, 3(29):861, 2018

  41. [41]

    Text-to-concept (and back) via cross-model alignment

    Mazda Moayeri, Keivan Rezaei, Maziar Sanjabi, and Soheil Feizi. Text-to-concept (and back) via cross-model alignment. InInternational Conference on Machine Learning, pages 25037–25060. PMLR, 2023. 13

  42. [42]

    CLIP-dissect: Automatic description of neuron rep- resentations in deep vision networks

    Tuomas Oikarinen and Tsui-Wei Weng. CLIP-dissect: Automatic description of neuron rep- resentations in deep vision networks. InThe Eleventh International Conference on Learning Representations, 2023. URLhttps://openreview.net/forum?id=iPWiwWHc1V

  43. [43]

    Sparse autoencoders learn monosemantic features in vision-language mod- els

    Mateusz Pach, Shyamgopal Karthik, Quentin Bouniot, Serge Belongie, and Zeynep Akata. Sparse autoencoders learn monosemantic features in vision-language mod- els. InAdvances in Neural Information Processing Systems, volume 38, pages 95706– 95742, 2025. URL https://proceedings.neurips.cc/paper_files/paper/2025/ file/89e83382abeee53b932a6df62edbf9cc-Paper-Co...

  44. [44]

    Nowara, Joshua Gleason, Carlos D

    Shraman Pramanick, Ewa M. Nowara, Joshua Gleason, Carlos D. Castillo, and Rama Chellappa. Where in the world is this image? transformer-based geo-localization in the wild. InEuropean Conference on Computer Vision, pages 196–215, 2022. URL https://doi.org/10.1007/ 978-3-031-19839-7_12

  45. [45]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agar- wal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In Proceedings of the 38th International Conference on Machine Learning, volume 139 ofPro- ceeding...

  46. [46]

    Measuring the intrinsic dimension of earth representations

    Arjun Rao, Marc Rußwurm, Konstantin Klemmer, and Esther Rolf. Measuring the intrinsic dimension of earth representations. InThe Fourteenth International Conference on Learning Representations, 2026. URLhttps://openreview.net/forum?id=gQPD83DrGp

  47. [47]

    Why Should I Trust You?

    Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. “Why should I trust you?": Explaining the predictions of any classifier. InProceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’16, page 1135–1144, New York, NY , USA, 2016. Association for Computing Machinery. doi: 10.1145/2939672.2939778. URL http...

  48. [48]

    Ge- ographic location encoding with spherical harmonics and sinusoidal representation net- works

    Marc Rußwurm, Konstantin Klemmer, Esther Rolf, Robin Zbinden, and Devis Tuia. Ge- ographic location encoding with spherical harmonics and sinusoidal representation net- works. InInternational Conference on Learning Representations, volume 2024, pages 1746– 1759, 2024. URL https://proceedings.iclr.cc/paper_files/paper/2024/file/ 073c8584ef86bee26fe9d639ec6...

  49. [49]

    Grad-CAM: Visual explanations from deep networks via gradient- based localization

    Ramprasaath R Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-CAM: Visual explanations from deep networks via gradient- based localization. InProceedings of the IEEE International Conference on Computer Vision, pages 618–626, 2017

  50. [50]

    A survey on sparse autoencoders: Interpreting the internal mechanisms of large language models

    Dong Shu, Xuansheng Wu, Haiyan Zhao, Daking Rai, Ziyu Yao, Ninghao Liu, and Mengnan Du. A survey on sparse autoencoders: Interpreting the internal mechanisms of large language models. InFindings of the Association for Computational Linguistics: EMNLP 2025, pages 1690–1712. Association for Computational Linguistics, 2025. doi: 10.18653/v1/2025.findings-emn...

  51. [51]

    Implicit neural representations with periodic activation functions

    Vincent Sitzmann, Julien Martel, Alexander Bergman, David Lindell, and Gordon Wetzstein. Implicit neural representations with periodic activation functions. In Advances in Neural Information Processing Systems, volume 33, pages 7462–7473,

  52. [52]

    URL https://proceedings.neurips.cc/paper_files/paper/2020/file/ 53c04118df112c13a8c34b38343b9c10-Paper.pdf

  53. [53]

    Stewart, Caleb Robinson, Isaac A

    Adam J. Stewart, Caleb Robinson, Isaac A. Corley, Anthony Ortiz, Juan M. Lavista Ferres, and Arindam Banerjee. Torchgeo: Deep learning with geospatial data.ACM Trans. Spatial Algorithms Syst., 11(4), 2025. ISSN 2374-0353. doi: 10.1145/3707459. URL https://doi. org/10.1145/3707459

  54. [54]

    Fourier fea- tures let networks learn high frequency functions in low dimensional domains

    Matthew Tancik, Pratul Srinivasan, Ben Mildenhall, Sara Fridovich-Keil, Nithin Ragha- van, Utkarsh Singhal, Ravi Ramamoorthi, Jonathan Barron, and Ren Ng. Fourier fea- tures let networks learn high frequency functions in low dimensional domains. In 14 Advances in Neural Information Processing Systems, volume 33, pages 7537–7547,

  55. [55]

    URL https://proceedings.neurips.cc/paper_files/paper/2020/file/ 55053683268957697aa39fba6f231c68-Paper.pdf

  56. [56]

    V o, John Brandt, Justine Spore, Sayantan Majumdar, Daniel Haziza, Janaki Vamaraju, Theo Moutakanni, Piotr Bojanowski, Tracy Johns, Brian White, Tobias Tiecke, and Camille Couprie

    Jamie Tolan, Hung-I Yang, Benjamin Nosarzewski, Guillaume Couairon, Huy V . V o, John Brandt, Justine Spore, Sayantan Majumdar, Daniel Haziza, Janaki Vamaraju, Theo Moutakanni, Piotr Bojanowski, Tracy Johns, Brian White, Tobias Tiecke, and Camille Couprie. Very high resolution canopy height maps from RGB imagery using self-supervised vision transformer an...

  57. [57]

    Vo, John Brandt, JustineSpore, SayantanMajumdar, DanielHaziza, JanakiVamaraju, TheoMoutakanni, Piotr Bojanowski, Tracy Johns, Brian White, Tobias Tiecke, and Camille Couprie

    ISSN 0034-4257. doi: https://doi.org/10.1016/j.rse.2023.113888. URL https://www. sciencedirect.com/science/article/pii/S003442572300439X

  58. [58]

    Visualizing data using t-SNE.Journal of Machine Learning Research, 9(86):2579–2605, 2008

    Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE.Journal of Machine Learning Research, 9(86):2579–2605, 2008. URL http://jmlr.org/papers/v9/ vandermaaten08a.html

  59. [59]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Ł ukasz Kaiser, and Illia Polosukhin. Attention is all you need. InAdvances in Neural Infor- mation Processing Systems, volume 30, 2017. URL https://proceedings.neurips.cc/ paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf

  60. [60]

    Geoclip: Clip- inspired alignment between locations and images for effective worldwide geo-localization

    Vicente Vivanco Cepeda, Gaurav Kumar Nayak, and Mubarak Shah. Geoclip: Clip- inspired alignment between locations and images for effective worldwide geo-localization. InAdvances in Neural Information Processing Systems, volume 36, pages 8690–8701,

  61. [61]

    URL https://proceedings.neurips.cc/paper_files/paper/2023/file/ 1b57aaddf85ab01a2445a79c9edc1f4b-Paper-Conference.pdf

  62. [62]

    A proxy consistency loss for grounded fusion of earth observation and location encoders

    Zhongying Wang, Kevin Lane, Levi Cai, Morteza Karimzadeh, and Esther Rolf. A proxy consistency loss for grounded fusion of earth observation and location encoders. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pages 8075–8084, June 2026

  63. [63]

    LiT: Zero-shot transfer with locked-image text tuning

    Xiaohua Zhai, Xiao Wang, Basil Mustafa, Andreas Steiner, Daniel Keysers, Alexander Kolesnikov, and Lucas Beyer. LiT: Zero-shot transfer with locked-image text tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 18123–18133, June 2022

  64. [64]

    Learning deep features for discriminative localization

    Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. Learning deep features for discriminative localization. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016

  65. [65]

    Stewart, Konrad Heidler, Yuanyuan Wang, Zhenghang Yuan, Thomas Dujardin, Qingsong Xu, and Yilei Shi

    Xiao Xiang Zhu, Zhitong Xiong, Yi Wang, Adam J. Stewart, Konrad Heidler, Yuanyuan Wang, Zhenghang Yuan, Thomas Dujardin, Qingsong Xu, and Yilei Shi. On the foundations of earth foundation models.Communications Earth & Environment, 7(1):103, Jan 2026. doi: 10.1038/s43247-025-03127-x. URLhttps://doi.org/10.1038/s43247-025-03127-x. 15 7 Appendix 7.1 Reproduc...

  66. [66]

    rocky mountains

    We use a cosine learning rate schedule. The logit temperature is learnable and is initialized at 0.07. We train for 55k steps, validating every 500 steps and applying early stopping based on validation loss (with a patience of 5 validation checks). Training is conducted on NVIDIA RTX 8000s and H100s. Assessing and Visualizing Location–Text Alignment.We sh...

  67. [67]

    desert” and “wadi

    feature, while the sparse latent representation z=f enc(x) determines which features are active for a given input. 7.4.2 Sparse Linear Concept Embeddings (SpLiCE) To apply SpLiCE, we use apredefinedconcept dictionary, C∈R nc×dee, where nc is the num- ber of concepts. Each row in C corresponds to an embedding in the same space as the location embedding x∈R...