pith. sign in

arxiv: 2406.04952 · v2 · submitted 2024-06-07 · 💻 cs.CL · cs.AI

Quantifying Geospatial in the Common Crawl Corpus

Pith reviewed 2026-05-23 23:53 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords Common Crawlgeospatial informationweb corpuslarge language modelsprevalencespatial data
0
0 comments X

The pith

18.7% of documents in the Common Crawl corpus contain geospatial information.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper measures the amount of geospatial content in the Common Crawl corpus used to train large language models. The authors draw a sample from recent releases, apply Gemini 1.5 to flag documents with coordinates, addresses, and related details, and manually correct the flags. They arrive at an estimate of 18.7% of documents containing such information. The rate shows little difference between English documents and those in other languages. The result supplies a concrete baseline for understanding the spatial knowledge that reaches language models through their training data.

Core claim

Analysis of a sample of Common Crawl documents, classified for geospatial content by Gemini 1.5 and then manually revised, leads to the conclusion that 18.7% of web documents contain geospatial information such as coordinates and addresses, with comparable prevalence in English and non-English documents.

What carries the argument

Gemini 1.5-based classification of sample documents for the presence of geospatial details, revised by hand to yield the overall prevalence figure.

If this is right

  • Large language models encounter geospatial data in roughly one-fifth of their Common Crawl training documents.
  • The share of geospatial content remains similar regardless of document language.
  • The estimate supplies a foundation for investigating geospatial biases within trained language models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Analyses of other content types in the Common Crawl could follow the same sampling and classification approach.
  • Data preparation pipelines for language model training might reference this percentage when deciding on filters for spatial information.
  • The finding suggests testing whether models' spatial performance scales with the measured prevalence in their training sets.

Load-bearing premise

The sample of documents is representative of the full Common Crawl corpus and the manual revisions confirm the accuracy of the model's classifications for geospatial content.

What would settle it

Classification of a much larger or exhaustive set of Common Crawl documents by human reviewers that produces a percentage far from 18.7%.

Figures

Figures reproduced from arXiv: 2406.04952 by Ilya Ilyankou, James Haworth, Meihui Wang, Stefano Cavazzi.

Figure 1
Figure 1. Figure 1: The distribution of sampled web documents’ length, [PITH_FULL_IMAGE:figures/full_fig_p004_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Language frequency among sampled documents vs [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: The prevalence of geospatial data in select CC releases, estimated within [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Geospatial data frequency in English and non [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗
read the original abstract

Large language models (LLMs) exhibit emerging geospatial capabilities, stemming from their pre-training on vast unlabelled text datasets that are often derived from the Common Crawl (CC) corpus. However, the geospatial content within CC remains largely unexplored, impacting our understanding of LLMs' spatial reasoning. This paper investigates the prevalence of geospatial data in recent Common Crawl releases using Gemini 1.5, a powerful language model. By analyzing a sample of documents and manually revising the results, we estimate that 18.7% of web documents in CC contain geospatial information such as coordinates and addresses. We find little difference in prevalence between Enlgish- and non-English-language documents. Our findings provide quantitative insights into the nature and extent of geospatial data in CC, and lay the groundwork for future studies of geospatial biases of LLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper uses Gemini 1.5 to classify a sample of documents from recent Common Crawl releases, followed by manual revision, to estimate that 18.7% of web documents contain geospatial information such as coordinates and addresses. It finds little difference in prevalence between English- and non-English-language documents.

Significance. If the sampling and classification methodology can be validated, this provides a quantitative measure of geospatial content in a major pretraining corpus, which is relevant for understanding the sources of LLMs' geospatial capabilities and potential biases in spatial reasoning.

major comments (2)
  1. [Abstract] Abstract: The central prevalence estimate of 18.7% is presented without any information on sample size, sampling frame or stratification (by language, domain, crawl date, or duplicates), or the manual revision protocol (fraction revised, reviewer count, or reliability metrics). This directly undermines assessment of the claim's reliability and generalizability.
  2. [Abstract] Abstract (and implied Methods): No error rates, validation against a gold standard, or inter-annotator agreement are reported for the Gemini 1.5 classifications even after manual revision, leaving the accuracy of the labels unquantified and the 18.7% figure without a clear uncertainty bound.
minor comments (1)
  1. [Abstract] Abstract: Typo in 'Enlgish-language' (should be 'English-language').

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for their constructive comments, which highlight important aspects of reporting transparency. We address each major comment below and indicate where revisions will be made to the manuscript.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The central prevalence estimate of 18.7% is presented without any information on sample size, sampling frame or stratification (by language, domain, crawl date, or duplicates), or the manual revision protocol (fraction revised, reviewer count, or reliability metrics). This directly undermines assessment of the claim's reliability and generalizability.

    Authors: We agree that the abstract should be self-contained with respect to these details. The full methods section describes the sampling from recent Common Crawl releases and the English/non-English stratification, but we will revise the abstract to explicitly state the sample size, sampling frame, stratification approach, and manual revision protocol (including fraction revised and reviewer count). Reliability metrics from the revision process will be included if they were recorded. revision: yes

  2. Referee: [Abstract] Abstract (and implied Methods): No error rates, validation against a gold standard, or inter-annotator agreement are reported for the Gemini 1.5 classifications even after manual revision, leaving the accuracy of the labels unquantified and the 18.7% figure without a clear uncertainty bound.

    Authors: This is a fair observation. The study relied on manual revision of Gemini outputs to improve label quality, but no formal error rates, gold-standard validation, or inter-annotator agreement statistics were computed. We will revise the abstract and methods to describe the revision protocol in more detail and to note the lack of quantitative accuracy metrics as a limitation, while clarifying that the 18.7% estimate is based on the revised labels. revision: partial

standing simulated objections not resolved
  • Specific numerical error rates, gold-standard validation results, or inter-annotator agreement values cannot be supplied because these were not calculated during the original analysis.

Circularity Check

0 steps flagged

No circularity: empirical prevalence estimate from sampling and classification

full rationale

The paper's central claim is a direct empirical count (18.7% prevalence) obtained by sampling CC documents, applying Gemini 1.5 classification, and performing manual revision. No equations, fitted parameters, derivations, or self-citations appear in the load-bearing steps. The result is a straightforward proportion from observed data and does not reduce to its inputs by construction. This matches the default case of a self-contained empirical study with score 0.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The estimate rests on two unverified premises: that the LLM classifier plus manual review produces reliable labels, and that the chosen sample mirrors the full corpus distribution. No free parameters or invented entities are introduced.

axioms (2)
  • domain assumption Gemini 1.5 can be prompted to detect geospatial information in web documents with sufficient accuracy after manual correction
    The 18.7% figure is produced by this classification step; its validity depends on the assumption.
  • domain assumption The sampled documents are statistically representative of recent Common Crawl releases
    The prevalence claim is extrapolated from the sample to the entire corpus.

pith-pipeline@v0.9.0 · 5672 in / 1312 out tokens · 24531 ms · 2026-05-23T23:53:34.564738+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

38 extracted references · 38 canonical work pages · 5 internal anchors

  1. [1]

    Stefan Baack. 2024. A Critical Analysis of the Largest Source for Generative AI Training Data: Common Crawl. In The 2024 ACM Conference on Fairness, Accountability, and Transparency. ACM, Rio de Janeiro Brazil, 2199–2208. https: //doi.org/10.1145/3630106.3659033

  2. [2]

    Stefan Baack. 2024. Training Data for the Price of a Sandwich: Common Crawl’s Impact on Generative AI. (Feb. 2024)

  3. [3]

    James E Bartlett, Joe W Kotrlik, and Chadwick C Higgins. 2001. Organizational Research: Determining Appropriate Sample Size in Survey Research. (2001)

  4. [4]

    Prabin Bhandari, Antonios Anastasopoulos, and Dieter Pfoser. 2023. Are Large Language Models Geospatially Knowledgeable?. In Proceedings of the 31st ACM International Conference on Advances in Geographic Information Systems (SIGSPA- TIAL ’23). Association for Computing Machinery, New York, NY, USA, 1–4. https://doi.org/10.1145/3589132.3625625

  5. [5]

    Abeba Birhane, Vinay Uday Prabhu, and Emmanuel Kahembwe. 2021. Multimodal datasets: misogyny, pornography, and malignant stereotypes. http://arxiv.org/ abs/2110.01963 arXiv:2110.01963 [cs]

  6. [6]

    John Bossler. 2010. Manual of Geospatial Science and Technology . CRC Press. Google-Books-ID: UdZ3uDekqwwC

  7. [7]

    Chia-Hui Chang and Shu-Ying Li. 2010. MapMarker: Extraction of Postal Addresses and Associated Information for General Web Pages. In 2010 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology. IEEE, Toronto, AB, Canada, 105–111. https://doi.org/10.1109/WI- IAT.2010.64

  8. [8]

    Common Crawl. 2024. Common Crawl - Open Repository of Web Crawl Data. https://commoncrawl.org/

  9. [9]

    Common Crawl. 2024. Statistics of Common Crawl Monthly Archives by com- moncrawl. https://commoncrawl.github.io/cc-crawl-statistics/plots/languages

  10. [10]

    Jesse Dodge, Maarten Sap, Ana Marasović, William Agnew, Gabriel Ilharco, Dirk Groeneveld, Margaret Mitchell, and Matt Gardner. 2021. Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus. http://arxiv.org/abs/2104.08758 arXiv:2104.08758 [cs]

  11. [11]

    Julia Efremova, Ian Endres, Isaac Vidas, and Ofer Melnik. 2018. A Geo-Tagging Framework for Address Extraction from Web Pages. In Advances in Data Mining. Applications and Theoretical Aspects, Petra Perner (Ed.). Springer International Publishing, Cham, 288–295. https://doi.org/10.1007/978-3-319-95786-9_22

  12. [12]

    Nir Fulman, Abdulkadir Memduhoğlu, and Alexander Zipf. 2024. Evidence for Systematic Bias in the Spatial Memory of Large Language Models. (2024)

  13. [13]

    Wes Gurnee and Max Tegmark. 2024. Language Models Represent Space and Time. https://doi.org/10.48550/arXiv.2310.02207 arXiv:2310.02207 [cs]

  14. [14]

    Ilya Ilyankou, Meihui Wang, James Haworth, and Stefano Cavazzi. 2024. CC- GPX: Extracting High-Quality Annotated Geospatial Data from Common Crawl. https://doi.org/10.48550/arXiv.2405.11039 arXiv:2405.11039 [cs]

  15. [15]

    ISO. 2012. geographic data — MLGT: The authoritative multi-lingual geographic information terminology database. https://isotc211.geolexica.org/concepts/202/

  16. [16]

    I Think i Discovered a Military Base in the Middle of the Ocean

    Levente Juhasz and Peter Mooney. 2022. “I Think i Discovered a Military Base in the Middle of the Ocean”—Null Island, the Most Real of Fictional Places. IEEE Access 10 (2022), 84147–84165. https://doi.org/10.1109/ACCESS.2022.3197222 Conference’17, July 2017, Washington, DC, USA Ilya Ilyankou, Meihui Wang, Stefano Cavazzi, and James Haworth

  17. [17]

    Julia Kreutzer, Isaac Caswell, Lisa Wang, Ahsan Wahab, Daan van Esch, Nasan- bayar Ulzii-Orshikh, Allahsera Tapo, Nishant Subramani, Artem Sokolov, Clay- tone Sikasote, Monang Setyawan, Supheakmungkol Sarin, Sokhar Samb, Benoît Sagot, Clara Rivera, Annette Rios, Isabel Papadimitriou, Salomey Osei, Pedro Or- tiz Suarez, Iroro Orife, Kelechi Ogueji, Andre N...

  18. [18]

    Transactions of the Association for Computational Linguistics 10 (Jan

    Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets. Transactions of the Association for Computational Linguistics 10 (Jan. 2022), 50–72. https://doi.org/10.1162/tacl_a_00447 arXiv:2103.12028 [cs]

  19. [19]

    Alexandra Sasha Luccioni and Joseph D. Viviano. 2021. What’s in the Box? A Preliminary Analysis of Undesirable Content in the Common Crawl Corpus. http://arxiv.org/abs/2105.02732 arXiv:2105.02732 [cs]

  20. [20]

    Gengchen Mai, Weiming Huang, Jin Sun, Suhang Song, Deepak Mishra, Ninghao Liu, Song Gao, Tianming Liu, Gao Cong, Yingjie Hu, Chris Cundy, Ziyuan Li, Rui Zhu, and Ni Lao. 2023. On the Opportunities and Challenges of Foundation Models for Geospatial Artificial Intelligence. http://arxiv.org/abs/2304.06798 arXiv:2304.06798 [cs]

  21. [21]

    Rohin Manvi, Samar Khanna, Marshall Burke, David Lobell, and Stefano Ermon

  22. [22]

    arXiv:2402.02680, 2024

    Large Language Models are Geographically Biased. http://arxiv.org/abs/ 2402.02680 arXiv:2402.02680 [cs]

  23. [23]

    Mazda Moayeri and Soheil Feizi. 2024. WorldBench: Quantifying Geographic Disparities in LLM Factual Recall. https://openreview.net/forum?id=fubvUIBggI

  24. [24]

    Hannes Mühleisen and Christian Bizer. 2012. Web Data Commons – Extracting Structured Data from Two Large Web Corpora. (2012)

  25. [25]

    John Paolillo, Daniel Pimienta, and Daniel Prado. 2005. Measuring Linguistic Diversity on the Internet. (2005)

  26. [26]

    Guilherme Penedo, Quentin Malartic, Daniel Hesslow, Ruxandra Cojocaru, Alessandro Cappelli, Hamza Alobeidli, Baptiste Pannier, Ebtesam Almazrouei, and Julien Launay. 2023. The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only. http://arxiv.org/abs/2306. 01116 arXiv:2306.01116 [cs]

  27. [27]

    Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Im- proving Language Understanding by Generative Pre-Training. (2018)

  28. [28]

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. http: //arxiv.org/abs/1910.10683 arXiv:1910.10683 [cs, stat]

  29. [29]

    2013.Linked Data: Structured data on the Web

    Luke Ruth, David Wood, Marsha Zaidman, and Michael Hausenblas. 2013.Linked Data: Structured data on the Web . Simon and Schuster. Google-Books-ID: hTozEAAAQBAJ

  30. [30]

    Sebastian Schmidt, Simon Manschitz, Christoph Rensing, and Ralf Steinmetz

  31. [31]

    In Proceedings of the 13th International Conference on Knowl- edge Management and Knowledge Technologies

    Extraction of Address Data from Unstructured Text using Free Knowl- edge Resources. In Proceedings of the 13th International Conference on Knowl- edge Management and Knowledge Technologies . ACM, Graz Austria, 1–8. https: //doi.org/10.1145/2494188.2494193

  32. [32]

    Pedro Javier Ortiz Suárez, Benoît Sagot, and Laurent Romary. 2021. Asynchronous Pipeline for Processing Huge Corpora on Medium to Low Resource Infrastruc- tures. (2021)

  33. [33]

    Gemini Team. 2024. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv:2403.05530 [cs.CL]

  34. [34]

    Willem Robert Van Hage, Thomas Ploeger, and Jesper Hoeksema. 2014. Number frequency on the web. InProceedings of the 23rd International Conference on World Wide Web. ACM, Seoul Korea, 571–572. https://doi.org/10.1145/2567948.2576962

  35. [35]

    W3Techs. 2024. Usage Statistics and Market Share of Top Level Domains for Websites, August 2024. https://w3techs.com/technologies/overview/top_level_ domain

  36. [36]

    Finetuned Language Models Are Zero-Shot Learners

    Jason Wei, Maarten Bosma, Vincent Y. Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V. Le. 2022. Finetuned Language Models Are Zero-Shot Learners. http://arxiv.org/abs/2109.01652 arXiv:2109.01652 [cs]

  37. [37]

    Tingyu Xie, Qi Li, Jian Zhang, Yan Zhang, Zuozhu Liu, and Hongwei Wang. 2023. Empirical Study of Zero-Shot NER with ChatGPT. (2023)

  38. [38]

    Wenxuan Zhou, Sheng Zhang, Yu Gu, Muhao Chen, and Hoifung Poon. 2024. UniversalNER: Targeted Distillation from Large Language Models for Open Named Entity Recognition. http://arxiv.org/abs/2308.03279 arXiv:2308.03279 [cs]. Received 31 May 2024