Quantifying Geospatial in the Common Crawl Corpus

Ilya Ilyankou; James Haworth; Meihui Wang; Stefano Cavazzi

arxiv: 2406.04952 · v2 · submitted 2024-06-07 · 💻 cs.CL · cs.AI

Quantifying Geospatial in the Common Crawl Corpus

Ilya Ilyankou , Meihui Wang , Stefano Cavazzi , James Haworth This is my paper

Pith reviewed 2026-05-23 23:53 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords Common Crawlgeospatial informationweb corpuslarge language modelsprevalencespatial data

0 comments

The pith

18.7% of documents in the Common Crawl corpus contain geospatial information.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper measures the amount of geospatial content in the Common Crawl corpus used to train large language models. The authors draw a sample from recent releases, apply Gemini 1.5 to flag documents with coordinates, addresses, and related details, and manually correct the flags. They arrive at an estimate of 18.7% of documents containing such information. The rate shows little difference between English documents and those in other languages. The result supplies a concrete baseline for understanding the spatial knowledge that reaches language models through their training data.

Core claim

Analysis of a sample of Common Crawl documents, classified for geospatial content by Gemini 1.5 and then manually revised, leads to the conclusion that 18.7% of web documents contain geospatial information such as coordinates and addresses, with comparable prevalence in English and non-English documents.

What carries the argument

Gemini 1.5-based classification of sample documents for the presence of geospatial details, revised by hand to yield the overall prevalence figure.

If this is right

Large language models encounter geospatial data in roughly one-fifth of their Common Crawl training documents.
The share of geospatial content remains similar regardless of document language.
The estimate supplies a foundation for investigating geospatial biases within trained language models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Analyses of other content types in the Common Crawl could follow the same sampling and classification approach.
Data preparation pipelines for language model training might reference this percentage when deciding on filters for spatial information.
The finding suggests testing whether models' spatial performance scales with the measured prevalence in their training sets.

Load-bearing premise

The sample of documents is representative of the full Common Crawl corpus and the manual revisions confirm the accuracy of the model's classifications for geospatial content.

What would settle it

Classification of a much larger or exhaustive set of Common Crawl documents by human reviewers that produces a percentage far from 18.7%.

Figures

Figures reproduced from arXiv: 2406.04952 by Ilya Ilyankou, James Haworth, Meihui Wang, Stefano Cavazzi.

**Figure 2.** Figure 2: Language frequency among sampled documents vs [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗

**Figure 3.** Figure 3: The prevalence of geospatial data in select CC releases, estimated within [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗

**Figure 4.** Figure 4: Geospatial data frequency in English and non [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗

read the original abstract

Large language models (LLMs) exhibit emerging geospatial capabilities, stemming from their pre-training on vast unlabelled text datasets that are often derived from the Common Crawl (CC) corpus. However, the geospatial content within CC remains largely unexplored, impacting our understanding of LLMs' spatial reasoning. This paper investigates the prevalence of geospatial data in recent Common Crawl releases using Gemini 1.5, a powerful language model. By analyzing a sample of documents and manually revising the results, we estimate that 18.7% of web documents in CC contain geospatial information such as coordinates and addresses. We find little difference in prevalence between Enlgish- and non-English-language documents. Our findings provide quantitative insights into the nature and extent of geospatial data in CC, and lay the groundwork for future studies of geospatial biases of LLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper's 18.7% estimate for geospatial documents in Common Crawl is a new empirical number but rests on sampling and classification details that the abstract does not supply.

read the letter

The main thing here is a direct count: after running a sample through Gemini 1.5 and doing manual revision, the authors put geospatial content (coordinates, addresses) at 18.7% of Common Crawl documents, with almost no difference between English and non-English pages. That specific prevalence figure for the main pre-training corpus is new relative to earlier corpus studies. It supplies a concrete baseline that people working on spatial reasoning in LLMs or on geographic bias can actually use. The framing is clear about why the number matters for understanding where those capabilities come from. The work is honest empirical measurement with no circular derivations or invented parameters. The citation pattern looks standard and appropriate for the topic. The soft spot is exactly what the stress-test note flags. The abstract gives no sample size, no description of how the sample was drawn, no stratification by crawl or domain, and no numbers on how much of the output was manually revised or what the revision protocol was. Without those, it is hard to judge whether the 18.7% is representative or whether systematic classification errors survived the revision step. If the full paper contains those details they need to be front and center; from the abstract alone the central claim is under-supported. This is the kind of paper that would interest people who curate training data or build geospatial benchmarks. A reader in that area gets value from the number as a starting point even if they treat the exact percentage with caution until the methods are checked. It deserves peer review so referees can see the sampling frame and validation numbers and decide how much weight to give the result.

Referee Report

2 major / 1 minor

Summary. The paper uses Gemini 1.5 to classify a sample of documents from recent Common Crawl releases, followed by manual revision, to estimate that 18.7% of web documents contain geospatial information such as coordinates and addresses. It finds little difference in prevalence between English- and non-English-language documents.

Significance. If the sampling and classification methodology can be validated, this provides a quantitative measure of geospatial content in a major pretraining corpus, which is relevant for understanding the sources of LLMs' geospatial capabilities and potential biases in spatial reasoning.

major comments (2)

[Abstract] Abstract: The central prevalence estimate of 18.7% is presented without any information on sample size, sampling frame or stratification (by language, domain, crawl date, or duplicates), or the manual revision protocol (fraction revised, reviewer count, or reliability metrics). This directly undermines assessment of the claim's reliability and generalizability.
[Abstract] Abstract (and implied Methods): No error rates, validation against a gold standard, or inter-annotator agreement are reported for the Gemini 1.5 classifications even after manual revision, leaving the accuracy of the labels unquantified and the 18.7% figure without a clear uncertainty bound.

minor comments (1)

[Abstract] Abstract: Typo in 'Enlgish-language' (should be 'English-language').

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for their constructive comments, which highlight important aspects of reporting transparency. We address each major comment below and indicate where revisions will be made to the manuscript.

read point-by-point responses

Referee: [Abstract] Abstract: The central prevalence estimate of 18.7% is presented without any information on sample size, sampling frame or stratification (by language, domain, crawl date, or duplicates), or the manual revision protocol (fraction revised, reviewer count, or reliability metrics). This directly undermines assessment of the claim's reliability and generalizability.

Authors: We agree that the abstract should be self-contained with respect to these details. The full methods section describes the sampling from recent Common Crawl releases and the English/non-English stratification, but we will revise the abstract to explicitly state the sample size, sampling frame, stratification approach, and manual revision protocol (including fraction revised and reviewer count). Reliability metrics from the revision process will be included if they were recorded. revision: yes
Referee: [Abstract] Abstract (and implied Methods): No error rates, validation against a gold standard, or inter-annotator agreement are reported for the Gemini 1.5 classifications even after manual revision, leaving the accuracy of the labels unquantified and the 18.7% figure without a clear uncertainty bound.

Authors: This is a fair observation. The study relied on manual revision of Gemini outputs to improve label quality, but no formal error rates, gold-standard validation, or inter-annotator agreement statistics were computed. We will revise the abstract and methods to describe the revision protocol in more detail and to note the lack of quantitative accuracy metrics as a limitation, while clarifying that the 18.7% estimate is based on the revised labels. revision: partial

standing simulated objections not resolved

Specific numerical error rates, gold-standard validation results, or inter-annotator agreement values cannot be supplied because these were not calculated during the original analysis.

Circularity Check

0 steps flagged

No circularity: empirical prevalence estimate from sampling and classification

full rationale

The paper's central claim is a direct empirical count (18.7% prevalence) obtained by sampling CC documents, applying Gemini 1.5 classification, and performing manual revision. No equations, fitted parameters, derivations, or self-citations appear in the load-bearing steps. The result is a straightforward proportion from observed data and does not reduce to its inputs by construction. This matches the default case of a self-contained empirical study with score 0.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The estimate rests on two unverified premises: that the LLM classifier plus manual review produces reliable labels, and that the chosen sample mirrors the full corpus distribution. No free parameters or invented entities are introduced.

axioms (2)

domain assumption Gemini 1.5 can be prompted to detect geospatial information in web documents with sufficient accuracy after manual correction
The 18.7% figure is produced by this classification step; its validity depends on the assumption.
domain assumption The sampled documents are statistically representative of recent Common Crawl releases
The prevalence claim is extrapolated from the sample to the entire corpus.

pith-pipeline@v0.9.0 · 5672 in / 1312 out tokens · 24531 ms · 2026-05-23T23:53:34.564738+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

38 extracted references · 38 canonical work pages · 5 internal anchors

[1]

Stefan Baack. 2024. A Critical Analysis of the Largest Source for Generative AI Training Data: Common Crawl. In The 2024 ACM Conference on Fairness, Accountability, and Transparency. ACM, Rio de Janeiro Brazil, 2199–2208. https: //doi.org/10.1145/3630106.3659033

work page doi:10.1145/3630106.3659033 2024
[2]

Stefan Baack. 2024. Training Data for the Price of a Sandwich: Common Crawl’s Impact on Generative AI. (Feb. 2024)

work page 2024
[3]

James E Bartlett, Joe W Kotrlik, and Chadwick C Higgins. 2001. Organizational Research: Determining Appropriate Sample Size in Survey Research. (2001)

work page 2001
[4]

Prabin Bhandari, Antonios Anastasopoulos, and Dieter Pfoser. 2023. Are Large Language Models Geospatially Knowledgeable?. In Proceedings of the 31st ACM International Conference on Advances in Geographic Information Systems (SIGSPA- TIAL ’23). Association for Computing Machinery, New York, NY, USA, 1–4. https://doi.org/10.1145/3589132.3625625

work page doi:10.1145/3589132.3625625 2023
[5]

Abeba Birhane, Vinay Uday Prabhu, and Emmanuel Kahembwe. 2021. Multimodal datasets: misogyny, pornography, and malignant stereotypes. http://arxiv.org/ abs/2110.01963 arXiv:2110.01963 [cs]

work page arXiv 2021
[6]

John Bossler. 2010. Manual of Geospatial Science and Technology . CRC Press. Google-Books-ID: UdZ3uDekqwwC

work page 2010
[7]

Chia-Hui Chang and Shu-Ying Li. 2010. MapMarker: Extraction of Postal Addresses and Associated Information for General Web Pages. In 2010 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology. IEEE, Toronto, AB, Canada, 105–111. https://doi.org/10.1109/WI- IAT.2010.64

work page doi:10.1109/wi- 2010
[8]

Common Crawl. 2024. Common Crawl - Open Repository of Web Crawl Data. https://commoncrawl.org/

work page 2024
[9]

Common Crawl. 2024. Statistics of Common Crawl Monthly Archives by com- moncrawl. https://commoncrawl.github.io/cc-crawl-statistics/plots/languages

work page 2024
[10]

Jesse Dodge, Maarten Sap, Ana Marasović, William Agnew, Gabriel Ilharco, Dirk Groeneveld, Margaret Mitchell, and Matt Gardner. 2021. Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus. http://arxiv.org/abs/2104.08758 arXiv:2104.08758 [cs]

work page arXiv 2021
[11]

Julia Efremova, Ian Endres, Isaac Vidas, and Ofer Melnik. 2018. A Geo-Tagging Framework for Address Extraction from Web Pages. In Advances in Data Mining. Applications and Theoretical Aspects, Petra Perner (Ed.). Springer International Publishing, Cham, 288–295. https://doi.org/10.1007/978-3-319-95786-9_22

work page doi:10.1007/978-3-319-95786-9_22 2018
[12]

Nir Fulman, Abdulkadir Memduhoğlu, and Alexander Zipf. 2024. Evidence for Systematic Bias in the Spatial Memory of Large Language Models. (2024)

work page 2024
[13]

Wes Gurnee and Max Tegmark. 2024. Language Models Represent Space and Time. https://doi.org/10.48550/arXiv.2310.02207 arXiv:2310.02207 [cs]

work page doi:10.48550/arxiv.2310.02207 2024
[14]

Ilya Ilyankou, Meihui Wang, James Haworth, and Stefano Cavazzi. 2024. CC- GPX: Extracting High-Quality Annotated Geospatial Data from Common Crawl. https://doi.org/10.48550/arXiv.2405.11039 arXiv:2405.11039 [cs]

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2405.11039 2024
[15]

ISO. 2012. geographic data — MLGT: The authoritative multi-lingual geographic information terminology database. https://isotc211.geolexica.org/concepts/202/

work page 2012
[16]

I Think i Discovered a Military Base in the Middle of the Ocean

Levente Juhasz and Peter Mooney. 2022. “I Think i Discovered a Military Base in the Middle of the Ocean”—Null Island, the Most Real of Fictional Places. IEEE Access 10 (2022), 84147–84165. https://doi.org/10.1109/ACCESS.2022.3197222 Conference’17, July 2017, Washington, DC, USA Ilya Ilyankou, Meihui Wang, Stefano Cavazzi, and James Haworth

work page doi:10.1109/access.2022.3197222 2022
[17]

Julia Kreutzer, Isaac Caswell, Lisa Wang, Ahsan Wahab, Daan van Esch, Nasan- bayar Ulzii-Orshikh, Allahsera Tapo, Nishant Subramani, Artem Sokolov, Clay- tone Sikasote, Monang Setyawan, Supheakmungkol Sarin, Sokhar Samb, Benoît Sagot, Clara Rivera, Annette Rios, Isabel Papadimitriou, Salomey Osei, Pedro Or- tiz Suarez, Iroro Orife, Kelechi Ogueji, Andre N...

work page
[18]

Transactions of the Association for Computational Linguistics 10 (Jan

Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets. Transactions of the Association for Computational Linguistics 10 (Jan. 2022), 50–72. https://doi.org/10.1162/tacl_a_00447 arXiv:2103.12028 [cs]

work page doi:10.1162/tacl_a_00447 2022
[19]

Alexandra Sasha Luccioni and Joseph D. Viviano. 2021. What’s in the Box? A Preliminary Analysis of Undesirable Content in the Common Crawl Corpus. http://arxiv.org/abs/2105.02732 arXiv:2105.02732 [cs]

work page arXiv 2021
[20]

Gengchen Mai, Weiming Huang, Jin Sun, Suhang Song, Deepak Mishra, Ninghao Liu, Song Gao, Tianming Liu, Gao Cong, Yingjie Hu, Chris Cundy, Ziyuan Li, Rui Zhu, and Ni Lao. 2023. On the Opportunities and Challenges of Foundation Models for Geospatial Artificial Intelligence. http://arxiv.org/abs/2304.06798 arXiv:2304.06798 [cs]

work page arXiv 2023
[21]

Rohin Manvi, Samar Khanna, Marshall Burke, David Lobell, and Stefano Ermon

work page
[22]

arXiv:2402.02680, 2024

Large Language Models are Geographically Biased. http://arxiv.org/abs/ 2402.02680 arXiv:2402.02680 [cs]

work page arXiv
[23]

Mazda Moayeri and Soheil Feizi. 2024. WorldBench: Quantifying Geographic Disparities in LLM Factual Recall. https://openreview.net/forum?id=fubvUIBggI

work page 2024
[24]

Hannes Mühleisen and Christian Bizer. 2012. Web Data Commons – Extracting Structured Data from Two Large Web Corpora. (2012)

work page 2012
[25]

John Paolillo, Daniel Pimienta, and Daniel Prado. 2005. Measuring Linguistic Diversity on the Internet. (2005)

work page 2005
[26]

Guilherme Penedo, Quentin Malartic, Daniel Hesslow, Ruxandra Cojocaru, Alessandro Cappelli, Hamza Alobeidli, Baptiste Pannier, Ebtesam Almazrouei, and Julien Launay. 2023. The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only. http://arxiv.org/abs/2306. 01116 arXiv:2306.01116 [cs]

work page internal anchor Pith review Pith/arXiv arXiv 2023
[27]

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Im- proving Language Understanding by Generative Pre-Training. (2018)

work page 2018
[28]

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. http: //arxiv.org/abs/1910.10683 arXiv:1910.10683 [cs, stat]

work page internal anchor Pith review Pith/arXiv arXiv 2020
[29]

2013.Linked Data: Structured data on the Web

Luke Ruth, David Wood, Marsha Zaidman, and Michael Hausenblas. 2013.Linked Data: Structured data on the Web . Simon and Schuster. Google-Books-ID: hTozEAAAQBAJ

work page 2013
[30]

Sebastian Schmidt, Simon Manschitz, Christoph Rensing, and Ralf Steinmetz

work page
[31]

In Proceedings of the 13th International Conference on Knowl- edge Management and Knowledge Technologies

Extraction of Address Data from Unstructured Text using Free Knowl- edge Resources. In Proceedings of the 13th International Conference on Knowl- edge Management and Knowledge Technologies . ACM, Graz Austria, 1–8. https: //doi.org/10.1145/2494188.2494193

work page doi:10.1145/2494188.2494193
[32]

Pedro Javier Ortiz Suárez, Benoît Sagot, and Laurent Romary. 2021. Asynchronous Pipeline for Processing Huge Corpora on Medium to Low Resource Infrastruc- tures. (2021)

work page 2021
[33]

Gemini Team. 2024. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv:2403.05530 [cs.CL]

work page internal anchor Pith review Pith/arXiv arXiv 2024
[34]

Willem Robert Van Hage, Thomas Ploeger, and Jesper Hoeksema. 2014. Number frequency on the web. InProceedings of the 23rd International Conference on World Wide Web. ACM, Seoul Korea, 571–572. https://doi.org/10.1145/2567948.2576962

work page doi:10.1145/2567948.2576962 2014
[35]

W3Techs. 2024. Usage Statistics and Market Share of Top Level Domains for Websites, August 2024. https://w3techs.com/technologies/overview/top_level_ domain

work page 2024
[36]

Finetuned Language Models Are Zero-Shot Learners

Jason Wei, Maarten Bosma, Vincent Y. Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V. Le. 2022. Finetuned Language Models Are Zero-Shot Learners. http://arxiv.org/abs/2109.01652 arXiv:2109.01652 [cs]

work page internal anchor Pith review Pith/arXiv arXiv 2022
[37]

Tingyu Xie, Qi Li, Jian Zhang, Yan Zhang, Zuozhu Liu, and Hongwei Wang. 2023. Empirical Study of Zero-Shot NER with ChatGPT. (2023)

work page 2023
[38]

Wenxuan Zhou, Sheng Zhang, Yu Gu, Muhao Chen, and Hoifung Poon. 2024. UniversalNER: Targeted Distillation from Large Language Models for Open Named Entity Recognition. http://arxiv.org/abs/2308.03279 arXiv:2308.03279 [cs]. Received 31 May 2024

work page arXiv 2024

[1] [1]

Stefan Baack. 2024. A Critical Analysis of the Largest Source for Generative AI Training Data: Common Crawl. In The 2024 ACM Conference on Fairness, Accountability, and Transparency. ACM, Rio de Janeiro Brazil, 2199–2208. https: //doi.org/10.1145/3630106.3659033

work page doi:10.1145/3630106.3659033 2024

[2] [2]

Stefan Baack. 2024. Training Data for the Price of a Sandwich: Common Crawl’s Impact on Generative AI. (Feb. 2024)

work page 2024

[3] [3]

James E Bartlett, Joe W Kotrlik, and Chadwick C Higgins. 2001. Organizational Research: Determining Appropriate Sample Size in Survey Research. (2001)

work page 2001

[4] [4]

Prabin Bhandari, Antonios Anastasopoulos, and Dieter Pfoser. 2023. Are Large Language Models Geospatially Knowledgeable?. In Proceedings of the 31st ACM International Conference on Advances in Geographic Information Systems (SIGSPA- TIAL ’23). Association for Computing Machinery, New York, NY, USA, 1–4. https://doi.org/10.1145/3589132.3625625

work page doi:10.1145/3589132.3625625 2023

[5] [5]

Abeba Birhane, Vinay Uday Prabhu, and Emmanuel Kahembwe. 2021. Multimodal datasets: misogyny, pornography, and malignant stereotypes. http://arxiv.org/ abs/2110.01963 arXiv:2110.01963 [cs]

work page arXiv 2021

[6] [6]

John Bossler. 2010. Manual of Geospatial Science and Technology . CRC Press. Google-Books-ID: UdZ3uDekqwwC

work page 2010

[7] [7]

Chia-Hui Chang and Shu-Ying Li. 2010. MapMarker: Extraction of Postal Addresses and Associated Information for General Web Pages. In 2010 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology. IEEE, Toronto, AB, Canada, 105–111. https://doi.org/10.1109/WI- IAT.2010.64

work page doi:10.1109/wi- 2010

[8] [8]

Common Crawl. 2024. Common Crawl - Open Repository of Web Crawl Data. https://commoncrawl.org/

work page 2024

[9] [9]

Common Crawl. 2024. Statistics of Common Crawl Monthly Archives by com- moncrawl. https://commoncrawl.github.io/cc-crawl-statistics/plots/languages

work page 2024

[10] [10]

Jesse Dodge, Maarten Sap, Ana Marasović, William Agnew, Gabriel Ilharco, Dirk Groeneveld, Margaret Mitchell, and Matt Gardner. 2021. Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus. http://arxiv.org/abs/2104.08758 arXiv:2104.08758 [cs]

work page arXiv 2021

[11] [11]

Julia Efremova, Ian Endres, Isaac Vidas, and Ofer Melnik. 2018. A Geo-Tagging Framework for Address Extraction from Web Pages. In Advances in Data Mining. Applications and Theoretical Aspects, Petra Perner (Ed.). Springer International Publishing, Cham, 288–295. https://doi.org/10.1007/978-3-319-95786-9_22

work page doi:10.1007/978-3-319-95786-9_22 2018

[12] [12]

Nir Fulman, Abdulkadir Memduhoğlu, and Alexander Zipf. 2024. Evidence for Systematic Bias in the Spatial Memory of Large Language Models. (2024)

work page 2024

[13] [13]

Wes Gurnee and Max Tegmark. 2024. Language Models Represent Space and Time. https://doi.org/10.48550/arXiv.2310.02207 arXiv:2310.02207 [cs]

work page doi:10.48550/arxiv.2310.02207 2024

[14] [14]

Ilya Ilyankou, Meihui Wang, James Haworth, and Stefano Cavazzi. 2024. CC- GPX: Extracting High-Quality Annotated Geospatial Data from Common Crawl. https://doi.org/10.48550/arXiv.2405.11039 arXiv:2405.11039 [cs]

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2405.11039 2024

[15] [15]

ISO. 2012. geographic data — MLGT: The authoritative multi-lingual geographic information terminology database. https://isotc211.geolexica.org/concepts/202/

work page 2012

[16] [16]

I Think i Discovered a Military Base in the Middle of the Ocean

Levente Juhasz and Peter Mooney. 2022. “I Think i Discovered a Military Base in the Middle of the Ocean”—Null Island, the Most Real of Fictional Places. IEEE Access 10 (2022), 84147–84165. https://doi.org/10.1109/ACCESS.2022.3197222 Conference’17, July 2017, Washington, DC, USA Ilya Ilyankou, Meihui Wang, Stefano Cavazzi, and James Haworth

work page doi:10.1109/access.2022.3197222 2022

[17] [17]

Julia Kreutzer, Isaac Caswell, Lisa Wang, Ahsan Wahab, Daan van Esch, Nasan- bayar Ulzii-Orshikh, Allahsera Tapo, Nishant Subramani, Artem Sokolov, Clay- tone Sikasote, Monang Setyawan, Supheakmungkol Sarin, Sokhar Samb, Benoît Sagot, Clara Rivera, Annette Rios, Isabel Papadimitriou, Salomey Osei, Pedro Or- tiz Suarez, Iroro Orife, Kelechi Ogueji, Andre N...

work page

[18] [18]

Transactions of the Association for Computational Linguistics 10 (Jan

Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets. Transactions of the Association for Computational Linguistics 10 (Jan. 2022), 50–72. https://doi.org/10.1162/tacl_a_00447 arXiv:2103.12028 [cs]

work page doi:10.1162/tacl_a_00447 2022

[19] [19]

Alexandra Sasha Luccioni and Joseph D. Viviano. 2021. What’s in the Box? A Preliminary Analysis of Undesirable Content in the Common Crawl Corpus. http://arxiv.org/abs/2105.02732 arXiv:2105.02732 [cs]

work page arXiv 2021

[20] [20]

Gengchen Mai, Weiming Huang, Jin Sun, Suhang Song, Deepak Mishra, Ninghao Liu, Song Gao, Tianming Liu, Gao Cong, Yingjie Hu, Chris Cundy, Ziyuan Li, Rui Zhu, and Ni Lao. 2023. On the Opportunities and Challenges of Foundation Models for Geospatial Artificial Intelligence. http://arxiv.org/abs/2304.06798 arXiv:2304.06798 [cs]

work page arXiv 2023

[21] [21]

Rohin Manvi, Samar Khanna, Marshall Burke, David Lobell, and Stefano Ermon

work page

[22] [22]

arXiv:2402.02680, 2024

Large Language Models are Geographically Biased. http://arxiv.org/abs/ 2402.02680 arXiv:2402.02680 [cs]

work page arXiv

[23] [23]

Mazda Moayeri and Soheil Feizi. 2024. WorldBench: Quantifying Geographic Disparities in LLM Factual Recall. https://openreview.net/forum?id=fubvUIBggI

work page 2024

[24] [24]

Hannes Mühleisen and Christian Bizer. 2012. Web Data Commons – Extracting Structured Data from Two Large Web Corpora. (2012)

work page 2012

[25] [25]

John Paolillo, Daniel Pimienta, and Daniel Prado. 2005. Measuring Linguistic Diversity on the Internet. (2005)

work page 2005

[26] [26]

Guilherme Penedo, Quentin Malartic, Daniel Hesslow, Ruxandra Cojocaru, Alessandro Cappelli, Hamza Alobeidli, Baptiste Pannier, Ebtesam Almazrouei, and Julien Launay. 2023. The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only. http://arxiv.org/abs/2306. 01116 arXiv:2306.01116 [cs]

work page internal anchor Pith review Pith/arXiv arXiv 2023

[27] [27]

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Im- proving Language Understanding by Generative Pre-Training. (2018)

work page 2018

[28] [28]

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. http: //arxiv.org/abs/1910.10683 arXiv:1910.10683 [cs, stat]

work page internal anchor Pith review Pith/arXiv arXiv 2020

[29] [29]

2013.Linked Data: Structured data on the Web

Luke Ruth, David Wood, Marsha Zaidman, and Michael Hausenblas. 2013.Linked Data: Structured data on the Web . Simon and Schuster. Google-Books-ID: hTozEAAAQBAJ

work page 2013

[30] [30]

Sebastian Schmidt, Simon Manschitz, Christoph Rensing, and Ralf Steinmetz

work page

[31] [31]

In Proceedings of the 13th International Conference on Knowl- edge Management and Knowledge Technologies

Extraction of Address Data from Unstructured Text using Free Knowl- edge Resources. In Proceedings of the 13th International Conference on Knowl- edge Management and Knowledge Technologies . ACM, Graz Austria, 1–8. https: //doi.org/10.1145/2494188.2494193

work page doi:10.1145/2494188.2494193

[32] [32]

Pedro Javier Ortiz Suárez, Benoît Sagot, and Laurent Romary. 2021. Asynchronous Pipeline for Processing Huge Corpora on Medium to Low Resource Infrastruc- tures. (2021)

work page 2021

[33] [33]

Gemini Team. 2024. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv:2403.05530 [cs.CL]

work page internal anchor Pith review Pith/arXiv arXiv 2024

[34] [34]

Willem Robert Van Hage, Thomas Ploeger, and Jesper Hoeksema. 2014. Number frequency on the web. InProceedings of the 23rd International Conference on World Wide Web. ACM, Seoul Korea, 571–572. https://doi.org/10.1145/2567948.2576962

work page doi:10.1145/2567948.2576962 2014

[35] [35]

W3Techs. 2024. Usage Statistics and Market Share of Top Level Domains for Websites, August 2024. https://w3techs.com/technologies/overview/top_level_ domain

work page 2024

[36] [36]

Finetuned Language Models Are Zero-Shot Learners

Jason Wei, Maarten Bosma, Vincent Y. Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V. Le. 2022. Finetuned Language Models Are Zero-Shot Learners. http://arxiv.org/abs/2109.01652 arXiv:2109.01652 [cs]

work page internal anchor Pith review Pith/arXiv arXiv 2022

[37] [37]

Tingyu Xie, Qi Li, Jian Zhang, Yan Zhang, Zuozhu Liu, and Hongwei Wang. 2023. Empirical Study of Zero-Shot NER with ChatGPT. (2023)

work page 2023

[38] [38]

Wenxuan Zhou, Sheng Zhang, Yu Gu, Muhao Chen, and Hoifung Poon. 2024. UniversalNER: Targeted Distillation from Large Language Models for Open Named Entity Recognition. http://arxiv.org/abs/2308.03279 arXiv:2308.03279 [cs]. Received 31 May 2024

work page arXiv 2024