CC-GPX: Extracting High-Quality Annotated Geospatial Data from Common Crawl

Ilya Ilyankou; James Haworth; Meihui Wang; Stefano Cavazzi

arxiv: 2405.11039 · v3 · submitted 2024-05-17 · 💻 cs.CL

Pith reviewed 2026-05-24 01:04 UTC · model grok-4.3

classification 💻 cs.CL

keywords Common Crawl · GPX files · geospatial data · multimodal dataset · user-generated tracks · trajectory data · outdoor activities

The pith

A pipeline extracts 1,416 human-written descriptions paired with GPX tracks from Common Crawl.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a method to locate GPX files within the massive Common Crawl web archive and turn them into a clean dataset. It yields 1,416 pairings of textual descriptions with MultiLineString vector representations of user-generated tracks from the six newest releases. The resulting resource supplies real human data on outdoor movement instead of relying on synthetic routes. Researchers can therefore examine actual activity patterns and train models for trajectory tasks directly on observed examples.

Core claim

The authors describe an efficient pipeline that locates GPX files in the six latest Common Crawl releases, parses valid user-generated tracks into MultiLineString format, and matches them with accompanying human-written descriptions to form a multimodal dataset of 1,416 entries.

What carries the argument

The extraction pipeline that scans Common Crawl for GPX files, validates and parses them into vector geometry, and associates them with textual descriptions.
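The parsing step can be illustrated with a minimal sketch. This is not the authors' code. It assumes the GPX file is already available as a string, uses only the Python standard library, and the names `gpx_to_multilinestring` and `to_wkt` are hypothetical. Each `<trkseg>` becomes one line string; dropping segments with fewer than two valid points is an assumed validity rule, not one stated in the paper.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip an XML namespace, so GPX 1.0 and 1.1 files are handled alike."""
    return tag.rsplit("}", 1)[-1]


def gpx_to_multilinestring(gpx_text):
    """Return a list of line strings, one per <trkseg>, as lists of (lon, lat)."""
    root = ET.fromstring(gpx_text)
    lines = []
    for seg in root.iter():
        if local_name(seg.tag) != "trkseg":
            continue
        points = []
        for pt in seg:
            if local_name(pt.tag) != "trkpt":
                continue
            try:
                lat = float(pt.attrib["lat"])
                lon = float(pt.attrib["lon"])
            except (KeyError, ValueError):
                continue  # skip points with missing or non-numeric coordinates
            if -90 <= lat <= 90 and -180 <= lon <= 180:
                points.append((lon, lat))  # WKT order is x (lon), then y (lat)
        if len(points) >= 2:  # assumed rule: a line needs at least two points
            lines.append(points)
    return lines


def to_wkt(lines):
    """Serialise the line strings as a WKT MULTILINESTRING."""
    if not lines:
        return "MULTILINESTRING EMPTY"
    parts = ", ".join(
        "(" + ", ".join(f"{x} {y}" for x, y in line) + ")" for line in lines
    )
    return f"MULTILINESTRING ({parts})"
```

Matching on the local tag name rather than a fixed namespace is a design choice for tolerance: files found on the open web may declare different GPX versions or omit the namespace entirely.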

If this is right

The dataset allows direct study of real outdoor activity patterns from actual recorded tracks.
Trajectory generation models can be trained on authentic human routes rather than synthetic ones.
Track annotation models gain paired text-geometry examples for supervised learning.
The resource supports any geospatial task that benefits from observed rather than generated routes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Extending the same pipeline to older Common Crawl releases would likely produce a substantially larger collection.
The paired descriptions could be analyzed for recurring linguistic patterns in how people describe specific terrain or activities.
Linking the tracks to additional web metadata might enable studies of how location context influences route choice.

Load-bearing premise

GPX files found in Common Crawl mostly contain genuine user tracks that are accurately paired with their descriptions and parse cleanly into geometry without substantial noise or errors.

What would settle it

A random sample audit revealing that more than 20 percent of the extracted pairs contain mismatched descriptions or invalid non-user tracks would show the pipeline does not deliver high-quality annotated data.
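Such an audit could be set up roughly as follows. This is a sketch under stated assumptions: the sample size of 200 is arbitrary, the helper names are hypothetical, and the Wilson score interval is one common choice for bounding a proportion, not something the paper prescribes. The 20 percent threshold is the one named above.

```python
import math
import random


def draw_audit_sample(pair_ids, n, seed=0):
    """Draw n distinct pair ids uniformly at random (seeded for repeatability)."""
    rng = random.Random(seed)
    return rng.sample(list(pair_ids), n)


def wilson_interval(bad, n, z=1.96):
    """Approximate 95% Wilson score interval for the true share of bad pairs."""
    if n <= 0:
        raise ValueError("sample size must be positive")
    p = bad / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


# Hypothetical usage: ids 0..1415 stand in for the 1,416 released pairs.
sample_ids = draw_audit_sample(range(1416), 200, seed=42)
# After manual review, suppose `bad` pairs were judged mismatched or invalid:
# low, high = wilson_interval(bad, len(sample_ids))
# The audit counts against the dataset only if `low` exceeds 0.20.
```

Comparing the lower bound, rather than the raw sample proportion, against the threshold guards against rejecting the dataset on sampling noise alone.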

Figures

Figures reproduced from arXiv: 2405.11039 by Ilya Ilyankou, James Haworth, Meihui Wang, Stefano Cavazzi.

**Figure 2.** A 13.8 km (8.6 mi) circular route in Germany. The …

**Figure 3.** The tracks in the resulting dataset represent real-life outdoor activities that were either recorded (i.e., completed) or planned using GIS software. Thus, these routes can be used in place of synthetically generated trajectory datasets, as well as for training …
Footnotes: 12 https://github.com/aboSamoor/pycld2 · 13 https://github.com/argosopentech/argos-translate · 14 https://opennmt.net/ · 15 https://www.earthdata.nasa.gov/se…

**Figure 3.** Select dataset properties

Original abstract

The Common Crawl (CC) corpus is the largest open web crawl dataset containing 9.5+ petabytes of data captured since 2008. The dataset is instrumental in training large language models, and as such it has been studied for (un)desirable content, and distilled for smaller, domain-specific datasets. However, to our knowledge, no research has been dedicated to using CC as a source of annotated geospatial data. In this paper, we introduce an efficient pipeline to extract annotated user-generated tracks from GPX files found in CC, and the resulting multimodal dataset with 1,416 pairings of human-written descriptions and MultiLineString vector data from the 6 most recent CC releases. The dataset can be used to study people's outdoor activity patterns, the way people talk about their outdoor experiences, as well as for developing trajectory generation or track annotation models, or for various other problems in place of synthetically generated routes. Our reproducible code is available on GitHub: https://github.com/ilyankou/cc-gpx
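To make the "find GPX files in CC" step concrete, here is a hedged sketch of filtering Common Crawl index records for GPX candidates. It assumes the records are already available as JSON objects, one per line, with the `url`, `mime`, `mime-detected`, and `status` fields that the CC URL index exposes. The MIME string `application/gpx+xml`, the `.gpx` suffix heuristic, and both function names are assumptions for illustration; the authors' actual selection criteria are in their repository.

```python
import json
from urllib.parse import urlsplit

# Assumed MIME type for GPX; real-world servers often mislabel these files.
GPX_MIMES = {"application/gpx+xml"}


def is_gpx_candidate(record):
    """Heuristic: successful fetch plus a GPX MIME type or a .gpx URL path."""
    if str(record.get("status")) != "200":
        return False
    mimes = {record.get("mime"), record.get("mime-detected")}
    if mimes & GPX_MIMES:
        return True
    path = urlsplit(record.get("url") or "").path
    return path.lower().endswith(".gpx")


def filter_index_lines(lines):
    """Yield index records (dicts) that look like GPX files; skip bad lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        if isinstance(record, dict) and is_gpx_candidate(record):
            yield record
```

Filtering on index metadata first means only candidate records need to be fetched from the archive, which is presumably where much of a pipeline's efficiency would come from.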

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper extracts 1,416 GPX-description pairs from Common Crawl with open code but reports no validation of the pairs.

The one thing to know is that this paper pulls a small set of GPX data from Common Crawl but does not validate the quality of the extracted pairs. The authors introduce an efficient pipeline for this extraction and release the resulting 1,416 multimodal samples along with their code. This is genuinely new because no one has mined Common Crawl for annotated geospatial GPX data before. They do well to focus on recent crawls and to make everything reproducible. The GitHub link is a plus for anyone who wants to run or improve the process. The main weakness is the absence of any quality evidence. There are no metrics on how accurately the GPX files convert to MultiLineString geometry, no checks on whether the descriptions match the tracks, and no filtering details or error rates. This makes the 'high-quality' claim hard to assess from the text alone. A reader interested in geospatial language modeling or outdoor activity analysis could use this as a starting point, but they would need to add their own validation. The dataset size is modest, so it is not going to change the field, but it is a practical contribution for niche uses. I would bring this to a reading group to talk about web data mining techniques. I would not cite it in my own work until validation is added. It should go through peer review because the core idea is sound and the code is public, even if the current version needs more support for the quality assertions.

Referee Report

2 major / 1 minor

Summary. The paper introduces an efficient pipeline to extract annotated user-generated tracks from GPX files found in Common Crawl, and releases the resulting multimodal dataset consisting of 1,416 pairings of human-written descriptions and MultiLineString vector data drawn from the six most recent CC releases. The work positions the dataset as a resource for studying outdoor activity patterns, trajectory generation, and track annotation models, with reproducible code provided on GitHub.

Significance. If the extracted pairs prove to be valid user-generated tracks with accurate geometry and meaningful description correspondence, the dataset would offer a rare real-world alternative to synthetic geospatial data for multimodal modeling tasks in NLP and GIS. The emphasis on reproducibility via open code is a clear strength that supports potential reuse.

major comments (2)

[Abstract / pipeline description] Abstract and pipeline description: The central claim that the output constitutes 'high-quality annotated geospatial data' rests on the assumptions that discovered GPX files are valid user-generated tracks, parse accurately into MultiLineString geometry, and are meaningfully paired with human-written descriptions. No error rates, fidelity metrics, manual audit results, or filtering criteria are reported to substantiate these conditions for the final 1,416 pairs.
[Results] Results section: The manuscript reports only the final count of 1,416 pairings without any quantitative or qualitative validation (e.g., sample inspection for parsing failures, description-track alignment, or noise levels), leaving the quality claim unsupported and preventing assessment of whether the dataset meets the standards implied by its intended downstream uses.

minor comments (1)

[Methods] The GitHub link is provided but no details on exact CC release identifiers or crawl dates used are given in the text, which would aid reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. The comments highlight the need for stronger substantiation of the dataset quality, which we address below. We propose revisions to incorporate additional validation details while maintaining the focus on the extraction pipeline and released dataset.

Referee: [Abstract / pipeline description] Abstract and pipeline description: The central claim that the output constitutes 'high-quality annotated geospatial data' rests on the assumptions that discovered GPX files are valid user-generated tracks, parse accurately into MultiLineString geometry, and are meaningfully paired with human-written descriptions. No error rates, fidelity metrics, manual audit results, or filtering criteria are reported to substantiate these conditions for the final 1,416 pairs.

Authors: We agree that the manuscript would be strengthened by explicit reporting of filtering criteria and validation metrics. The pipeline section describes the steps for identifying GPX files, parsing them into MultiLineString geometries, and pairing with descriptions, including basic inclusion criteria such as file size and content presence. However, quantitative error rates and sample audits are not currently included. In the revised version, we will expand the pipeline description to detail all filtering criteria and add a validation subsection reporting parsing success rates and results from a manual inspection of a random sample of pairs for geometry validity and description alignment.

revision: yes

Referee: [Results] Results section: The manuscript reports only the final count of 1,416 pairings without any quantitative or qualitative validation (e.g., sample inspection for parsing failures, description-track alignment, or noise levels), leaving the quality claim unsupported and preventing assessment of whether the dataset meets the standards implied by its intended downstream uses.

Authors: The results section presents the scale of extraction across the six CC releases as the primary outcome. We acknowledge that this leaves the quality unsupported by direct evidence in the current draft. We will revise the results section to include a new subsection with quantitative validation metrics (such as the proportion of files successfully parsed) and qualitative assessment via sample review, to better support the dataset's suitability for the intended uses.

revision: yes
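One form the promised geometry check could take is a per-track sanity report. The sketch below is an illustration only: it assumes tracks are lists of line strings with `(lon, lat)` points in degrees, the 5 km jump threshold is an arbitrary placeholder, and `haversine_km` and `track_stats` are hypothetical names rather than anything the paper or rebuttal specifies.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(a, b):
    """Great-circle distance in km between two (lon, lat) points in degrees."""
    lon1, lat1 = map(math.radians, a)
    lon2, lat2 = map(math.radians, b)
    h = (
        math.sin((lat2 - lat1) / 2) ** 2
        + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(h)))


def track_stats(lines, max_jump_km=5.0):
    """Total length plus a count of implausibly long hops between points."""
    total = 0.0
    jumps = 0
    for line in lines:
        for a, b in zip(line, line[1:]):
            d = haversine_km(a, b)
            total += d
            if d > max_jump_km:  # assumed threshold; tune per activity type
                jumps += 1
    return {"length_km": total, "suspicious_jumps": jumps}
```

A reported length can then be compared with what the description says (for example, a stated route distance), and tracks with many suspicious jumps can be queued for manual review.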

Circularity Check

0 steps flagged

No circularity: pure data extraction pipeline

The paper presents a procedural pipeline for locating, parsing, and pairing GPX files from Common Crawl releases with textual descriptions. No equations, fitted parameters, predictions, uniqueness theorems, or ansatzes appear. The central output (1,416 pairs) is produced by explicit filtering and parsing steps whose validity is an empirical claim, not a definitional reduction. No self-citation chain supports any load-bearing premise. This is the expected non-finding for a data-release paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The pipeline rests on the domain assumption that GPX files in the web archive are parseable and that associated text constitutes valid human annotations for the tracks.

axioms (1)

domain assumption: GPX files present in Common Crawl are valid, downloadable, and contain user-generated track data paired with descriptions.
The extraction process presupposes that the files can be located, downloaded, and parsed without major corruption or mismatch between geometry and text.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Quantifying Geospatial in the Common Crawl Corpus
cs.CL 2024-06 unverdicted novelty 5.0

Analysis estimates 18.7% of Common Crawl documents contain geospatial information like coordinates and addresses, with little difference by language.

Reference graph

Works this paper leans on

11 extracted references · 11 canonical work pages · cited by 1 Pith paper

[1] Meta AI. 2024. Introducing Meta Llama 3: The most capable openly available LLM to date. https://ai.meta.com/blog/meta-llama-3/

[2] Stefan Baack. 2024. Training Data for the Price of a Sandwich: Common Crawl’s Impact on Generative AI. (Feb. 2024)

[3] Asier Gutiérrez-Fandiño, David Pérez-Fernández, Jordi Armengol-Estapé, David Griol, and Zoraida Callejas. 2022. esCorpius: A Massive Spanish Crawling Corpus. http://arxiv.org/abs/2206.15147 arXiv:2206.15147 [cs]

[4] Armağan Karahanoğlu, Rúben Gouveia, Jasper Reenalda, and Geke Ludden. 2021. How Are Sports-Trackers Used by Runners? Running-Related Data, Personal Goals, and Self-Tracking in Running. Sensors 21, 11 (Jan. 2021), 3687. https://doi.org/10.3390/s21113687

[5] Alexandra Sasha Luccioni and Joseph D. Viviano. 2021. What’s in the Box? A Preliminary Analysis of Undesirable Content in the Common Crawl Corpus. http://arxiv.org/abs/2105.02732 arXiv:2105.02732 [cs]

[6] Jason R. Smith, Herve Saint-Amand, Magdalena Plamada, Philipp Koehn, Chris Callison-Burch, and Adam Lopez. 2013. Dirt cheap web-scale parallel text from the Common Crawl. In: 51st A... doi:10.5167/uzh-80038

[7] Alan D Thompson. 2022. What’s in my AI? A Comprehensive Analysis of Datasets Used to Train GPT-1, GPT-2, GPT-3, GPT-NeoX-20B, Megatron-11B, MT-NLG, and Gopher. (2022)

[8] Michał Turski, Tomasz Stanisławek, Karol Kaczmarek, Paweł Dyda, and Filip Graliński. 2023. CCpdf: Building a High Quality Corpus for Visually Rich Documents from Web Crawl Data. In Document Analysis and Recognition - ICDAR 2023, Gernot A. Fink, Rajiv Jain, Koichi Kise, and Richard Zanibbi (Eds.). Springer Nature Switzerland, Cham, 348–365. https://do... doi:10.1007/978-3-031-41682-8_22

[9] Maurice Weber, Carlo Siebenschuh, Rory M Butler, Anton Alexandrov, Valdemar R Thanner, Georgios Tsolakis, Haris Jabbar, Ian Foster, Bo Li, and Rick Stevens

[10] WordScape: a Pipeline to extract multilingual, visually rich Documents with Layout Annotations from Web Crawl Data. (2023)

[11] Wenhao Zhu, Hongyi Liu, Qingxiu Dong, Jingjing Xu, Shujian Huang, Lingpeng Kong, Jiajun Chen, and Lei Li. 2023. Multilingual Machine Translation with Large Language Models: Empirical Results and Analysis. http://arxiv.org/abs/2304.04675 arXiv:2304.04675 [cs]

Received 29 May 2024
