Smart Bilingual Focused Crawling of Parallel Documents

Cristian García-Romero; Felipe Sánchez-Martínez; Miquel Esplà-Gomis

arxiv: 2405.14779 · v2 · submitted 2024-05-23 · 💻 cs.CL · cs.LG

Pith reviewed 2026-05-24 00:51 UTC · model grok-4.3

classification 💻 cs.CL cs.LG

keywords focused crawling · parallel texts · language identification · parallelism detection · Transformer · bilingual corpora · web crawling · URL analysis

The pith

Neural models that read language and parallelism from URLs alone let crawlers locate more parallel documents while skipping many useless pages.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to replace the usual brute-force download of web pages with a guided crawl that uses two predictions made from URLs before full documents are fetched. A Transformer encoder is fine-tuned first to guess a page's language from its URL alone and second to decide whether any two URLs point to mutual translations. These signals are then used inside the crawler to prioritize links that are likely to yield parallel content for a chosen language pair. If the approach works, the same crawl effort produces more usable bilingual text and fewer wasted downloads.

Core claim

By fine-tuning a pre-trained multilingual Transformer encoder for the tasks of inferring document language from a single URL and inferring parallelism from a pair of URLs, the method enables early identification of parallel content during crawling; the resulting system downloads fewer non-parallel documents and returns a larger set of parallel pairs than conventional unguided crawling for the same language pair.

What carries the argument

A pre-trained multilingual Transformer encoder fine-tuned into two URL-based models: one for language identification from a single URL and one for parallelism judgment from a URL pair.
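
As a rough illustration of what "reading from the URL alone" can mean in practice, the sketch below turns a URL, or a pair of URLs, into a plain string that an encoder's tokenizer could consume. The function names, the token-splitting rule, and the `[SEP]` separator are assumptions made for this example; the text reviewed here does not describe the paper's actual preprocessing.

```python
import re
from urllib.parse import urlsplit, unquote


def url_to_text(url: str) -> str:
    """Turn a URL into a space-separated string of lowercase tokens.

    Hypothetical preprocessing: the scheme is dropped, and host, path and
    query are split on every non-alphanumeric character (and underscore).
    """
    parts = urlsplit(url)
    raw = parts.netloc + parts.path
    if parts.query:
        raw += "?" + parts.query
    raw = unquote(raw)  # decode %-escapes so words in the path become visible
    tokens = [t for t in re.split(r"[\W_]+", raw) if t]
    return " ".join(tokens).lower()


def pair_to_text(url_a: str, url_b: str, sep: str = " [SEP] ") -> str:
    """Join two preprocessed URLs into one input for a pair classifier.

    The separator is an assumption; a real tokenizer would supply its own.
    """
    return url_to_text(url_a) + sep + url_to_text(url_b)


if __name__ == "__main__":
    print(url_to_text("https://example.com/es/productos/lista.html?page=2"))
    print(pair_to_text("https://example.com/en/about",
                       "https://example.com/es/about"))
```

Splitting on punctuation keeps short cues such as `es` or `en` as standalone tokens, which is the kind of signal a URL-only classifier would presumably rely on.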

If this is right

The crawler can abandon or deprioritize branches of the web graph once language or parallelism predictions indicate low yield.
For a fixed download budget, the number of extracted parallel documents increases.
The same two models can be retrained for any new language pair provided suitable labeled URL data exist.
Bandwidth and storage are saved because non-parallel pages are often identified before their full content is fetched.
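
The prioritization these points rely on can be pictured as a priority queue of pending URLs. The sketch below is a minimal version under stated assumptions: the priority of a link is taken to be the product of a language probability and a parallelism probability, and the two scoring functions are toy heuristics standing in for the fine-tuned models. `Frontier`, `enqueue_links`, and both `toy_*` functions are hypothetical names, not part of the paper or of any crawler.

```python
import heapq
import itertools


class Frontier:
    """Pending-download queue that always pops the highest-scoring URL."""

    def __init__(self):
        self._heap = []
        self._order = itertools.count()  # tie-breaker: FIFO among equal scores
        self._seen = set()

    def push(self, url, score):
        if url in self._seen:  # never queue the same URL twice
            return
        self._seen.add(url)
        # heapq is a min-heap, so the score is negated to pop the maximum first.
        heapq.heappush(self._heap, (-score, next(self._order), url))

    def pop(self):
        neg_score, _, url = heapq.heappop(self._heap)
        return url, -neg_score

    def __len__(self):
        return len(self._heap)


def toy_lang_prob(url):
    """Stand-in for the URL language model (assumption: target language is Spanish)."""
    return 0.9 if "/es/" in url else 0.1


def toy_parallel_prob(source_url, candidate_url):
    """Stand-in for the URL-pair parallelism model."""
    return 0.8 if source_url.replace("/en/", "/es/") == candidate_url else 0.2


def enqueue_links(frontier, source_url, links, lang_prob, parallel_prob):
    """Score every outgoing link of a downloaded page and queue it."""
    for link in links:
        score = lang_prob(link) * parallel_prob(source_url, link)
        frontier.push(link, score)


if __name__ == "__main__":
    frontier = Frontier()
    enqueue_links(
        frontier,
        "https://example.com/en/about",
        ["https://example.com/en/contact",
         "https://example.com/es/about",
         "https://example.com/es/news"],
        toy_lang_prob,
        toy_parallel_prob,
    )
    while len(frontier):
        print(frontier.pop())
```

Low-scoring links stay in the queue rather than being discarded, so under a fixed download budget they are simply reached last or not at all.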

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The URL-only approach could be combined with lightweight content sampling after the first few bytes to handle sites whose URLs are uninformative.
Extending the same training procedure to additional language pairs would test whether the reported gains generalize beyond the pairs evaluated in the paper.
The method's efficiency gain depends on how often real web URLs carry language or parallelism cues; sites that hide such cues would limit the advantage.

Load-bearing premise

The models trained on the chosen data will keep their accuracy when applied to the URLs that appear during live crawling for the target language pair.

What would settle it

Run the smart crawler and a standard crawler from identical seed URLs for the same language pair and measure parallel pairs found per document downloaded; if the smart crawler does not collect more parallel pairs or does not reduce useless downloads, the central claim does not hold.
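
The comparison described above reduces to two small measurements, sketched here on made-up download logs. Each log is assumed to be a list of booleans in download order, where `True` means the document ended up in a parallel pair; the function names and the toy logs are illustrative only and carry no results from the paper.

```python
from typing import List, Sequence


def cumulative_yield(is_parallel: Sequence[bool], percents: Sequence[int]) -> List[int]:
    """Count parallel documents among the first p% of downloads, for each p."""
    n = len(is_parallel)
    counts = []
    for p in percents:
        cutoff = (n * p) // 100  # number of documents downloaded at p%
        counts.append(sum(is_parallel[:cutoff]))
    return counts


def pairs_per_download(parallel_pairs: int, downloads: int) -> float:
    """Parallel pairs found per document downloaded."""
    return parallel_pairs / downloads if downloads else 0.0


if __name__ == "__main__":
    # Toy logs, invented for illustration: both crawls end with the same total,
    # but the guided one reaches the parallel documents earlier.
    guided = [True, True, False, True, False, False, False, False, False, False]
    unguided = [False, False, True, False, False, True, False, False, True, False]
    print(cumulative_yield(guided, [20, 50, 100]))
    print(cumulative_yield(unguided, [20, 50, 100]))
```

Looking at the yield at several cut-offs rather than only at the end matters because two crawlers that exhaust the same site will tie at 100%; any advantage shows up earlier in the crawl.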

Figures

Figures reproduced from arXiv: 2405.14779 by Cristian García-Romero, Felipe Sánchez-Martínez, Miquel Esplà-Gomis.

**Figure 1.** Architecture of the model used for language identification from URLs.

[…] data in such a way that URLs from the same web domain only appear in one of the splits

**Figure 2.** Language distribution for the training, development and test sets used in the experiments for language identification.

**Figure 3.** Language identification results on a per-language basis, comparing our model with the baseline (left subfigure) and with a FastText model trained on the same data (right subfigure). Only languages with a minimum of 100 URLs and 10 different web domains in the test set are included.

**Figure 4.** Architecture of the model used for inferring parallelness from URL pairs.

[…] linked URL. Finally, we check if one of these pairs appears in the gold standard, and if it does, we label the rest of pairs as non-parallel; otherwise we discard all the pairs in the set. We alleviate the other two limitations by extending our training data with the MaCoCu (Bañón et al. 2022) v2 corpus, which covers 10 additional la…

**Figure 5.** Macro F1 scores per language (paired with English) for the parallelness identifier from URLs on the dataset described in Section 4.1.

[…] 5. The model that infers parallelness from URLs is used to obtain the probability of each pair $\{(u, v_i)\}_{i=1}^{N}$ of being parallel. 6. Each URL $v_i$ is added to the list of pending downloads with a priority score determined by multiplying the probabilities obtained in steps 4 and …

**Figure 6.** General architecture of our approach for smart bilingual focused crawling of parallel documents. In this example, A and B are English and Spanish, the language $L'$ of the downloaded document is English, and the language $L''$ used to obtain the language probabilities is Spanish.

[…] Evaluation. We measure the performance of our approach in terms of the amount of parallel data downloaded at different moments of …

**Figure 7.** For the three crawlers (Heritrix, Heritrix+CLD2 and Heritrix+CLD2+smart) and the language pairs eng-isl, eng-mlt, eng-fin, and spa-eus, thousands of parallel documents retrieved (y-axis) as a function of the percentage of documents downloaded (x-axis) from each website. Each data point accumulates the number of parallel documents downloaded from each website up to the percentage indicated on the x-axis.
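
The text attached to Figure 1 mentions splitting the data so that URLs from the same web domain appear in only one split. One simple way to get that property, shown here as a sketch and not as the authors' procedure, is to assign each URL to a split from a stable hash of its host. The function name, the 80/10/10 proportions, and the use of the full host name as the "domain" are all assumptions.

```python
import hashlib
from urllib.parse import urlsplit


def split_for(url: str, dev_pct: int = 10, test_pct: int = 10) -> str:
    """Assign a URL to 'train', 'dev' or 'test' based only on its host.

    Every URL from one host gets the same answer, so no host can leak
    across splits. Note that 'www.example.com' and 'example.com' count
    as different hosts in this simplified version.
    """
    domain = urlsplit(url).netloc.lower()
    # sha256 is used instead of the built-in hash() so that the assignment
    # is identical across runs and machines.
    bucket = int(hashlib.sha256(domain.encode("utf-8")).hexdigest(), 16) % 100
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + dev_pct:
        return "dev"
    return "train"


if __name__ == "__main__":
    for u in ("https://example.com/en/a",
              "https://example.com/es/b",
              "https://example.org/index.html"):
        print(u, "->", split_for(u))
```

Because assignment depends on the host only, the realized split sizes follow the distribution of URLs per host and will only approximate the requested percentages.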

Abstract

Crawling parallel texts -- texts that are mutual translations -- from the Internet is usually done following a brute-force approach: documents are massively downloaded in an unguided process, and only a fraction of them end up leading to actual parallel content. In this work we propose a smart crawling method that guides the crawl towards finding parallel content more rapidly. We follow a neural approach that consists in adapting a pre-trained multilingual language model based on the encoder of the Transformer architecture by fine-tuning it for two new tasks: inferring the language of a document from its Uniform Resource Locator (URL), and inferring whether a pair of URLs link to parallel documents. We evaluate both models in isolation and their integration into a crawling tool. The results demonstrate the individual effectiveness of both models, and highlight that their combination enables us to address a practical engineering challenge: the early discovery of parallel content during web crawling in a given language pair. This leads to a reduction in the amount of downloaded documents deemed useless, and yields a greater quantity of parallel documents compared to conventional crawling approaches.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper adapts Transformers to score URLs for language and parallelism, then folds those scores into a crawler, but the abstract gives no metrics or baselines, so the gains stay unverified.

The new piece here is training a multilingual encoder on two URL-specific tasks: one that predicts document language from the URL alone, and another that predicts whether a pair of URLs point to parallel content. Those predictions then steer the crawler instead of downloading everything first. That combination is a direct engineering response to the waste in standard parallel-text collection for machine translation. The framing of the problem is clear and the tasks are a non-trivial reuse of existing encoder models rather than a trivial add-on. If the full experiments show solid precision on real crawl URLs, the approach could cut the fraction of useless downloads in bilingual crawling pipelines.

The main weakness is that the abstract asserts higher yield and lower waste without any numbers, baselines, dataset descriptions, or error analysis. The stress-test point on generalization is also live: the models are fine-tuned on chosen collections, but nothing in the provided text shows they hold up on the noisy, long-tail URLs that appear in actual crawls. If accuracy drops there, the early-pruning benefit disappears and the system reverts to ordinary crawling.

This work is aimed at applied NLP groups that maintain parallel corpora. A reader building data pipelines would get value from the method description and the two auxiliary tasks even if the quantitative claims need checking. The paper shows clear thinking on a recurring engineering cost and engages the practical literature, so it deserves a serious referee to evaluate the experiments and the generalization evidence.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes a neural smart crawling method for parallel documents that fine-tunes a pre-trained multilingual Transformer encoder on two URL-based tasks: language identification from a single URL and parallelism detection from URL pairs. It claims that integrating these models enables earlier discovery of parallel content during crawling for a given language pair, thereby reducing the volume of useless documents downloaded and increasing the quantity of parallel documents obtained relative to conventional unguided crawling.

Significance. If the empirical claims are substantiated, the work could improve the efficiency of harvesting parallel corpora for machine translation, especially for lower-resource language pairs where brute-force crawling is particularly wasteful. The approach builds on standard fine-tuning of encoder-only models for auxiliary prediction tasks, which is a technically straightforward but potentially useful engineering contribution when the generalization holds.

major comments (2)

[Abstract] The central claims that the method 'leads to a reduction in the amount of downloaded documents deemed useless, and yields a greater quantity of parallel documents compared to conventional crawling approaches' are asserted without any quantitative metrics, baseline comparisons, dataset descriptions, or error analysis. This absence prevents assessment of whether the claimed benefits are realized.
[Evaluation section] No experiments are reported that test model performance or crawling yield on URLs drawn from live web crawls rather than the training distribution. The central engineering claim depends on the models maintaining high precision on the long-tail, noisy URLs encountered in practice; without such evidence the early-pruning benefit is unverified.

minor comments (1)

The manuscript should include explicit descriptions of the training corpora, fine-tuning hyperparameters, and evaluation metrics used for the two auxiliary tasks.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful review and constructive feedback. We address each major comment below and outline the revisions we will make to strengthen the manuscript.

Referee: [Abstract] The central claims that the method 'leads to a reduction in the amount of downloaded documents deemed useless, and yields a greater quantity of parallel documents compared to conventional crawling approaches' are asserted without any quantitative metrics, baseline comparisons, dataset descriptions, or error analysis. This absence prevents assessment of whether the claimed benefits are realized.

Authors: We agree that the abstract would be strengthened by including concrete quantitative support for the central claims. In the revised manuscript we will add specific metrics (e.g., percentage reduction in useless downloads and increase in parallel documents found) drawn from the evaluation section, along with brief references to the datasets and baselines used. revision: yes

Referee: [Evaluation section] No experiments are reported that test model performance or crawling yield on URLs drawn from live web crawls rather than the training distribution. The central engineering claim depends on the models maintaining high precision on the long-tail, noisy URLs encountered in practice; without such evidence the early-pruning benefit is unverified.

Authors: The evaluation section reports results on a large held-out test set of real-world URLs collected independently from the training data; these URLs exhibit the noise and diversity typical of web content. We will revise the text to explicitly describe the construction and characteristics of this test set and to discuss its relation to live-crawl distributions. While a full online crawling experiment would provide additional validation, the current offline evaluation already demonstrates the early-pruning benefit under realistic URL conditions. revision: partial

Circularity Check

0 steps flagged

No circularity; empirical ML application with external evaluation

The paper trains two fine-tuned Transformer models on URL language ID and parallelism detection tasks, then measures their effect on crawling yield versus baseline crawlers. No equations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. The central claim rests on measured reductions in useless downloads during live crawling, which is evaluated separately from the training objectives and does not reduce to a definitional identity or input fit.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The approach depends on the transferability of a pre-trained multilingual encoder to URL classification tasks and on the existence of detectable URL patterns that correlate with language and parallelism.

free parameters (1)

fine-tuning hyperparameters
Learning rate, batch size, and number of epochs for the two new tasks are chosen during fine-tuning and affect downstream crawling performance.

axioms (1)

domain assumption: A pre-trained multilingual Transformer encoder can be successfully adapted to URL-based language identification and parallelism detection via fine-tuning.
The method invokes this transfer-learning premise to justify the two new tasks.

Reference graph

Works this paper leans on

3 extracted references · 3 canonical work pages · 1 internal anchor

[1]

In Proceedings of the 58th annual meeting of the association for computational linguistics, 4555–4567

ParaCrawl: web-scale acquisition of parallel corpora. In Proceedings of the 58th annual meeting of the association for computational linguistics, 4555–4567. Online: Association for Computational Linguistics, July. https://doi.org/10.18653/v1/2020.acl-main.417. https://aclanthology.org/2020.acl-main.417. Bañón, Marta, Miquel Esplà-Gomis, Mikel L. Forcada, C...

work page doi:10.1145/2435215.2435218 2020
[2]

No Language Left Behind: Scaling Human-Centered Machine Translation

https://aclanthology.org/W16-2367. Esplà-Gomis, Miquel, and Mikel L. Forcada. 2010. Combining content-based and URL-based heuristics to harvest aligned bitexts from multilingual sites with bitextor. The Prague Bulletin of Mathematical Linguistics, no. 93, 77–86. Facebook AI Research. 2017. Language identification with fasttext. https://fasttext.cc/docs/en/la...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.18653/v1/w19-5207 2010
[3]

MT5: a massively multilingual pre-trained text-to-text transformer. In Proceedings of the 2021 conference of the north american chapter of the association for computational linguistics: human language technologies, edited by Kristina Toutanova, Anna Rumshisky, Luke Zettlemoyer, Dilek Hakkani-Tur, Iz Beltagy, Steven Bethard, Ryan Cotterell, Tanmoy Chakrabort...

work page doi:10.18653/v1/2021.naacl-main.41 2021
