Recognition: no theorem link
Unsupervised Cross-lingual Representation Learning at Scale
Pith reviewed 2026-05-16 16:18 UTC · model grok-4.3
The pith
Pretraining multilingual language models on 100 languages with over two terabytes of data leads to large gains on cross-lingual benchmarks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
XLM-R, trained as a masked language model on one hundred languages with more than two terabytes of filtered CommonCrawl data, significantly outperforms mBERT on cross-lingual benchmarks including +14.6% average accuracy on XNLI, +13% average F1 on MLQA, and +2.4% F1 on NER, with larger improvements for low-resource languages, while remaining competitive with monolingual models on GLUE and XNLI.
What carries the argument
The Transformer-based masked language model pretrained at scale on filtered CommonCrawl data from 100 languages, which manages the trade-off between positive transfer and capacity dilution.
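To make the machinery concrete, here is a minimal sketch of masked-language-model inference with the publicly released XLM-R checkpoint through the Hugging Face transformers library. The `xlm-roberta-base` model id, the example sentences, and the printed fields are illustrative assumptions rather than details taken from this review.

```python
# Minimal sketch: masked-language-model inference with XLM-R via the
# Hugging Face `transformers` pipeline (pip install transformers torch).
# The model id and example sentences are illustrative assumptions.
from transformers import pipeline

# XLM-R was pretrained with the masked-language-model objective, so a
# fill-mask head is available directly; one model covers all pretraining
# languages.
fill_mask = pipeline("fill-mask", model="xlm-roberta-base")

# XLM-R's SentencePiece vocabulary uses "<mask>" as the mask token.
examples = [
    "The capital of France is <mask>.",  # English (high-resource)
    "Mji mkuu wa Ufaransa ni <mask>.",   # Swahili (low-resource)
]

for text in examples:
    best = fill_mask(text, top_k=1)[0]
    print(f"{text!r} -> {best['token_str']} (p={best['score']:.2f})")
```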
Load-bearing premise
The performance gains are caused by the increased scale of pretraining data and languages rather than by differences in data filtering, hyperparameter choices, or evaluation protocol details.
What would settle it
A controlled retraining of mBERT on the exact same >2 TB of filtered CommonCrawl data from 100 languages, to test whether the gains persist or disappear.
original abstract
This paper shows that pretraining multilingual language models at scale leads to significant performance gains for a wide range of cross-lingual transfer tasks. We train a Transformer-based masked language model on one hundred languages, using more than two terabytes of filtered CommonCrawl data. Our model, dubbed XLM-R, significantly outperforms multilingual BERT (mBERT) on a variety of cross-lingual benchmarks, including +14.6% average accuracy on XNLI, +13% average F1 score on MLQA, and +2.4% F1 score on NER. XLM-R performs particularly well on low-resource languages, improving 15.7% in XNLI accuracy for Swahili and 11.4% for Urdu over previous XLM models. We also present a detailed empirical analysis of the key factors that are required to achieve these gains, including the trade-offs between (1) positive transfer and capacity dilution and (2) the performance of high and low resource languages at scale. Finally, we show, for the first time, the possibility of multilingual modeling without sacrificing per-language performance; XLM-R is very competitive with strong monolingual models on the GLUE and XNLI benchmarks. We will make our code, data and models publicly available.
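One mechanism behind the abstract's trade-off between positive transfer and capacity dilution is how often each language is sampled during pretraining. The sketch below computes the exponentially smoothed sampling distribution commonly used for this, p_i ∝ (n_i / Σ_j n_j)^α; the α value of 0.3 and the toy corpus sizes are illustrative assumptions, not figures taken from this page.

```python
# Minimal sketch: exponentially smoothed language sampling for multilingual
# MLM pretraining. Smaller alpha upsamples low-resource languages (less
# capacity dilution for them) at the cost of seeing high-resource text less
# often. The alpha value and corpus sizes below are illustrative assumptions.

def sampling_probs(token_counts, alpha=0.3):
    """Return p_i proportional to (n_i / sum_j n_j) ** alpha."""
    total = sum(token_counts.values())
    weights = {lang: (n / total) ** alpha for lang, n in token_counts.items()}
    z = sum(weights.values())
    return {lang: w / z for lang, w in weights.items()}

# Toy token counts mimicking a high/low-resource split (not CC-100 figures).
corpus = {"en": 300e9, "ru": 250e9, "sw": 0.3e9, "ur": 0.7e9}

for alpha in (1.0, 0.3):
    probs = sampling_probs(corpus, alpha)
    summary = ", ".join(f"{lang}={p:.3f}" for lang, p in probs.items())
    print(f"alpha={alpha}: {summary}")
```

With alpha = 1 the raw corpus proportions dominate and the low-resource languages are almost never sampled; lowering alpha shifts probability mass toward them, which is the kind of knob the capacity-dilution analysis concerns.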
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents XLM-R, a Transformer-based masked language model pretrained unsupervised on 100 languages using more than 2 TB of filtered CommonCrawl data. It reports large gains over mBERT on cross-lingual transfer benchmarks (+14.6% average accuracy on XNLI, +13% average F1 on MLQA, +2.4% F1 on NER), with especially strong improvements on low-resource languages (e.g., +15.7% XNLI for Swahili). The paper includes an empirical analysis of trade-offs between positive transfer and capacity dilution across resource levels and shows that XLM-R remains competitive with strong monolingual models on GLUE and XNLI.
Significance. If the results hold after controlling for corpus differences, the work would be significant for establishing that scaling both data volume and language coverage in multilingual pretraining produces broad, practically useful gains in cross-lingual transfer, especially for low-resource languages. The public release of code, data, and models would further increase its value as a reproducible baseline.
major comments (2)
- [Abstract and empirical analysis section] The central claim attributes the reported gains to pretraining 'at scale' (100 languages, >2 TB of data), yet the comparison is to mBERT trained on Wikipedia; no controlled ablation is described that holds data source, filtering, and language balance fixed while varying only token count or number of languages. This leaves open whether the +14.6% XNLI gain and the low-resource improvements are caused by scale or by differences in corpus quality and distribution.
- [Experimental results section] The headline deltas (e.g., +13% MLQA F1, +2.4% NER F1) are presented without error bars, number of runs, or statistical significance tests, making it impossible to assess whether the improvements are robust to run-to-run variance.
minor comments (2)
- [Data section] The description of the CommonCrawl data filtering steps could be expanded with explicit criteria and language-specific statistics to aid reproducibility; a minimal filtering sketch follows these comments.
- [Table 1] The table comparing XLM-R to prior models would benefit from an additional column reporting the training data volume and number of languages for each baseline.
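On the filtering point raised above, the sketch below shows a CCNet-style pass of the kind typically applied to CommonCrawl before multilingual pretraining: fastText language identification plus simple length and confidence thresholds (CCNet additionally filters by language-model perplexity, omitted here). The threshold values, the helper name, and the `lid.176.bin` model file are illustrative assumptions rather than the paper's exact criteria.

```python
# Minimal sketch of a CCNet-style CommonCrawl filter: fastText language
# identification plus length/confidence thresholds. Values below are
# illustrative assumptions, not the paper's reported criteria; CCNet also
# applies a KenLM perplexity filter, omitted here for brevity.
import fasttext

lid = fasttext.load_model("lid.176.bin")  # public fastText language-ID model

def keep(paragraph, target_lang, min_chars=200, min_conf=0.5):
    """Keep a paragraph if it is long enough and confidently in target_lang."""
    text = paragraph.strip().replace("\n", " ")  # fastText expects one line
    if len(text) < min_chars:
        return False
    labels, probs = lid.predict(text)
    lang = labels[0].replace("__label__", "")
    return lang == target_lang and float(probs[0]) >= min_conf

# Example: keep Swahili paragraphs from a small shard of crawled text.
shard = ["Mji mkuu wa Tanzania ni Dodoma. " * 10, "short noise"]
kept = [p for p in shard if keep(p, "sw")]
print(f"kept {len(kept)} of {len(shard)} paragraphs")
```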
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive review. We address each major comment below, clarifying our claims about scaling and the practical constraints on experimental reporting.
point-by-point responses
-
Referee: [Abstract and empirical analysis section] The central claim attributes the reported gains to pretraining 'at scale' (100 languages, >2 TB of data), yet the comparison is to mBERT trained on Wikipedia; no controlled ablation is described that holds data source, filtering, and language balance fixed while varying only token count or number of languages. This leaves open whether the +14.6% XNLI gain and the low-resource improvements are caused by scale or by differences in corpus quality and distribution.
Authors: We agree that a fully controlled ablation isolating token count while holding data source, filtering, and language balance exactly fixed would provide stronger causal evidence. Our empirical analysis section does examine trade-offs between positive transfer and capacity dilution by varying the number of languages (and thus effective capacity per language) while using the same CommonCrawl data, and we show consistent gains on low-resource languages as scale increases. However, we do not claim the gains are due solely to scale independent of corpus differences; mBERT is used as the standard public baseline. In revision we will add an explicit paragraph in the analysis section acknowledging the Wikipedia vs. CommonCrawl difference and noting that controlled ablations remain an important direction for future work. revision: partial
-
Referee: [Experimental results section] The headline deltas (e.g., +13% MLQA F1, +2.4% NER F1) are presented without error bars, number of runs, or statistical significance tests, making it impossible to assess whether the improvements are robust to run-to-run variance.
Authors: We acknowledge that variance estimates would strengthen the presentation. Pretraining XLM-R required processing more than 2 TB of data on large GPU clusters; repeating the full pretraining multiple times to obtain error bars is computationally prohibitive. The reported improvements are large in magnitude and hold consistently across four diverse benchmarks (XNLI, MLQA, NER, and GLUE), including per-language XNLI breakdowns. We will add a short statement in the experimental setup section explaining single-run reporting due to resource constraints and noting that results are corroborated by cross-task consistency. revision: partial
Circularity Check
No circularity: empirical benchmark results independent of inputs
full rationale
The paper's central claim rests on training XLM-R on >2 TB filtered CommonCrawl data across 100 languages and reporting direct performance numbers on held-out benchmarks (XNLI +14.6%, MLQA +13%, NER +2.4%, plus low-resource gains). These are measured outcomes, not quantities defined in terms of fitted parameters or prior results inside the paper. No equations, ansatzes, or uniqueness theorems are invoked that reduce to self-citation or self-definition. The trade-off analysis between positive transfer and capacity dilution is presented via additional controlled experiments rather than by construction. The derivation chain is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (2)
- Number of languages = 100
- Pretraining data volume = 2 terabytes
axioms (1)
- Domain assumption: Masked language modeling on multilingual text produces representations that transfer across languages.
Forward citations
Cited by 18 Pith papers
-
GAViD: A Large-Scale Multimodal Dataset for Context-Aware Group Affect Recognition from Videos
GAViD is a new multimodal video dataset for context-aware group affect recognition, with CAGNet reaching 63.20% test accuracy comparable to prior state-of-the-art.
-
Mix, Don't Tune: Bilingual Pre-Training Outperforms Hyperparameter Search in Data-Constrained Settings
Mixing auxiliary high-resource language data outperforms hyperparameter tuning in data-constrained bilingual pre-training, with gains equivalent to 2-13 times more unique target data.
-
COMPASS: COntinual Multilingual PEFT with Adaptive Semantic Sampling
COMPASS uses semantic clustering on multilingual embeddings to select auxiliary data for PEFT adapters, outperforming linguistic-similarity baselines on multilingual benchmarks while supporting continual adaptation.
-
KS-PRET-5M: A 5 Million Word, 12 Million Token Kashmiri Pretraining Dataset
KS-PRET-5M is a newly released 5.09 million word Kashmiri pretraining dataset containing 12.13 million subword tokens after MuRIL tokenization, made available as a continuous text stream under CC BY 4.0.
-
UniVLA: Learning to Act Anywhere with Task-centric Latent Actions
UniVLA trains cross-embodiment vision-language-action policies from unlabeled videos via a latent action model in DINO space, beating OpenVLA on benchmarks with 1/20th pretraining compute and 1/10th downstream data.
-
The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale
FineWeb is a curated 15T-token web dataset that produces stronger LLMs than prior open collections, while its educational subset sharply improves performance on MMLU and ARC benchmarks.
-
Unsupervised Dense Information Retrieval with Contrastive Learning
Contrastive learning trains unsupervised dense retrievers that beat BM25 on most BEIR datasets and support cross-lingual retrieval across scripts.
-
CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation
CodeXGLUE supplies a standardized collection of 10 code-related tasks, 14 datasets, an evaluation platform, and BERT-, GPT-, and encoder-decoder-style baselines.
-
Not All RAGs Are Created Equal: A Component-Wise Empirical Study for Software Engineering Tasks
Retriever-side choices, particularly the retrieval algorithm, exert more influence on RAG performance than generator selection across code generation, summarization, and repair tasks.
-
Lost in Translation? Exploring the Shift in Grammatical Gender from Latin to Occitan
An interpretable deep learning framework with a new tokenizer is used to quantify how grammatical gender information is distributed between lemmas and sentential context during the Latin-to-Occitan transition.
-
Natural Language Processing: A Comprehensive Practical Guide from Tokenisation to RLHF
A structured practicum guides readers through the complete modern NLP pipeline with reproducible sessions and new linguistic resources for Tajik and Tatar.
-
Automatic Reflection Level Classification in Hungarian Student Essays
Classical machine learning models outperform Hungarian transformers slightly in overall performance (71% vs 68% average score) for classifying reflection levels in student essays, though transformers handle rare class...
-
Multilingual Training and Evaluation Resources for Vision-Language Models
Releases regenerated multilingual training data and translated benchmarks for VLMs in five languages and demonstrates consistent benefits from multilingual training over English-only baselines.
-
Bridging Linguistic Gaps: Cross-Lingual Mapping in Pre-Training and Dataset for Enhanced Multilingual LLM Performance
A new pre-training task that maps languages bidirectionally in embedding space improves machine translation by up to 11.9 BLEU, cross-lingual QA by 6.72 BERTScore points, and understanding accuracy by over 5% over str...
-
'Layer su Layer': Identifying and Disambiguating the Italian NPN Construction in BERT's family
Layer-wise probing shows the degree to which Italian NPN constructions' form and meaning are reflected in BERT contextual embeddings.
-
VerifAI: A Verifiable Open-Source Search Engine for Biomedical Question Answering
VerifAI is an open-source biomedical QA system that decomposes generated answers into claims and verifies them with a fine-tuned NLI engine to reduce hallucinations and provide traceable citations.
-
Comparison of Modern Multilingual Text Embedding Techniques for Hate Speech Detection Task
Supervised models using embeddings like jina and e5 reach up to 92% accuracy on multilingual hate speech detection, substantially outperforming anomaly detection, while PCA to 64 dimensions preserves most performance ...
-
Natural Language Processing: A Comprehensive Practical Guide from Tokenisation to RLHF
The work provides a reproducible, session-based guide to the NLP pipeline with original adaptations and resources for morphologically rich low-resource languages.
Reference graph
Works this paper leans on
-
[1]
Massively Multilingual Neural Machine Translation in the Wild: Findings and Challenges
Massively multilingual neural machine translation in the wild: Findings and challenges. arXiv preprint arXiv:1907.05019.
-
[2]
Bag of tricks for efficient text classification. EACL 2017.
-
[3]
Exploring the Limits of Language Modeling
Exploring the limits of language modeling. arXiv preprint arXiv:1602.02410.
-
[4]
MLQA: Evaluating Cross-lingual Extractive Question Answering
MLQA: Evaluating cross-lingual extractive question answering. arXiv preprint arXiv:1910.07475.
-
[5]
RoBERTa: A Robustly Optimized BERT Pretraining Approach
RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
-
[6]
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683.
-
[7]
Introduction to the CoNLL-2002 shared task: Language-independent named entity recognition. In CoNLL.
-
[8]
XLDA: Cross-Lingual Data Augmentation for Natural Language Inference and Question Answering
XLDA: Cross-lingual data augmentation for natural language inference and question answering. arXiv preprint arXiv:1905.11471.
-
[9]
Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In CoNLL, pages 142–147. Association for Computational Linguistics.
-
[10]
GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding
GLUE: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461.
-
[11]
CCNet: Extracting High Quality Monolingual Datasets from Web Crawl Data
CCNet: Extracting high quality monolingual datasets from web crawl data. arXiv preprint arXiv:1911.00359.
-
[12]
Unsupervised Data Augmentation for Consistency Training
Unsupervised data augmentation for consistency training. arXiv preprint arXiv:1904.12848.