{"paper":{"title":"Unsupervised Cross-lingual Representation Learning at Scale","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"Pretraining multilingual language models on 100 languages with over two terabytes of data leads to large gains on cross-lingual benchmarks.","cross_cats":[],"primary_cat":"cs.CL","authors_text":"Alexis Conneau, Edouard Grave, Francisco Guzmán, Guillaume Wenzek, Kartikay Khandelwal, Luke Zettlemoyer, Myle Ott, Naman Goyal, Veselin Stoyanov, Vishrav Chaudhary","submitted_at":"2019-11-05T22:42:00Z","abstract_excerpt":"This paper shows that pretraining multilingual language models at scale leads to significant performance gains for a wide range of cross-lingual transfer tasks. We train a Transformer-based masked language model on one hundred languages, using more than two terabytes of filtered CommonCrawl data. Our model, dubbed XLM-R, significantly outperforms multilingual BERT (mBERT) on a variety of cross-lingual benchmarks, including +14.6% average accuracy on XNLI, +13% average F1 score on MLQA, and +2.4% F1 score on NER. 
XLM-R performs particularly well on low-resource languages, improving 15.7% in XNL"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"This paper shows that pretraining multilingual language models at scale leads to significant performance gains for a wide range of cross-lingual transfer tasks.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That the observed gains are caused by the increased scale of pretraining data and languages rather than by differences in data filtering, hyperparameter choices, or evaluation protocol details not visible in the abstract.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"XLM-R, pretrained on 100 languages with 2TB of CommonCrawl data, improves average XNLI accuracy by 14.6 points and MLQA F1 by 13 points over mBERT while matching strong monolingual models on GLUE.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Pretraining multilingual language models on 100 languages with over two terabytes of data leads to large gains on cross-lingual benchmarks.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"ef67d3374724e211b0fc5e8581c310acceec5c663fec28ffa6b47a67a67b4027"},"source":{"id":"1911.02116","kind":"arxiv","version":2},"verdict":{"id":"5d44db8b-ea6a-41b0-a378-480aa65173b2","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-16T16:18:54.635933Z","strongest_claim":"This paper shows that pretraining multilingual language models at scale leads to significant performance gains for a wide range of cross-lingual transfer tasks.","one_line_summary":"XLM-R, pretrained on 100 languages with 2TB of CommonCrawl data, improves average XNLI accuracy by 14.6 points 
and MLQA F1 by 13 points over mBERT while matching strong monolingual models on GLUE.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That the observed gains are caused by the increased scale of pretraining data and languages rather than by differences in data filtering, hyperparameter choices, or evaluation protocol details not visible in the abstract.","pith_extraction_headline":"Pretraining multilingual language models on 100 languages with over two terabytes of data leads to large gains on cross-lingual benchmarks."},"references":{"count":12,"sample":[{"doi":"","year":2019,"title":"Massively multilingual neural machine translation in the wild: Findings and challenges","work_id":"1f743ee4-68c2-4ada-b981-6f62054e2525","ref_index":1,"cited_arxiv_id":"1907.05019","is_internal_anchor":true},{"doi":"","year":2017,"title":"Bag of tricks for efficient text classification. EACL 2017, page","work_id":"194fdea9-8a27-4838-8c7b-97928d0f3ecb","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"Exploring the limits of language modeling","work_id":"a9dbcb7a-e48d-42a4-8d60-a8f723751a97","ref_index":3,"cited_arxiv_id":"1602.02410","is_internal_anchor":true},{"doi":"","year":2019,"title":"arXiv preprint arXiv:1910.07475","work_id":"da4ff338-4ef3-4d81-bf0d-cb5b178e77df","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2019,"title":"RoBERTa: A Robustly Optimized BERT Pretraining 
Approach","work_id":"41fe12c4-e538-4890-a244-480650ed3078","ref_index":5,"cited_arxiv_id":"1907.11692","is_internal_anchor":true}],"resolved_work":12,"snapshot_sha256":"15389c1c0e33c1bb9a3d62030e84c68491ebec291274483c3db97663d92cfc21","internal_anchors":6},"formal_canon":{"evidence_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}