Pith Number

pith:A536HLN7

pith:2019:A536HLN7UJUF763LWO53KFKHUV

not attested · not anchored · not stored · refs resolved

Unsupervised Cross-lingual Representation Learning at Scale

Alexis Conneau, Edouard Grave, Francisco Guzmán, Guillaume Wenzek, Kartikay Khandelwal, Luke Zettlemoyer, Myle Ott, Naman Goyal, Veselin Stoyanov, Vishrav Chaudhary

Pretraining multilingual language models on 100 languages with over two terabytes of data leads to large gains on cross-lingual benchmarks.

arxiv:1911.02116 v2 · 2019-11-05 · cs.CL


Add to your LaTeX paper

\usepackage{pith}
\pithnumber{A536HLN7UJUF763LWO53KFKHUV}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp

2 Internet Archive

3 Author claim open

4 Citations open

5 Replications open

✓ Portable graph bundle live

The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1 · strongest claim

This paper shows that pretraining multilingual language models at scale leads to significant performance gains for a wide range of cross-lingual transfer tasks.

C2 · weakest assumption

That the observed gains are caused by the increased scale of pretraining data and languages rather than by differences in data filtering, hyperparameter choices, or evaluation protocol details not visible in the abstract.

C3 · one-line summary

XLM-R, pretrained on 100 languages with 2TB of CommonCrawl data, improves average XNLI accuracy by 14.6 points and MLQA F1 by 13 points over mBERT while matching strong monolingual models on GLUE.

References

12 extracted · 12 resolved · 6 Pith anchors

[1] Massively multilingual neural machine translation in the wild: Findings and challenges 1907 · arXiv:1907.05019

[2] Bag of tricks for efficient text classification. EACL 2017, page 2017

[3] Exploring the limits of language modeling · arXiv:1602.02410

[4] arXiv preprint arXiv:1910.07475 1910

[5] RoBERTa: A Robustly Optimized BERT Pretraining Approach 1907 · arXiv:1907.11692

Cited by

36 papers in Pith

Want Better Synthetic Data? Steer It: Activation Steering for Low-Resource Language Generation

BamiBERT: A New BERT-based Language Model for Vietnamese

SV-Detect: AI-generated Text Detection with Steering Vectors

When Meaning Travels: A Granular Lens on Hybrid-MoE's Role in Idiomatic Understanding for Language Models

The Tokenizer Tax Across 25 European Languages: Domain Invariance, Cross-Lingual Few-Shot Effects, and the Ukrainian Penalty

Receipt and verification

First computed	2026-05-17T23:38:47.315378Z
Builder	pith-number-builder-2026-05-17-v1
Signature	Pith Ed25519 (`pith-v1-2026-05`) · public key
Schema	pith-number/v1.0

Canonical hash

0777e3adbfa2685ffb6bb3bbb51547a555b8cee1800b2893176cd77160efda46

Aliases

arxiv: 1911.02116 · arxiv_version: 1911.02116v2 · doi: 10.48550/arxiv.1911.02116 · pith_short_12: A536HLN7UJUF · pith_short_16: A536HLN7UJUF763L · pith_short_8: A536HLN7
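The three `pith_short_*` aliases listed above are consistent with simple prefixes of the full identifier. The sketch below illustrates that observation only; the page does not document the rule, and the function name `short_aliases` is hypothetical.

```python
# Full identifier taken from this record.
PITH_ID = "A536HLN7UJUF763LWO53KFKHUV"


def short_aliases(pith_id: str, lengths=(8, 12, 16)) -> dict:
    """Return prefix aliases keyed like the record's pith_short_N fields.

    Assumption: short aliases are plain prefixes of the full identifier,
    which matches the three values shown on this page.
    """
    return {f"pith_short_{n}": pith_id[:n] for n in lengths}


aliases = short_aliases(PITH_ID)
for name, value in aliases.items():
    print(name, value)
```

Comparing the printed values against the Aliases line is a quick sanity check that a shortened identifier refers to this record.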

Agent API

Resolver JSON · Graph JSON · Events JSON · Schema · Signing key

Verify this Pith Number yourself

curl -sH 'Accept: application/ld+json' https://pith.science/pith/A536HLN7UJUF763LWO53KFKHUV \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 0777e3adbfa2685ffb6bb3bbb51547a555b8cee1800b2893176cd77160efda46

Canonical record JSON

{
  "metadata": {
    "abstract_canon_sha256": "fbed3f020eb6d7cdcebb15ee2da04eef8c1db21877b60207edcbbd0b72267088",
    "cross_cats_sorted": [],
    "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
    "primary_cat": "cs.CL",
    "submitted_at": "2019-11-05T22:42:00Z",
    "title_canon_sha256": "f1c1e325d47d6ee88301d33ccbf5082b8804a1f894c8786e31e3003ca0f104c5"
  },
  "schema_version": "1.0",
  "source": {
    "id": "1911.02116",
    "kind": "arxiv",
    "version": 2
  }
}
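The record above can also be hashed offline, without the network call. This is a minimal sketch that mirrors the serialization used in the verification one-liner (sorted keys, compact separators, UTF-8). The helper name `canonical_sha256` is hypothetical, and whether the digest matches the published canonical hash depends on the builder using exactly this serialization, which is an assumption here.

```python
import hashlib
import json

# Canonical record copied from the JSON shown on this page.
record = {
    "metadata": {
        "abstract_canon_sha256": "fbed3f020eb6d7cdcebb15ee2da04eef8c1db21877b60207edcbbd0b72267088",
        "cross_cats_sorted": [],
        "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
        "primary_cat": "cs.CL",
        "submitted_at": "2019-11-05T22:42:00Z",
        "title_canon_sha256": "f1c1e325d47d6ee88301d33ccbf5082b8804a1f894c8786e31e3003ca0f104c5",
    },
    "schema_version": "1.0",
    "source": {"id": "1911.02116", "kind": "arxiv", "version": 2},
}

# Hash listed under "Canonical hash" on this page.
EXPECTED = "0777e3adbfa2685ffb6bb3bbb51547a555b8cee1800b2893176cd77160efda46"


def canonical_sha256(obj) -> str:
    """Serialize with sorted keys and no whitespace, then SHA-256 the UTF-8 bytes."""
    payload = json.dumps(
        obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


digest = canonical_sha256(record)
print(digest)
# Prints whether the local digest equals the published hash; a mismatch would
# suggest the builder canonicalizes differently from this sketch.
print(digest == EXPECTED)
```

Because keys are sorted before hashing, the digest does not depend on the order in which the JSON fields were written, which is what lets a mirror recompute the same value from its own copy.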