pith:ERBMSKCV
Pretraining Language Models with Subword Regularization: An Empirical Study of BPE Dropout in Low-Resource NLP
Stochastic tokenization during both pretraining and fine-tuning yields the best results in low-resource NLP tasks.
arxiv:2605.13436 v1 · 2026-05-13 · cs.CL · cs.LG
Add to your LaTeX paper
\usepackage{pith}
\pithnumber{ERBMSKCVHF5ETVX2DJW5NHOQD3}
Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.
Claims
Across tasks, the best results are typically obtained when stochastic tokenization is applied during both pretraining and fine-tuning, whereas applying BPE dropout only during fine-tuning can underperform deterministic tokenization in smaller-data settings.
These findings rest on the assumption that the downsampled subsets of high-resource languages and the chosen evaluation tasks sufficiently represent truly low-resource scenarios, and that the modest morphological alignment gains explain the performance benefits.
Stochastic tokenization with BPE dropout during both pretraining and fine-tuning outperforms deterministic tokenization or fine-tuning-only dropout on low-resource NLP tasks.
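For readers unfamiliar with the technique named in these claims, here is a minimal toy sketch of BPE dropout as it is commonly described: at each merge step, every applicable merge is skipped with probability p, so the same word can be segmented differently on each pass. The function name `bpe_dropout_encode`, the merge table, and the dropout rate are hypothetical illustrations, not the paper's tokenizer or settings.

```python
import random


def bpe_dropout_encode(word, merges, p=0.1, rng=None):
    """Segment `word` with BPE, skipping each candidate merge with probability p.

    `merges` is an ordered list of (left, right) pairs, highest priority first.
    This is an illustrative toy, not the paper's implementation.
    """
    rng = rng or random.Random()
    rank = {pair: i for i, pair in enumerate(merges)}
    tokens = list(word)  # start from single characters
    while len(tokens) > 1:
        # Adjacent pairs that are in the merge table and survive dropout this step.
        candidates = [
            (rank[pair], i)
            for i, pair in enumerate(zip(tokens, tokens[1:]))
            if pair in rank and rng.random() >= p
        ]
        if not candidates:
            break  # nothing left to merge (or every candidate was dropped)
        _, i = min(candidates)  # apply the highest-priority surviving merge
        tokens[i:i + 2] = [tokens[i] + tokens[i + 1]]
    return tokens


# Hypothetical merge table for illustration only.
toy_merges = [("l", "o"), ("lo", "w"), ("e", "r")]
deterministic = bpe_dropout_encode("lower", toy_merges, p=0.0)
stochastic = bpe_dropout_encode("lower", toy_merges, p=0.5, rng=random.Random(0))
```

With p = 0 this reduces to ordinary deterministic BPE, and with p = 1 it falls back to single characters. Values in between produce the varied segmentations that the claims refer to as stochastic tokenization.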
Receipt and verification
| Field | Value |
|---|---|
| First computed | 2026-05-18T02:44:47.101101Z |
| Builder | pith-number-builder-2026-05-17-v1 |
| Signature | Pith Ed25519 (pith-v1-2026-05) · public key |
| Schema | pith-number/v1.0 |
Canonical hash
2442c92855397a49d6fa1a6dd69dd01ed208fffb1703e0dd7cb1e5486fb175bb
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/ERBMSKCVHF5ETVX2DJW5NHOQD3 \
| jq -c '.canonical_record' \
| python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 2442c92855397a49d6fa1a6dd69dd01ed208fffb1703e0dd7cb1e5486fb175bb
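The same check can be sketched offline in Python, assuming the canonicalization is exactly what the one-liner above performs: keys sorted, compact separators, non-ASCII characters left unescaped, UTF-8 encoding, then SHA-256. The helper name `canonical_sha256` is hypothetical. The record literal is copied from this page rather than fetched, and the sketch has not been run against the live endpoint.

```python
import hashlib
import json


def canonical_sha256(record):
    """Hash a record with the canonicalization used in the shell one-liner."""
    payload = json.dumps(
        record,
        sort_keys=True,          # key order must not affect the hash
        separators=(",", ":"),   # no whitespace between tokens
        ensure_ascii=False,      # keep non-ASCII characters as-is
    ).encode()                   # str.encode() defaults to UTF-8
    return hashlib.sha256(payload).hexdigest()


# Canonical record as printed on this page.
record = {
    "metadata": {
        "abstract_canon_sha256": "95cf7dcc087fa1067e2a6e12016390653396469c8ef71be08c5541b0a847e0e6",
        "cross_cats_sorted": ["cs.LG"],
        "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
        "primary_cat": "cs.CL",
        "submitted_at": "2026-05-13T12:31:04Z",
        "title_canon_sha256": "fe3fc4f0fe3e5638c42b357229d5dddc318e3b413ba0ca2cab87bff14dadd35a",
    },
    "schema_version": "1.0",
    "source": {"id": "2605.13436", "kind": "arxiv", "version": 1},
}

digest = canonical_sha256(record)
print(digest)  # compare against the canonical hash listed on this page
```

Because the keys are sorted before hashing, the order in which they appear in the fetched JSON should not change the digest.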
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "95cf7dcc087fa1067e2a6e12016390653396469c8ef71be08c5541b0a847e0e6",
    "cross_cats_sorted": [
      "cs.LG"
    ],
    "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
    "primary_cat": "cs.CL",
    "submitted_at": "2026-05-13T12:31:04Z",
    "title_canon_sha256": "fe3fc4f0fe3e5638c42b357229d5dddc318e3b413ba0ca2cab87bff14dadd35a"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2605.13436",
    "kind": "arxiv",
    "version": 1
  }
}