pith:2VSLHXZL
Mix, Don't Tune: Bilingual Pre-Training Outperforms Hyperparameter Search in Data-Constrained Settings
Mixing high-resource language data outperforms hyperparameter tuning for low-resource pre-training.
arxiv:2605.13225 v1 · 2026-05-13 · cs.LG
Add to your LaTeX paper
\usepackage{pith}
\pithnumber{2VSLHXZLA4T3HSYTCPVSDWTGDX}
Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.
Claims
mixing yields larger improvements than hyperparameter tuning on both validation loss and downstream task accuracy, and the gap grows with model size. We quantify how much mixing helps: it boosts performance by an amount equivalent to 2-3× the unique target data on validation loss and 2-13× on downstream task accuracy, with the gain scaling steeply with model size.
This rests on the assumption that the chosen mixing ratios are near-optimal and that English data supplies useful, non-conflicting signal for Arabic, without introducing domain mismatch that would require separate controls.
Mixing auxiliary high-resource language data outperforms hyperparameter tuning in data-constrained bilingual pre-training, with gains equivalent to 2-13 times more unique target data.
Receipt and verification
| First computed | 2026-05-18T02:44:49.635911Z |
|---|---|
| Builder | pith-number-builder-2026-05-17-v1 |
| Signature | Pith Ed25519 (pith-v1-2026-05) · public key |
| Schema | pith-number/v1.0 |
Canonical hash
d564b3df2b0727b3cb1313eb21da661de40c6207e186cadfeb0f3b59d4385ca8
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/2VSLHXZLA4T3HSYTCPVSDWTGDX \
| jq -c '.canonical_record' \
| python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: d564b3df2b0727b3cb1313eb21da661de40c6207e186cadfeb0f3b59d4385ca8
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "f611460a07c52135d26e5d0aa86bf5d2c0167ea58dbd4c572cd6c471189765f1",
    "cross_cats_sorted": [],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.LG",
    "submitted_at": "2026-05-13T09:17:51Z",
    "title_canon_sha256": "d1f030a3df4a94c573b01b50ee1b517f6181a1e68243d22338561604cda508a0"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2605.13225",
    "kind": "arxiv",
    "version": 1
  }
}
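The verification one-liner above can also be run offline against the record as printed on this page. The following is a minimal sketch, assuming the JSON shown here is byte-for-byte the same `canonical_record` the endpoint serves (this page does not confirm that). The function name `canonical_sha256` and the `EXPECTED` constant are illustrative names, not part of any Pith tooling. The serialization settings (sorted keys, compact separators, no ASCII escaping, UTF-8) are copied from the one-liner.

```python
import hashlib
import json


def canonical_sha256(record: dict) -> str:
    """Hash a record using the same serialization as the one-liner above:
    sorted keys, compact separators, no ASCII escaping, UTF-8 bytes."""
    payload = json.dumps(
        record,
        sort_keys=True,
        separators=(",", ":"),
        ensure_ascii=False,
    ).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


# Record copied from the "Canonical record JSON" section of this page.
record = {
    "metadata": {
        "abstract_canon_sha256": "f611460a07c52135d26e5d0aa86bf5d2c0167ea58dbd4c572cd6c471189765f1",
        "cross_cats_sorted": [],
        "license": "http://creativecommons.org/licenses/by/4.0/",
        "primary_cat": "cs.LG",
        "submitted_at": "2026-05-13T09:17:51Z",
        "title_canon_sha256": "d1f030a3df4a94c573b01b50ee1b517f6181a1e68243d22338561604cda508a0",
    },
    "schema_version": "1.0",
    "source": {"id": "2605.13225", "kind": "arxiv", "version": 1},
}

# Digest the page says to expect; compared here, not assumed to match.
EXPECTED = "d564b3df2b0727b3cb1313eb21da661de40c6207e186cadfeb0f3b59d4385ca8"

digest = canonical_sha256(record)
print(digest)
print("matches expected:", digest == EXPECTED)
```

Because keys are sorted before hashing, the digest does not depend on the order in which the JSON fields are written, only on their names and values.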