Mix, Don't Tune: Bilingual Pre-Training Outperforms Hyperparameter Search in Data-Constrained Settings
Pith reviewed 2026-05-14 20:13 UTC · model grok-4.3
The pith
Mixing high-resource language data outperforms hyperparameter tuning for low-resource pre-training.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In data-constrained pre-training, bilingual mixing of English data into Arabic training outperforms hyperparameter search, with benefits equivalent to 2-3 times more unique target data on validation loss and 2-13 times on downstream tasks. The advantage grows with model scale from 150M to 1.43B parameters. Target-language validation loss underestimates mixing's value since mixing adds knowledge beyond regularization from repeated data.
What carries the argument
The tunable mixing ratio between target Arabic and auxiliary English corpora, which expands the training distribution and diversifies the signal without needing more unique target data.
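A minimal sketch, assuming nothing about the paper's actual data pipeline, of how a single mixing ratio turns a small target corpus and a large auxiliary corpus into one training stream; the iterator names and the 0.3 ratio are illustrative, not taken from the paper.

```python
import itertools
import random

def mixed_batches(arabic_batches, english_batches, mix_ratio, seed=0):
    """Yield training batches, drawing from the auxiliary (English) stream with
    probability `mix_ratio` and from the target (Arabic) stream otherwise.

    `arabic_batches` / `english_batches`: infinite iterators over token batches
    (the target stream may cycle through a finite corpus many times).
    `mix_ratio`: fraction of steps spent on auxiliary data; the single tunable
    knob the review calls the mixing ratio.
    """
    rng = random.Random(seed)
    while True:
        if rng.random() < mix_ratio:
            yield next(english_batches)   # auxiliary, high-resource signal
        else:
            yield next(arabic_batches)    # target, data-constrained signal

# Illustrative usage with toy streams (real corpora would be tokenized shards).
arabic = itertools.cycle([f"ar_batch_{i}" for i in range(3)])   # small, repeated corpus
english = (f"en_batch_{i}" for i in itertools.count())          # effectively unbounded
stream = mixed_batches(arabic, english, mix_ratio=0.3)
print([next(stream) for _ in range(8)])
```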
Load-bearing premise
English data supplies useful, non-conflicting signal for Arabic without introducing domain mismatch that would require separate controls.
What would settle it
A controlled run at 1B+ parameters where the best mixing ratio fails to beat the best hyperparameter-tuned monolingual baseline on downstream Arabic task accuracy.
Original abstract
For most languages of the world, language model pre-training operates in a data-constrained regime where models must repeat their training data many times, degrading generalization. Two remedies exist: aggressive hyperparameter tuning such as high weight decay, and mixing in data from a high-resource auxiliary language to directly aid the low-resource target. While hyperparameter tuning regularizes the model by shrinking weights to restrict network capacity, auxiliary data mixing uses a tunable mixing ratio to expand the training distribution and diversify the training signal with new knowledge. Both offer a principled way to improve training in a data-constrained domain. We compare these levers systematically across four model scales from 150M to 1.43B parameters, using Arabic as the low-resource target and English as the auxiliary, over approximately 1000 pre-training runs. Three findings emerge. First, mixing yields larger improvements than hyperparameter tuning on both validation loss and downstream task accuracy, and the gap grows with model size. Second, we quantify how much mixing helps: it boosts performance by an amount equivalent to 2--3$\times$ the unique target data on validation loss and 2--13$\times$ on downstream task accuracy, with the gain scaling steeply with model size. Third, this divergence reveals that target-language validation loss systematically underestimates mixing's value. Mixing regularizes by diversifying the training signal and contributes knowledge the repeated target corpus cannot supply; validation loss captures only the first effect. Our practical recommendations are: mix in a high-resource language, prioritize the mixing ratio over hyperparameter tuning, and transfer hyperparameters from a small proxy model via $\mu$P.
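One plausible way to read the "2--3×" data-equivalence figures is as an inversion of a loss-versus-unique-data curve for the unmixed baseline: find how many unique target tokens the monolingual model would need to match the mixed model's loss. The sketch below illustrates that reading with invented numbers; the log-linear interpolation and the curve values are assumptions, not the paper's reported procedure.

```python
import numpy as np

# Hypothetical monolingual (no-mixing) results: Arabic validation loss as a
# function of unique Arabic tokens at fixed compute. All values are invented.
unique_tokens = np.array([1e9, 2e9, 4e9, 8e9, 16e9])
baseline_loss = np.array([3.60, 3.42, 3.28, 3.18, 3.11])

def data_equivalent(mixed_loss, tokens, losses):
    """Unique target tokens the unmixed baseline would need to reach `mixed_loss`,
    by log-linear interpolation of the baseline curve (losses decrease in tokens)."""
    return float(np.exp(np.interp(mixed_loss, losses[::-1], np.log(tokens)[::-1])))

# Suppose the best mixing ratio at 1.0B unique Arabic tokens reaches loss 3.35.
mixed_loss = 3.35
equiv = data_equivalent(mixed_loss, unique_tokens, baseline_loss)
print(f"mixing at 1.0B unique tokens ~ {equiv / 1e9:.1f}B unique tokens unmixed "
      f"({equiv / 1e9:.1f}x data-equivalent gain)")
```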
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that in data-constrained pre-training regimes, mixing auxiliary high-resource language data (English) into low-resource target data (Arabic) outperforms aggressive hyperparameter tuning on both validation loss and downstream task accuracy. Across roughly 1000 runs at four scales (150M to 1.43B parameters), this advantage grows with model size; mixing is quantified as equivalent to 2-3× the unique target data on validation loss and 2-13× on downstream tasks. The paper concludes that target-language validation loss underestimates mixing benefits and recommends prioritizing the mixing ratio while transferring other hyperparameters via μP from small proxies.
Significance. If the comparison between mixing and tuning is equitable, the result is practically significant for low-resource language modeling. It supplies concrete, scale-dependent quantifications of mixing gains in data-equivalent terms and identifies a systematic mismatch between validation loss and downstream utility. The large experimental budget (~1000 runs across four scales) and the explicit practical recommendations strengthen its utility for practitioners facing repeated-data regimes.
major comments (1)
- [Abstract / recommendations] The central claim that mixing outperforms hyperparameter tuning (with gaps equivalent to 2-13× unique data, widening with scale) rests on the fairness of the tuning baseline. The manuscript recommends transferring hyperparameters via μP from small proxies, yet the ~1000-run budget description suggests mixing-ratio experiments dominate; if weight decay, learning rate, or schedule were not re-optimized at each scale with effort comparable to the mixing search, the reported ordering and magnitudes risk being inflated. A matched-budget protocol, as sketched below, would remove the ambiguity.
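For concreteness, a toy sketch of what an equal-effort comparison could look like: the same number of runs per arm at each scale, one arm searching regularization hyperparameters at zero mixing, the other searching the mixing ratio with fixed transferred hyperparameters. The search spaces and the stand-in objective are invented for illustration, not the paper's grids.

```python
import random

def matched_budget_comparison(train_and_eval, budget_per_arm, seed=0):
    """Give the hyperparameter arm and the mixing arm the same number of runs,
    then compare best validation losses. `train_and_eval(config) -> val_loss`
    is a stand-in for a full pre-training run."""
    rng = random.Random(seed)
    lrs = [1e-4, 3e-4, 1e-3]
    weight_decays = [0.0, 0.1, 0.5, 1.0]
    mix_ratios = [0.0, 0.1, 0.2, 0.3, 0.5, 0.7]

    # Arm 1: tune regularization hyperparameters, no mixing.
    hp_runs = [train_and_eval({"lr": rng.choice(lrs),
                               "weight_decay": rng.choice(weight_decays),
                               "mix_ratio": 0.0})
               for _ in range(budget_per_arm)]
    # Arm 2: tune only the mixing ratio with fixed transferred hyperparameters.
    mix_runs = [train_and_eval({"lr": 3e-4, "weight_decay": 0.1,
                                "mix_ratio": rng.choice(mix_ratios)})
                for _ in range(budget_per_arm)]
    return min(hp_runs), min(mix_runs)  # best loss from each arm

# Usage with a fake objective (real usage would launch pre-training runs).
fake = lambda cfg: 3.5 - 0.3 * cfg["mix_ratio"] - 0.05 * cfg["weight_decay"]
print(matched_budget_comparison(fake, budget_per_arm=10))
```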
minor comments (2)
- [Experimental details] The exact allocation of the ~1000 runs across scales, mixing ratios, and hyperparameter conditions should be summarized in a table to allow readers to assess search effort directly.
- [Evaluation] Clarify whether downstream task accuracy uses the same evaluation protocol and number of shots across all conditions to ensure the 2-13× equivalence claim is robust.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback emphasizing the need for an equitable baseline in our comparison of mixing versus hyperparameter tuning. We address the major comment directly below with clarifications on our experimental design and commit to revisions that strengthen the manuscript without altering its core claims.
Point-by-point responses
- Referee: [Abstract / recommendations] The central claim that mixing outperforms hyperparameter tuning (with gaps equivalent to 2-13× unique data, widening with scale) rests on the fairness of the tuning baseline. The manuscript recommends transferring hyperparameters via μP from small proxies, yet the ~1000-run budget description suggests mixing-ratio experiments dominate; if weight decay, learning rate, or schedule were not re-optimized at each scale with effort comparable to the mixing search, the reported ordering and magnitudes risk being inflated.
Authors: We agree that the fairness of the hyperparameter tuning baseline is essential to the validity of our central claim. Our experimental protocol follows the μP framework by first conducting an extensive hyperparameter search—including learning rate, weight decay, and learning rate schedule—on small proxy models (150M parameters) across hundreds of runs. These optimized values were then transferred to larger scales (up to 1.43B) using μP scaling rules, with the ~1000 total runs primarily allocated to systematically varying mixing ratios at each scale while holding the transferred hyperparameters fixed. Limited spot-check re-optimizations were performed at intermediate scales to validate transfer quality, though a full grid search at every scale was not feasible under our compute budget. This mirrors standard practice in scaling-law studies. To directly address the concern, we will revise the paper to include: (1) an explicit breakdown of the experimental budget separating tuning runs from mixing-ratio runs, (2) additional details on the proxy-model search grid, and (3) new validation curves confirming that μP-transferred hyperparameters remain near-optimal at larger scales. These changes constitute a partial revision that clarifies but does not overturn the reported results.
Revision: partial
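The rebuttal leans on μP transfer. As a rough illustration of the step, assuming the μP rule for Adam-like optimizers (hidden-layer learning rate scales inversely with width) and an assumed base learning rate, a rate tuned on the narrow proxy can be carried to wider models; the widths follow the per-scale hidden dimensions the paper reports (512, 1024, 1280, 2048).

```python
def mup_transfer_lr(base_lr, base_width, target_width):
    """Scale a hidden-layer learning rate tuned on a narrow proxy model to a wider
    model, following the muP rule for Adam-like optimizers (LR ~ 1/width for hidden
    matrices; embedding and readout layers follow different rules not shown here)."""
    return base_lr * base_width / target_width

# Base LR is assumed for illustration; widths match the paper's per-scale hidden dims.
base_lr, base_width = 3e-4, 512          # tuned on the 150M-parameter proxy
for name, width in [("380M", 1024), ("600M", 1280), ("1.43B", 2048)]:
    print(f"{name}: hidden-layer lr ≈ {mup_transfer_lr(base_lr, base_width, width):.2e}")
```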
Circularity Check
No circularity: purely empirical comparison with no derivations or self-referential predictions
full rationale
The paper reports results from ~1000 direct pre-training runs comparing mixing ratios against hyperparameter tuning across four scales. All central claims (mixing outperforming tuning, quantified equivalents of 2-3× target data on loss and 2-13× on accuracy, scaling with model size) are experimental measurements, not outputs of equations that reduce to the inputs by construction. No self-definitional steps, fitted parameters renamed as predictions, or load-bearing self-citations appear in the derivation chain. The μP transfer recommendation is a practical heuristic drawn from external prior work and does not tautologically force the reported gaps. The study is self-contained against its own benchmarks.
Axiom & Free-Parameter Ledger
free parameters (1)
- mixing ratio
axioms (1)
- domain assumption: English data supplies useful, diverse knowledge for Arabic without harmful domain interference