Mix, Don't Tune: Bilingual Pre-Training Outperforms Hyperparameter Search in Data-Constrained Settings
Pith reviewed 2026-05-14 20:13 UTC · model grok-4.3
The pith
Mixing high-resource language data outperforms hyperparameter tuning for low-resource pre-training.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In data-constrained pre-training, bilingual mixing of English data into Arabic training outperforms hyperparameter search, with benefits equivalent to 2-3 times more unique target data on validation loss and 2-13 times on downstream tasks. The advantage grows with model scale from 150M to 1.43B parameters. Target-language validation loss underestimates mixing's value since mixing adds knowledge beyond regularization from repeated data.
What carries the argument
The tunable mixing ratio between target Arabic and auxiliary English corpora, which expands the training distribution and diversifies the signal without needing more unique target data.
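A minimal sketch, assuming nothing about the paper's actual data pipeline, of how a single mixing ratio turns a small target corpus and a large auxiliary corpus into one training stream; the iterator names and the 0.3 ratio are illustrative, not taken from the paper.

```python
import itertools
import random

def mixed_batches(arabic_batches, english_batches, mix_ratio, seed=0):
    """Yield training batches, drawing from the auxiliary (English) stream with
    probability `mix_ratio` and from the target (Arabic) stream otherwise.

    `arabic_batches` / `english_batches`: infinite iterators over token batches
    (the target stream may cycle through a finite corpus many times).
    `mix_ratio`: fraction of steps spent on auxiliary data; the single tunable
    knob the review calls the mixing ratio.
    """
    rng = random.Random(seed)
    while True:
        if rng.random() < mix_ratio:
            yield next(english_batches)   # auxiliary, high-resource signal
        else:
            yield next(arabic_batches)    # target, data-constrained signal

# Illustrative usage with toy streams (real corpora would be tokenized shards).
arabic = itertools.cycle([f"ar_batch_{i}" for i in range(3)])   # small, repeated corpus
english = (f"en_batch_{i}" for i in itertools.count())          # effectively unbounded
stream = mixed_batches(arabic, english, mix_ratio=0.3)
print([next(stream) for _ in range(8)])
```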
Load-bearing premise
English data supplies useful, non-conflicting signal for Arabic without introducing domain mismatch that would require separate controls.
What would settle it
A controlled run at 1B+ parameters where the best mixing ratio fails to beat the best hyperparameter-tuned monolingual baseline on downstream Arabic task accuracy.
Original abstract
For most languages of the world, language model pre-training operates in a data-constrained regime where models must repeat their training data many times, degrading generalization. Two remedies exist: aggressive hyperparameter tuning such as high weight decay, and mixing in data from a high-resource auxiliary language to directly aid the low-resource target. While hyperparameter tuning regularizes the model by shrinking weights to restrict network capacity, auxiliary data mixing uses a tunable mixing ratio to expand the training distribution and diversify the training signal with new knowledge. Both offer a principled way to improve training in a data-constrained domain. We compare these levers systematically across four model scales from 150M to 1.43B parameters, using Arabic as the low-resource target and English as the auxiliary, over approximately 1000 pre-training runs. Three findings emerge. First, mixing yields larger improvements than hyperparameter tuning on both validation loss and downstream task accuracy, and the gap grows with model size. Second, we quantify how much mixing helps: it boosts performance by an amount equivalent to 2--3$\times$ the unique target data on validation loss and 2--13$\times$ on downstream task accuracy, with the gain scaling steeply with model size. Third, this divergence reveals that target-language validation loss systematically underestimates mixing's value. Mixing regularizes by diversifying the training signal and contributes knowledge the repeated target corpus cannot supply; validation loss captures only the first effect. Our practical recommendations are: mix in a high-resource language, prioritize the mixing ratio over hyperparameter tuning, and transfer hyperparameters from a small proxy model via $\mu$P.
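One plausible way to read the "2--3×" data-equivalence figures is as an inversion of a loss-versus-unique-data curve for the unmixed baseline: find how many unique target tokens the monolingual model would need to match the mixed model's loss. The sketch below illustrates that reading with invented numbers; the log-linear interpolation and the curve values are assumptions, not the paper's reported procedure.

```python
import numpy as np

# Hypothetical monolingual (no-mixing) results: Arabic validation loss as a
# function of unique Arabic tokens at fixed compute. All values are invented.
unique_tokens = np.array([1e9, 2e9, 4e9, 8e9, 16e9])
baseline_loss = np.array([3.60, 3.42, 3.28, 3.18, 3.11])

def data_equivalent(mixed_loss, tokens, losses):
    """Unique target tokens the unmixed baseline would need to reach `mixed_loss`,
    by log-linear interpolation of the baseline curve (losses decrease in tokens)."""
    return float(np.exp(np.interp(mixed_loss, losses[::-1], np.log(tokens)[::-1])))

# Suppose the best mixing ratio at 1.0B unique Arabic tokens reaches loss 3.35.
mixed_loss = 3.35
equiv = data_equivalent(mixed_loss, unique_tokens, baseline_loss)
print(f"mixing at 1.0B unique tokens ~ {equiv / 1e9:.1f}B unique tokens unmixed "
      f"({equiv / 1e9:.1f}x data-equivalent gain)")
```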
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that in data-constrained pre-training regimes, mixing auxiliary high-resource language data (English) into low-resource target data (Arabic) outperforms aggressive hyperparameter tuning on both validation loss and downstream task accuracy. Across roughly 1000 runs at four scales (150M to 1.43B parameters), this advantage grows with model size; mixing is quantified as equivalent to 2-3× the unique target data on validation loss and 2-13× on downstream tasks. The paper concludes that target-language validation loss underestimates mixing benefits and recommends prioritizing the mixing ratio while transferring other hyperparameters via μP from small proxies.
Significance. If the comparison between mixing and tuning is equitable, the result is practically significant for low-resource language modeling. It supplies concrete, scale-dependent quantifications of mixing gains in data-equivalent terms and identifies a systematic mismatch between validation loss and downstream utility. The large experimental budget (~1000 runs across four scales) and the explicit practical recommendations strengthen its utility for practitioners facing repeated-data regimes.
major comments (1)
- [Abstract / recommendations] The central claim that mixing outperforms hyperparameter tuning (with gaps equivalent to 2-13× unique data, widening with scale) rests on the fairness of the tuning baseline. The manuscript recommends transferring hyperparameters via μP from small proxies, yet the ~1000-run budget description suggests mixing-ratio experiments dominate; if weight decay, learning rate, or schedule were not re-optimized at each scale with effort comparable to the mixing search, the reported ordering and magnitudes risk being inflated. A matched-budget protocol, as sketched below, would remove the ambiguity.
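For concreteness, a toy sketch of what an equal-effort comparison could look like: the same number of runs per arm at each scale, one arm searching regularization hyperparameters at zero mixing, the other searching the mixing ratio with fixed transferred hyperparameters. The search spaces and the stand-in objective are invented for illustration, not the paper's grids.

```python
import random

def matched_budget_comparison(train_and_eval, budget_per_arm, seed=0):
    """Give the hyperparameter arm and the mixing arm the same number of runs,
    then compare best validation losses. `train_and_eval(config) -> val_loss`
    is a stand-in for a full pre-training run."""
    rng = random.Random(seed)
    lrs = [1e-4, 3e-4, 1e-3]
    weight_decays = [0.0, 0.1, 0.5, 1.0]
    mix_ratios = [0.0, 0.1, 0.2, 0.3, 0.5, 0.7]

    # Arm 1: tune regularization hyperparameters, no mixing.
    hp_runs = [train_and_eval({"lr": rng.choice(lrs),
                               "weight_decay": rng.choice(weight_decays),
                               "mix_ratio": 0.0})
               for _ in range(budget_per_arm)]
    # Arm 2: tune only the mixing ratio with fixed transferred hyperparameters.
    mix_runs = [train_and_eval({"lr": 3e-4, "weight_decay": 0.1,
                                "mix_ratio": rng.choice(mix_ratios)})
                for _ in range(budget_per_arm)]
    return min(hp_runs), min(mix_runs)  # best loss from each arm

# Usage with a fake objective (real usage would launch pre-training runs).
fake = lambda cfg: 3.5 - 0.3 * cfg["mix_ratio"] - 0.05 * cfg["weight_decay"]
print(matched_budget_comparison(fake, budget_per_arm=10))
```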
minor comments (2)
- [Experimental details] The exact allocation of the ~1000 runs across scales, mixing ratios, and hyperparameter conditions should be summarized in a table to allow readers to assess search effort directly.
- [Evaluation] Clarify whether downstream task accuracy uses the same evaluation protocol and number of shots across all conditions to ensure the 2-13× equivalence claim is robust.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback emphasizing the need for an equitable baseline in our comparison of mixing versus hyperparameter tuning. We address the major comment directly below with clarifications on our experimental design and commit to revisions that strengthen the manuscript without altering its core claims.
Point-by-point responses
- Referee: [Abstract / recommendations] The central claim that mixing outperforms hyperparameter tuning (with gaps equivalent to 2-13× unique data, widening with scale) rests on the fairness of the tuning baseline. The manuscript recommends transferring hyperparameters via μP from small proxies, yet the ~1000-run budget description suggests mixing-ratio experiments dominate; if weight decay, learning rate, or schedule were not re-optimized at each scale with effort comparable to the mixing search, the reported ordering and magnitudes risk being inflated.
Authors: We agree that the fairness of the hyperparameter tuning baseline is essential to the validity of our central claim. Our experimental protocol follows the μP framework by first conducting an extensive hyperparameter search—including learning rate, weight decay, and learning rate schedule—on small proxy models (150M parameters) across hundreds of runs. These optimized values were then transferred to larger scales (up to 1.43B) using μP scaling rules, with the ~1000 total runs primarily allocated to systematically varying mixing ratios at each scale while holding the transferred hyperparameters fixed. Limited spot-check re-optimizations were performed at intermediate scales to validate transfer quality, though a full grid search at every scale was not feasible under our compute budget. This mirrors standard practice in scaling-law studies. To directly address the concern, we will revise the paper to include: (1) an explicit breakdown of the experimental budget separating tuning runs from mixing-ratio runs, (2) additional details on the proxy-model search grid, and (3) new validation curves confirming that μP-transferred hyperparameters remain near-optimal at larger scales. These changes constitute a partial revision that clarifies but does not overturn the reported results.
Revision: partial
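The rebuttal leans on μP transfer. As a rough illustration of the step, assuming the μP rule for Adam-like optimizers (hidden-layer learning rate scales inversely with width) and an assumed base learning rate, a rate tuned on the narrow proxy can be carried to wider models; the widths follow the per-scale hidden dimensions the paper reports (512, 1024, 1280, 2048).

```python
def mup_transfer_lr(base_lr, base_width, target_width):
    """Scale a hidden-layer learning rate tuned on a narrow proxy model to a wider
    model, following the muP rule for Adam-like optimizers (LR ~ 1/width for hidden
    matrices; embedding and readout layers follow different rules not shown here)."""
    return base_lr * base_width / target_width

# Base LR is assumed for illustration; widths match the paper's per-scale hidden dims.
base_lr, base_width = 3e-4, 512          # tuned on the 150M-parameter proxy
for name, width in [("380M", 1024), ("600M", 1280), ("1.43B", 2048)]:
    print(f"{name}: hidden-layer lr ≈ {mup_transfer_lr(base_lr, base_width, width):.2e}")
```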
Circularity Check
No circularity: purely empirical comparison with no derivations or self-referential predictions
full rationale
The paper reports results from ~1000 direct pre-training runs comparing mixing ratios against hyperparameter tuning across four scales. All central claims (mixing outperforming tuning, quantified equivalents of 2-3× target data on loss and 2-13× on accuracy, scaling with model size) are experimental measurements, not outputs of equations that reduce to the inputs by construction. No self-definitional steps, fitted parameters renamed as predictions, or load-bearing self-citations appear in the derivation chain. The μP transfer recommendation is a practical heuristic drawn from external prior work and does not tautologically force the reported gaps. The study is self-contained against its own benchmarks.
Axiom & Free-Parameter Ledger
free parameters (1)
- mixing ratio
axioms (1)
- domain assumption: English data supplies useful, diverse knowledge for Arabic without harmful domain interference