pith. machine review for the scientific record.

arxiv: 2604.13286 · v1 · submitted 2026-04-14 · 💻 cs.CL · cs.AI

Recognition: unknown

English is Not All You Need: Systematically Exploring the Role of Multilinguality in LLM Post-Training

Dezhi Hong, Mehak Dhaliwal, Shashwat Chaurasia, Thomas Butler, Yao Qin

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 15:02 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords multilingual LLMs · post-training · supervised fine-tuning · cross-lingual transfer · language coverage · mathematical reasoning · API calling

The pith

Adding languages during post-training improves LLM performance on English and other languages alike.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper runs 220 controlled fine-tuning experiments on parallel translated data covering mathematical reasoning and API calling tasks. It tests how language coverage interacts with model scale up to 8 billion parameters. Results show that broader language inclusion helps most tasks, with the largest gains for low-resource languages and no drop for high-resource ones. Even adding a single non-English language lifts English results and cross-lingual abilities. This indicates that English-only post-training leaves performance on the table.

Core claim

Through systematic experiments, the authors show that increasing language coverage during post-training is largely beneficial across tasks and model scales, with low-resource languages benefiting the most and high-resource languages plateauing rather than degrading. Even minimal multilinguality helps: incorporating a single non-English language improves both English performance and cross-lingual generalization, making English-only post-training largely suboptimal. Moreover, at sufficient language diversity, zero-shot cross-lingual transfer can match or exceed the effects of direct language inclusion in a low-diversity setting, although gains remain limited for typologically distant, low-resource languages.

What carries the argument

The controlled variation of language coverage in parallel translated multilingual data mixtures for supervised fine-tuning on mathematical reasoning and API calling tasks.
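
As a concrete picture of what "controlled variation of language coverage" means, here is a minimal sketch of how such mixtures could be enumerated, in the spirit of Figure 1 (exhaustive combinations over a small core pool, then extension toward ten languages). The language codes and the enumeration rule are illustrative assumptions; the paper's exact 22 mixtures are not reproduced here.

    from itertools import combinations

    # Hypothetical reconstruction of the mixture-enumeration design sketched in
    # Figure 1: exhaustive subsets of a small core language pool, then the full
    # core set extended one additional language at a time (toward ten languages).
    # The language codes below are placeholders, not the paper's actual set.
    CORE_LANGUAGES = ["en", "de", "es", "ru", "zh"]
    EXTRA_LANGUAGES = ["ja", "ko", "hi", "sw", "th"]

    def enumerate_mixtures(core, extra):
        """Yield candidate language mixtures of increasing size."""
        # Every non-empty subset of the core pool (increasing language counts).
        for k in range(1, len(core) + 1):
            for subset in combinations(core, k):
                yield list(subset)
        # Then grow beyond the core pool one language at a time.
        mixture = list(core)
        for lang in extra:
            mixture = mixture + [lang]
            yield mixture

    if __name__ == "__main__":
        for mix in enumerate_mixtures(CORE_LANGUAGES, EXTRA_LANGUAGES):
            print(f"{len(mix)} languages: {mix}")

Each mixture would then define one supervised fine-tuning run over the corresponding parallel subsets of the math-reasoning and API-calling data.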

If this is right

  • Low-resource languages receive the largest performance gains as more languages enter the post-training mix.
  • High-resource languages maintain stable performance rather than declining with added language diversity.
  • English task accuracy rises when even one non-English language is included in the data.
  • Zero-shot cross-lingual transfer becomes competitive with direct training once overall language diversity is high.
  • The benefits appear consistently across model sizes up to 8 billion parameters.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Post-training pipelines could default to multilingual mixtures to raise overall capability without separate English-only runs.
  • Typologically distant languages may still need targeted methods beyond simple data inclusion to reach full performance.
  • Similar language-coverage effects might appear in other post-training stages such as preference tuning if tested the same way.

Load-bearing premise

The translated data mixtures preserve task meaning and difficulty across languages without introducing translation artifacts or quality loss.

What would settle it

Repeating the experiments on naturally occurring non-translated multilingual data and observing that performance stops improving or declines as language coverage increases.

Figures

Figures reproduced from arXiv: 2604.13286 by Dezhi Hong, Mehak Dhaliwal, Shashwat Chaurasia, Thomas Butler, Yao Qin.

Figure 1
Figure 1: Overview of the experimental design. We start from a task-specific multilingual parallel data pool consisting of five core languages, which are used to construct exhaustive data mixture combinations, and five additional languages that enable scaling the experiments to up to ten languages. We generate 22 data mixtures with increasing language counts and varying combinations.
Figure 2
Figure 2: Effect of increasing training language coverage on model performance for Qwen-3 and Gemma-3 models. Plots show average accuracy (%) with 95% Wilson confidence intervals as a function of the number of training languages, grouped by high-resource and low-resource evaluation languages, for API calling (top) and math reasoning (bottom).
Figure 3
Figure 3: Median accuracy (%) change from bilingual versus English-only post-training across evaluation settings for API calling and math reasoning. Error bars show 95% bootstrap confidence intervals. Bilingual training yields consistent gains across evaluation settings, with the largest improvements under direct exposure and smaller but reliable gains under cross-lingual transfer.
Figure 4
Figure 4: Comparison of zero-shot cross-lingual transfer versus direct bilingual exposure at varying levels of language diversity for Qwen-3 8B (top) and Gemma-3 4B (bottom), with 4 (left), 6 (middle), and 9 (right) training languages. High-resource languages (red) tend to cluster near the diagonal, indicating strong zero-shot generalization that can compensate for the absence of direct inclusion.
Figure 5
Figure 5: Example showing the final turns of a parallel multi-turn API calling interaction.
Figure 6
Figure 6: Prompt used to create mAPICall-Bank. We construct mAPICall-Bank by translating the API calling subset of the original English API-Bank dataset into multiple target languages using a state-of-the-art large language model.
Figure 7
Figure 7: Change in accuracy relative to the pre-trained baseline for bilingual versus English…
Figure 8
Figure 8: Per-language multilingual scaling trends for (a) Qwen-3 models, and (b) Gemma-3…
Figure 9
Figure 9: Comparison of zero-shot cross-lingual transfer and direct bilingual exposure for smaller models. Rows correspond to Qwen-3 1.7B (top), Qwen-3 0.6B (middle), and Gemma-3 1B (bottom), with columns showing 4 (left), 6 (middle), and 9 (right) training languages.
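
Figures 2 and 3 report 95% Wilson and bootstrap confidence intervals around per-language accuracy. For readers who want the Wilson interval spelled out, here is a minimal sketch; the counts in the example are made up, not taken from the paper.

    import math

    def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
        """95% Wilson score interval for a binomial proportion, the interval type
        Figure 2 reports around per-language accuracy."""
        if n <= 0:
            raise ValueError("n must be positive")
        p_hat = successes / n
        denom = 1.0 + z * z / n
        center = (p_hat + z * z / (2 * n)) / denom
        half = (z / denom) * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n))
        return center - half, center + half

    # Illustrative numbers only: 140 correct answers out of 200 evaluation items.
    low, high = wilson_interval(140, 200)
    print(f"accuracy 70.0%, 95% Wilson CI [{100 * low:.1f}%, {100 * high:.1f}%]")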
read the original abstract

Despite the widespread multilingual deployment of large language models, post-training pipelines remain predominantly English-centric, contributing to performance disparities across languages. We present a systematic, controlled study of the interplay between training language coverage, model scale, and task domain, based on 220 supervised fine-tuning runs on parallel translated multilingual data mixtures spanning mathematical reasoning and API calling tasks, with models up to 8B parameters. We find that increasing language coverage during post-training is largely beneficial across tasks and model scales, with low-resource languages benefiting the most and high-resource languages plateauing rather than degrading. Even minimal multilinguality helps: incorporating a single non-English language improves both English performance and cross-lingual generalization, making English-only post-training largely suboptimal. Moreover, at sufficient language diversity, zero-shot cross-lingual transfer can match or exceed the effects of direct language inclusion in a low-diversity setting, although gains remain limited for typologically distant, low-resource languages.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper reports a controlled empirical study of 220 supervised fine-tuning runs on models up to 8B parameters, using parallel translated mixtures of mathematical reasoning and API calling tasks. It claims that increasing language coverage during post-training is largely beneficial, with low-resource languages gaining the most and high-resource languages plateauing; that even a single non-English language improves both English performance and cross-lingual generalization; and that, at sufficient diversity, zero-shot transfer can match direct inclusion.

Significance. If the translation fidelity holds, the scale of the controlled experiments (220 runs across scales and tasks) and the finding that minimal multilinguality improves English performance would provide actionable guidance against English-only post-training pipelines. The systematic variation of language coverage, model size, and task domain strengthens the potential contribution to multilingual LLM training practices.

major comments (1)
  1. [Experimental setup / data mixtures] The headline claims rest on translated versions of the original English problems for both mathematical reasoning and API calling (abstract and experimental setup), yet no back-translation accuracy, human task-fidelity checks, or controls isolating translation artifacts from genuine language-coverage effects are reported. Because translation errors in numbers, operators, or API descriptions would directly alter task difficulty or functional equivalence, gains attributed to multilinguality could instead reflect increased token volume, regularization from noise, or data-quality artifacts; this assumption is load-bearing for the central empirical result.
minor comments (2)
  1. [Abstract / Results] The abstract states that the study comprises 220 controlled runs but provides no visible details on statistical significance testing, variance across seeds, or exact data-mixture token counts per language; adding these would strengthen the reported improvements (a minimal significance-testing sketch follows this list).
  2. [Results] The paper would benefit from an explicit error analysis or qualitative examples showing how performance changes with added languages, particularly for the typologically distant low-resource languages where gains remain limited.
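
One way to address the first minor comment would be a paired bootstrap over per-item correctness, comparing a multilingual run against its English-only counterpart on the same evaluation items. The sketch below is illustrative only; the function name and the 0/1 correctness vectors are assumptions, not part of the paper's evaluation code.

    import random

    def paired_bootstrap_diff(correct_a, correct_b, n_resamples=10_000, seed=0):
        """Paired bootstrap over per-item 0/1 correctness for two systems scored
        on the same test items (e.g. multilingual vs. English-only SFT).
        Returns the observed accuracy difference and a 95% percentile CI."""
        if len(correct_a) != len(correct_b):
            raise ValueError("both systems must be scored on the same items")
        n = len(correct_a)
        rng = random.Random(seed)
        diffs = []
        for _ in range(n_resamples):
            idx = [rng.randrange(n) for _ in range(n)]
            acc_a = sum(correct_a[i] for i in idx) / n
            acc_b = sum(correct_b[i] for i in idx) / n
            diffs.append(acc_a - acc_b)
        diffs.sort()
        observed = sum(correct_a) / n - sum(correct_b) / n
        return observed, (diffs[int(0.025 * n_resamples)], diffs[int(0.975 * n_resamples)])

A confidence interval that excludes zero would support the claim that a given multilingual mixture genuinely outperforms English-only post-training on that evaluation language.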

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful reading and for identifying a key assumption in our experimental design. We address the concern regarding translation fidelity below and outline concrete revisions.

read point-by-point responses
  1. Referee: [Experimental setup / data mixtures] The headline claims rest on translated versions of the original English problems for both mathematical reasoning and API calling (abstract and experimental setup), yet no back-translation accuracy, human task-fidelity checks, or controls isolating translation artifacts from genuine language-coverage effects are reported. Because translation errors in numbers, operators, or API descriptions would directly alter task difficulty or functional equivalence, gains attributed to multilinguality could instead reflect increased token volume, regularization from noise, or data-quality artifacts; this assumption is load-bearing for the central empirical result.

    Authors: We agree that translation quality is a load-bearing assumption and that the absence of explicit fidelity checks leaves open the possibility of confounding artifacts. In the original experiments the data were produced via a high-quality neural machine translation pipeline applied to the English sources, but we did not quantify fidelity or run controls. In the revised manuscript we will add a dedicated subsection that reports: (i) back-translation BLEU scores on held-out samples for both the mathematical-reasoning and API-calling mixtures, (ii) human evaluation of task equivalence on a stratified sample of 200 instances (100 per task) performed by native speakers, and (iii) a control condition in which the English-only training set is augmented with synthetic noise and extra tokens to match the volume and approximate noise characteristics of the multilingual mixtures. These additions will allow readers to assess whether the reported gains are driven by language coverage rather than data artifacts. revision: yes
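
The fidelity check proposed in the rebuttal could look roughly like the sketch below: back-translate each non-English item into English and score it against the original with corpus BLEU via the sacrebleu package. The translate_back callable is a placeholder for whatever MT system or LLM is used; this is not the authors' released code.

    import sacrebleu  # pip install sacrebleu

    def back_translation_bleu(english_sources, translated_items, translate_back):
        """Round-trip fidelity check: back-translate each item into English and
        score the result against the original English sources with corpus BLEU.
        translate_back is a placeholder callable (an MT model or LLM wrapper)."""
        back_translations = [translate_back(item) for item in translated_items]
        return sacrebleu.corpus_bleu(back_translations, [english_sources]).score

    # Illustrative call with a stand-in back-translator; real usage would plug in
    # an actual translation system and the held-out samples the rebuttal mentions.
    sources = ["Add 3 and 4, then multiply the result by 2.",
               "Call get_weather with the city set to Paris."]
    targets = ["Addiere 3 und 4 und multipliziere das Ergebnis mit 2.",
               "Rufe get_weather mit der Stadt Paris auf."]
    score = back_translation_bleu(sources, targets, translate_back=lambda text: text)
    print(f"back-translation BLEU (stand-in translator): {score:.1f}")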

Circularity Check

0 steps flagged

No circularity: purely empirical measurements from controlled SFT runs

full rationale

The paper presents results exclusively from 220 supervised fine-tuning experiments on parallel translated data mixtures for math reasoning and API calling tasks. No derivation chains, equations, fitted parameters renamed as predictions, or load-bearing self-cited uniqueness theorems appear in the abstract or the described methodology. All headline claims (benefits of language coverage, minimal multilinguality effects, zero-shot transfer) are stated as direct experimental outcomes rather than reductions to prior inputs or self-referential definitions. The study is evaluated against external benchmarks rather than against constructs of its own.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Empirical paper with no free parameters or invented entities; central claims rest on standard domain assumptions about data translation fidelity and task representativeness in LLM fine-tuning.

axioms (1)
  • domain assumption: Parallel translated data mixtures accurately preserve task semantics and difficulty across languages.
    Required for the controlled comparison of language coverage effects to be valid.

pith-pipeline@v0.9.0 · 5478 in / 1319 out tokens · 52675 ms · 2026-05-10T15:02:21.727486+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

16 extracted references · 10 canonical work pages · 2 internal anchors

  1. [1]

    When is multilinguality a curse? language modeling for 250 high- and low-resource languages

    Tyler A Chang, Catherine Arnett, Zhuowen Tu, and Ben Bergen. When is multilinguality a curse? language modeling for 250 high- and low-resource languages. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 4074–4096, 2024.

  2. [2]

    Monolingual or multilingual instruction tuning: Which makes a better alpaca

    Pinzhen Chen, Shaoxiong Ji, Nikolay Bogoychev, Andrey Kutuzov, Barry Haddow, and Kenneth Heafield. Monolingual or multilingual instruction tuning: Which makes a better alpaca. In Findings of the Association for Computational Linguistics: EACL 2024, pp. 1347–1356, 2024.

  3. [3]

    Do multilingual language models think better in english?

    Julen Etxaniz, Gorka Azkune, Aitor Soroa, Oier Lopez de Lacalle, and Mikel Artetxe. Do multilingual language models think better in english? In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers), pp. 550–564, 2024.

  4. [4]

    Benchmax: A comprehensive multilingual evaluation suite for large language models

    Xu Huang, Wenhao Zhu, Hanxu Hu, Conghui He, Lei Li, Shujian Huang, and Fei Yuan. Benchmax: A comprehensive multilingual evaluation suite for large language models. arXiv preprint arXiv:2502.07346, 2025.

  5. [5]

    Gemma 3 Technical Report

    Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Rivière, Louis Rouillard, et al. Gemma 3 technical report. arXiv preprint arXiv:2503.19786, 2025.

  6. [6]

    Turning english-centric llms into polyglots: How much multilinguality is needed?

    Tannon Kew, Florian Schottmann, and Rico Sennrich. Turning english-centric llms into polyglots: How much multilinguality is needed? In Findings of the Association for Computational Linguistics: EMNLP 2024, pp. 13097–13124, 2024.

  7. [7]

    Evaluating the diversity, equity, and inclusion of nlp technology: A case study for indian languages

    Simran Khanuja, Sebastian Ruder, and Partha Talukdar. Evaluating the diversity, equity, and inclusion of nlp technology: A case study for indian languages. In Findings of the Association for Computational Linguistics: EACL 2023, pp. 1763–1777, 2023.

  8. [8]

    Massive-agents: A benchmark for multilingual function-calling in 52 languages

    Mayank Kulkarni, Vittorio Mazzia, Judith Gaspers, Christopher Hench, Jack FitzGerald, and AGI Amazon. Massive-agents: A benchmark for multilingual function-calling in 52 languages. In Findings of the Association for Computational Linguistics: EMNLP 2025, pp. 20193–20215, 2025.

  9. [9]

    mCoT: Multilingual instruction tuning for reasoning consistency in language models

    Huiyuan Lai and Malvina Nissim. mCoT: Multilingual instruction tuning for reasoning consistency in language models. arXiv preprint arXiv:2406.02301, 2024.

  10. [10]

    Api-bank: A comprehensive benchmark for tool-augmented llms

    Minghao Li, Yingxiu Zhao, Bowen Yu, Feifan Song, Hangyu Li, Haiyang Yu, Zhoujun Li, Fei Huang, and Yongbin Li. Api-bank: A comprehensive benchmark for tool-augmented llms. arXiv preprint arXiv:2304.08244, 2023.

  11. [11]

    Atlas: Adaptive transfer scaling laws for multilingual pretraining, finetuning, and decoding the curse of multilinguality

    Shayne Longpre, Sneha Kudugunta, Niklas Muennighoff, I Hsu, Isaac Caswell, Alex Pentland, Sercan Arik, Chen-Yu Lee, Sayna Ebrahimi, et al. Atlas: Adaptive transfer scaling laws for multilingual pretraining, finetuning, and decoding the curse of multilinguality. arXiv preprint arXiv:2510.22037, 2025.

  12. [12]

    Do multilingual llms think in english?

    Lisa Schut, Yarin Gal, and Sebastian Farquhar. Do multilingual llms think in english? arXiv preprint arXiv:2502.15603, 2025.

  13. [13]

    Multilingual instruction tuning with just a pinch of multilinguality

    Uri Shaham, Jonathan Herzig, Roee Aharoni, Idan Szpektor, Reut Tsarfaty, and Matan Eyal. Multilingual instruction tuning with just a pinch of multilinguality. arXiv preprint arXiv:2401.01854, 2024.

  14. [14]

    Language models are multilingual chain-of-thought reasoners

    Freda Shi, Mirac Suzgun, Markus Freitag, Xuezhi Wang, Suraj Srivats, Soroush Vosoughi, Hyung Won Chung, Yi Tay, Sebastian Ruder, Denny Zhou, et al. Language models are multilingual chain-of-thought reasoners. arXiv preprint arXiv:2210.03057, 2022.

  15. [15]

    A post-trainer's guide to multilingual training data: Uncovering cross-lingual transfer dynamics

    Luisa Shimabucoro, Ahmet Ustun, Marzieh Fadaee, and Sebastian Ruder. A post-trainer's guide to multilingual training data: Uncovering cross-lingual transfer dynamics. arXiv preprint arXiv:2504.16677, 2025.

  16. [16]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.