pith. machine review for the scientific record.

arxiv: 2604.13286 · v1 · submitted 2026-04-14 · 💻 cs.CL · cs.AI

Recognition: unknown

English is Not All You Need: Systematically Exploring the Role of Multilinguality in LLM Post-Training

Dezhi Hong, Mehak Dhaliwal, Shashwat Chaurasia, Thomas Butler, Yao Qin

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 15:02 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords multilingual LLMs · post-training · supervised fine-tuning · cross-lingual transfer · language coverage · mathematical reasoning · API calling

The pith

Adding languages during post-training improves LLM performance on English and other languages alike.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper runs 220 controlled fine-tuning experiments on parallel translated data covering mathematical reasoning and API calling tasks. It tests how language coverage interacts with model scale up to 8 billion parameters. Results show that broader language inclusion helps most tasks, with the largest gains for low-resource languages and no drop for high-resource ones. Even adding a single non-English language lifts English results and cross-lingual abilities. This indicates that English-only post-training leaves performance on the table.

Core claim

Through systematic experiments, the authors show that increasing language coverage during post-training is largely beneficial across tasks and model scales, with low-resource languages benefiting the most and high-resource languages plateauing rather than degrading. Even minimal multilinguality helps: incorporating a single non-English language improves both English performance and cross-lingual generalization, making English-only post-training largely suboptimal. Moreover, at sufficient language diversity, zero-shot cross-lingual transfer can match or exceed the effects of direct language inclusion in a low-diversity setting, although gains remain limited for typologically distant, low-resource languages.

What carries the argument

The controlled variation of language coverage in parallel translated multilingual data mixtures for supervised fine-tuning on mathematical reasoning and API calling tasks.
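
As a concrete picture of what "controlled variation of language coverage" means, here is a minimal sketch of how such mixtures could be enumerated, in the spirit of Figure 1 (exhaustive combinations over a small core pool, then extension toward ten languages). The language codes and the enumeration rule are illustrative assumptions; the paper's exact 22 mixtures are not reproduced here.

    from itertools import combinations

    # Hypothetical reconstruction of the mixture-enumeration design sketched in
    # Figure 1: exhaustive subsets of a small core language pool, then the full
    # core set extended one additional language at a time (toward ten languages).
    # The language codes below are placeholders, not the paper's actual set.
    CORE_LANGUAGES = ["en", "de", "es", "ru", "zh"]
    EXTRA_LANGUAGES = ["ja", "ko", "hi", "sw", "th"]

    def enumerate_mixtures(core, extra):
        """Yield candidate language mixtures of increasing size."""
        # Every non-empty subset of the core pool (increasing language counts).
        for k in range(1, len(core) + 1):
            for subset in combinations(core, k):
                yield list(subset)
        # Then grow beyond the core pool one language at a time.
        mixture = list(core)
        for lang in extra:
            mixture = mixture + [lang]
            yield mixture

    if __name__ == "__main__":
        for mix in enumerate_mixtures(CORE_LANGUAGES, EXTRA_LANGUAGES):
            print(f"{len(mix)} languages: {mix}")

Each mixture would then define one supervised fine-tuning run over the corresponding parallel subsets of the math-reasoning and API-calling data.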

If this is right

  • Low-resource languages receive the largest performance gains as more languages enter the post-training mix.
  • High-resource languages maintain stable performance rather than declining with added language diversity.
  • English task accuracy rises when even one non-English language is included in the data.
  • Zero-shot cross-lingual transfer becomes competitive with direct training once overall language diversity is high.
  • The benefits appear consistently across model sizes up to 8 billion parameters.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Post-training pipelines could default to multilingual mixtures to raise overall capability without separate English-only runs.
  • Typologically distant languages may still need targeted methods beyond simple data inclusion to reach full performance.
  • Similar language-coverage effects might appear in other post-training stages such as preference tuning if tested the same way.

Load-bearing premise

The translated data mixtures preserve task meaning and difficulty across languages without introducing translation artifacts or quality loss.

What would settle it

Repeating the experiments on naturally occurring non-translated multilingual data and observing that performance stops improving or declines as language coverage increases.

Figures

Figures reproduced from arXiv: 2604.13286 by Dezhi Hong, Mehak Dhaliwal, Shashwat Chaurasia, Thomas Butler, Yao Qin.

Figure 1
Figure 1: Overview of the experimental design. We start from a task-specific multilingual parallel data pool consisting of five core languages, which are used to construct exhaustive data mixture combinations, and five additional languages that enable scaling the experiments to up to ten languages. We generate 22 data mixtures with increasing language counts and varying combinations.
Figure 2
Figure 2: Effect of increasing training language coverage on model performance for Qwen-3 and Gemma-3 models. Plots show average accuracy (%) with 95% Wilson confidence intervals as a function of the number of training languages, grouped by high-resource and low-resource evaluation languages, for API calling (top) and math reasoning (bottom).
Figure 3
Figure 3: Median accuracy (%) change from bilingual versus English-only post-training across evaluation settings for API calling and math reasoning. Error bars show 95% bootstrap confidence intervals. Bilingual training yields consistent gains across evaluation settings, with the largest improvements under direct exposure and smaller but reliable gains under cross-lingual transfer.
Figure 4
Figure 4: Comparison of zero-shot cross-lingual transfer versus direct bilingual exposure at varying levels of language diversity for Qwen-3 8B (top) and Gemma-3 4B (bottom), with 4 (left), 6 (middle), and 9 (right) training languages. High-resource languages (red) tend to cluster near the diagonal, indicating strong zero-shot generalization that can compensate for the absence of direct inclusion.
Figure 5
Figure 5: Example showing the final turns of a parallel multi-turn API calling interaction.
Figure 6
Figure 6: Prompt used to create mAPICall-Bank. We construct mAPICall-Bank by translating the API calling subset of the original English API-Bank dataset into multiple target languages using a state-of-the-art large language model.
Figure 7
Figure 7: Change in accuracy relative to the pre-trained baseline for bilingual versus English…
Figure 8
Figure 8: Per-language multilingual scaling trends for (a) Qwen-3 models, and (b) Gemma-3…
Figure 9
Figure 9: Comparison of zero-shot cross-lingual transfer and direct bilingual exposure for smaller models. Rows correspond to Qwen-3 1.7B (top), Qwen-3 0.6B (middle), and Gemma-3 1B (bottom), with columns showing 4 (left), 6 (middle), and 9 (right) training languages.
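
Figures 2 and 3 report 95% Wilson and bootstrap confidence intervals around per-language accuracy. For readers who want the Wilson interval spelled out, here is a minimal sketch; the counts in the example are made up, not taken from the paper.

    import math

    def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
        """95% Wilson score interval for a binomial proportion, the interval type
        Figure 2 reports around per-language accuracy."""
        if n <= 0:
            raise ValueError("n must be positive")
        p_hat = successes / n
        denom = 1.0 + z * z / n
        center = (p_hat + z * z / (2 * n)) / denom
        half = (z / denom) * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n))
        return center - half, center + half

    # Illustrative numbers only: 140 correct answers out of 200 evaluation items.
    low, high = wilson_interval(140, 200)
    print(f"accuracy 70.0%, 95% Wilson CI [{100 * low:.1f}%, {100 * high:.1f}%]")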
read the original abstract

Despite the widespread multilingual deployment of large language models, post-training pipelines remain predominantly English-centric, contributing to performance disparities across languages. We present a systematic, controlled study of the interplay between training language coverage, model scale, and task domain, based on 220 supervised fine-tuning runs on parallel translated multilingual data mixtures spanning mathematical reasoning and API calling tasks, with models up to 8B parameters. We find that increasing language coverage during post-training is largely beneficial across tasks and model scales, with low-resource languages benefiting the most and high-resource languages plateauing rather than degrading. Even minimal multilinguality helps: incorporating a single non-English language improves both English performance and cross-lingual generalization, making English-only post-training largely suboptimal. Moreover, at sufficient language diversity, zero-shot cross-lingual transfer can match or exceed the effects of direct language inclusion in a low-diversity setting, although gains remain limited for typologically distant, low-resource languages.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper reports a controlled empirical study of 220 supervised fine-tuning runs on models up to 8B parameters, using parallel translated mixtures of mathematical reasoning and API calling tasks. It claims that increasing language coverage during post-training is largely beneficial, with low-resource languages gaining the most and high-resource languages plateauing; that even a single non-English language improves both English performance and cross-lingual generalization; and that, at sufficient diversity, zero-shot transfer can match direct inclusion.

Significance. If the translation fidelity holds, the scale of the controlled experiments (220 runs across scales and tasks) and the finding that minimal multilinguality improves English performance would provide actionable guidance against English-only post-training pipelines. The systematic variation of language coverage, model size, and task domain strengthens the potential contribution to multilingual LLM training practices.

major comments (1)
  1. [Experimental setup / data mixtures] The headline claims rest on translated versions of the original English problems for both mathematical reasoning and API calling (abstract and experimental setup), yet no back-translation accuracy, human task-fidelity checks, or controls isolating translation artifacts from genuine language-coverage effects are reported. Because translation errors in numbers, operators, or API descriptions would directly alter task difficulty or functional equivalence, gains attributed to multilinguality could instead reflect increased token volume, regularization from noise, or data-quality artifacts; this assumption is load-bearing for the central empirical result.
minor comments (2)
  1. [Abstract / Results] The abstract states that the study comprises 220 controlled runs but provides no visible details on statistical significance testing, variance across seeds, or exact data-mixture token counts per language; adding these would strengthen the reported improvements (a minimal significance-testing sketch follows this list).
  2. [Results] The paper would benefit from an explicit error analysis or qualitative examples showing how performance changes with added languages, particularly for the typologically distant low-resource languages where gains remain limited.
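
One way to address the first minor comment would be a paired bootstrap over per-item correctness, comparing a multilingual run against its English-only counterpart on the same evaluation items. The sketch below is illustrative only; the function name and the 0/1 correctness vectors are assumptions, not part of the paper's evaluation code.

    import random

    def paired_bootstrap_diff(correct_a, correct_b, n_resamples=10_000, seed=0):
        """Paired bootstrap over per-item 0/1 correctness for two systems scored
        on the same test items (e.g. multilingual vs. English-only SFT).
        Returns the observed accuracy difference and a 95% percentile CI."""
        if len(correct_a) != len(correct_b):
            raise ValueError("both systems must be scored on the same items")
        n = len(correct_a)
        rng = random.Random(seed)
        diffs = []
        for _ in range(n_resamples):
            idx = [rng.randrange(n) for _ in range(n)]
            acc_a = sum(correct_a[i] for i in idx) / n
            acc_b = sum(correct_b[i] for i in idx) / n
            diffs.append(acc_a - acc_b)
        diffs.sort()
        observed = sum(correct_a) / n - sum(correct_b) / n
        return observed, (diffs[int(0.025 * n_resamples)], diffs[int(0.975 * n_resamples)])

A confidence interval that excludes zero would support the claim that a given multilingual mixture genuinely outperforms English-only post-training on that evaluation language.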

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful reading and for identifying a key assumption in our experimental design. We address the concern regarding translation fidelity below and outline concrete revisions.

read point-by-point responses
  1. Referee: [Experimental setup / data mixtures] The headline claims rest on translated versions of the original English problems for both mathematical reasoning and API calling (abstract and experimental setup), yet no back-translation accuracy, human task-fidelity checks, or controls isolating translation artifacts from genuine language-coverage effects are reported. Because translation errors in numbers, operators, or API descriptions would directly alter task difficulty or functional equivalence, gains attributed to multilinguality could instead reflect increased token volume, regularization from noise, or data-quality artifacts; this assumption is load-bearing for the central empirical result.

    Authors: We agree that translation quality is a load-bearing assumption and that the absence of explicit fidelity checks leaves open the possibility of confounding artifacts. In the original experiments the data were produced via a high-quality neural machine translation pipeline applied to the English sources, but we did not quantify fidelity or run controls. In the revised manuscript we will add a dedicated subsection that reports: (i) back-translation BLEU scores on held-out samples for both the mathematical-reasoning and API-calling mixtures, (ii) human evaluation of task equivalence on a stratified sample of 200 instances (100 per task) performed by native speakers, and (iii) a control condition in which the English-only training set is augmented with synthetic noise and extra tokens to match the volume and approximate noise characteristics of the multilingual mixtures. These additions will allow readers to assess whether the reported gains are driven by language coverage rather than data artifacts. revision: yes
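
The fidelity check proposed in the rebuttal could look roughly like the sketch below: back-translate each non-English item into English and score it against the original with corpus BLEU via the sacrebleu package. The translate_back callable is a placeholder for whatever MT system or LLM is used; this is not the authors' released code.

    import sacrebleu  # pip install sacrebleu

    def back_translation_bleu(english_sources, translated_items, translate_back):
        """Round-trip fidelity check: back-translate each item into English and
        score the result against the original English sources with corpus BLEU.
        translate_back is a placeholder callable (an MT model or LLM wrapper)."""
        back_translations = [translate_back(item) for item in translated_items]
        return sacrebleu.corpus_bleu(back_translations, [english_sources]).score

    # Illustrative call with a stand-in back-translator; real usage would plug in
    # an actual translation system and the held-out samples the rebuttal mentions.
    sources = ["Add 3 and 4, then multiply the result by 2.",
               "Call get_weather with the city set to Paris."]
    targets = ["Addiere 3 und 4 und multipliziere das Ergebnis mit 2.",
               "Rufe get_weather mit der Stadt Paris auf."]
    score = back_translation_bleu(sources, targets, translate_back=lambda text: text)
    print(f"back-translation BLEU (stand-in translator): {score:.1f}")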

Circularity Check

0 steps flagged

No circularity: purely empirical measurements from controlled SFT runs

full rationale

The paper presents results exclusively from 220 supervised fine-tuning experiments on parallel translated data mixtures for math reasoning and API calling tasks. No derivation chains, equations, fitted parameters renamed as predictions, or load-bearing self-cited uniqueness theorems appear in the abstract or the described methodology. All headline claims (benefits of language coverage, minimal multilinguality effects, zero-shot transfer) are stated as direct experimental outcomes rather than reductions to prior inputs or self-referential definitions. The study is evaluated against external benchmarks rather than against constructs of its own.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Empirical paper with no free parameters or invented entities; central claims rest on standard domain assumptions about data translation fidelity and task representativeness in LLM fine-tuning.

axioms (1)
  • domain assumption: Parallel translated data mixtures accurately preserve task semantics and difficulty across languages.
    Required for the controlled comparison of language coverage effects to be valid.

pith-pipeline@v0.9.0 · 5478 in / 1319 out tokens · 52675 ms · 2026-05-10T15:02:21.727486+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

16 extracted references · 10 canonical work pages · 2 internal anchors

  1. [1]

    When is multilinguality a curse? language modeling for 250 high- and low-resource languages

    Tyler A Chang, Catherine Arnett, Zhuowen Tu, and Ben Bergen. When is multilinguality a curse? language modeling for 250 high- and low-resource languages. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 4074–4096, 2024.

  2. [2]

    Monolingual or multilingual instruction tuning: Which makes a better alpaca

    Pinzhen Chen, Shaoxiong Ji, Nikolay Bogoychev, Andrey Kutuzov, Barry Haddow, and Kenneth Heafield. Monolingual or multilingual instruction tuning: Which makes a better alpaca. In Findings of the Association for Computational Linguistics: EACL 2024, pp. 1347–1356, 2024.

  3. [3]

    Do multilingual language models think better in english?

    Julen Etxaniz, Gorka Azkune, Aitor Soroa, Oier Lopez de Lacalle, and Mikel Artetxe. Do multilingual language models think better in english? In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers), pp. 550–564, 2024.

  4. [4]

    Benchmax: A comprehensive multilingual evaluation suite for large language models

    Xu Huang, Wenhao Zhu, Hanxu Hu, Conghui He, Lei Li, Shujian Huang, and Fei Yuan. Benchmax: A comprehensive multilingual evaluation suite for large language models. arXiv preprint arXiv:2502.07346, 2025.

  5. [5]

    Gemma 3 Technical Report

    Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Rivière, Louis Rouillard, et al. Gemma 3 technical report. arXiv preprint arXiv:2503.19786, 2025.

  6. [6]

    Turning english-centric llms into polyglots: How much multilinguality is needed?

    Tannon Kew, Florian Schottmann, and Rico Sennrich. Turning english-centric llms into polyglots: How much multilinguality is needed? In Findings of the Association for Computational Linguistics: EMNLP 2024, pp. 13097–13124, 2024.

  7. [7]

    Evaluating the diversity, equity, and inclusion of nlp technology: A case study for indian languages

    Simran Khanuja, Sebastian Ruder, and Partha Talukdar. Evaluating the diversity, equity, and inclusion of nlp technology: A case study for indian languages. In Findings of the Association for Computational Linguistics: EACL 2023, pp. 1763–1777, 2023.

  8. [8]

    Massive-agents: A benchmark for multilingual function-calling in 52 languages

    Mayank Kulkarni, Vittorio Mazzia, Judith Gaspers, Christopher Hench, Jack FitzGerald, and AGI Amazon. Massive-agents: A benchmark for multilingual function-calling in 52 languages. In Findings of the Association for Computational Linguistics: EMNLP 2025, pp. 20193–20215, 2025.

  9. [9]

    mCoT: Multilingual instruction tuning for reasoning consistency in language models

    Huiyuan Lai and Malvina Nissim. mCoT: Multilingual instruction tuning for reasoning consistency in language models. arXiv preprint arXiv:2406.02301, 2024.

  10. [10]

    Api-bank: A comprehensive benchmark for tool-augmented llms

    Minghao Li, Yingxiu Zhao, Bowen Yu, Feifan Song, Hangyu Li, Haiyang Yu, Zhoujun Li, Fei Huang, and Yongbin Li. Api-bank: A comprehensive benchmark for tool-augmented llms. arXiv preprint arXiv:2304.08244, 2023.

  11. [11]

    Atlas: Adaptive transfer scaling laws for multilingual pretraining, finetuning, and decoding the curse of multilinguality

    Shayne Longpre, Sneha Kudugunta, Niklas Muennighoff, I Hsu, Isaac Caswell, Alex Pentland, Sercan Arik, Chen-Yu Lee, Sayna Ebrahimi, et al. Atlas: Adaptive transfer scaling laws for multilingual pretraining, finetuning, and decoding the curse of multilinguality. arXiv preprint arXiv:2510.22037, 2025.

  12. [12]

    Do multilingual llms think in english?

    Lisa Schut, Yarin Gal, and Sebastian Farquhar. Do multilingual llms think in english? arXiv preprint arXiv:2502.15603, 2025.

  13. [13]

    Multilingual instruction tuning with just a pinch of multilinguality

    Uri Shaham, Jonathan Herzig, Roee Aharoni, Idan Szpektor, Reut Tsarfaty, and Matan Eyal. Multilingual instruction tuning with just a pinch of multilinguality. arXiv preprint arXiv:2401.01854, 2024.

  14. [14]

    Language models are multilingual chain-of-thought reasoners

    Freda Shi, Mirac Suzgun, Markus Freitag, Xuezhi Wang, Suraj Srivats, Soroush Vosoughi, Hyung Won Chung, Yi Tay, Sebastian Ruder, Denny Zhou, et al. Language models are multilingual chain-of-thought reasoners. arXiv preprint arXiv:2210.03057, 2022.

  15. [15]

    A post-trainer's guide to multilingual training data: Uncovering cross-lingual transfer dynamics

    Luisa Shimabucoro, Ahmet Ustun, Marzieh Fadaee, and Sebastian Ruder. A post-trainer's guide to multilingual training data: Uncovering cross-lingual transfer dynamics. arXiv preprint arXiv:2504.16677, 2025.

  16. [16]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.