pith. machine review for the scientific record.

arxiv: 2604.11290 · v1 · submitted 2026-04-13 · 💻 cs.CL

Recognition: unknown

Polyglot Teachers: Evaluating Language Models for Multilingual Synthetic Data Generation

Authors on Pith · no claims yet

Pith reviewed 2026-05-10 15:27 UTC · model grok-4.3

classification 💻 cs.CL
keywords multilingual synthetic data · language model teachers · data quality metrics · student model performance · polyglot evaluation · supervised fine-tuning · non-English language models

The pith

Gemma 3 27B and Aya Expanse 32B generate the most effective multilingual synthetic data for training smaller student models, with prompt diversity, length, and response fluency predicting performance better than model scale.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper evaluates ten language models as teachers that create supervised fine-tuning data for six typologically different languages. It generates 1.4 million examples and trains 240 student models to compare intrinsic data properties against downstream results. The central finding is that two mid-sized models outperform larger ones on average, and three measurable data qualities account for over 93 percent of the variation in how well the synthetic data works. This matters because practitioners often default to the biggest available model for multilingual tasks, yet that choice can produce weaker students than selecting teachers by observable data traits. The work also offers concrete guidance on pairing teacher and student families and on using translation to boost results for lower-resource languages.

Core claim

When language models are used to synthesize supervised fine-tuning data across six languages, Gemma 3 27B and Aya Expanse 32B produce consistently higher-quality examples that improve student performance across different base model families. Model scale alone shows little correlation with teacher quality. Instead, prompt diversity, response length, and fluency together explain more than 93.3 percent of the variance in intrinsic data quality and directly predict how well students perform on multilingual benchmarks.

What carries the argument

The Polyglot Score, a combined metric that links intrinsic measures of generated data (diversity, length, fluency) to extrinsic student model accuracy after training.
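The paper's exact formula for the Polyglot Score is not reproduced on this page; a minimal sketch of the idea, a weighted combination of normalized intrinsic data metrics and extrinsic student accuracy, with hypothetical metric names and weighting, might look like:

```python
import numpy as np

def polyglot_score(intrinsic, extrinsic, alpha=0.5):
    """Hypothetical sketch of a combined teacher-quality score.

    intrinsic: dict of intrinsic data metrics (diversity, length, fluency),
               each assumed already z-normalized across the teacher pool.
    extrinsic: z-normalized student benchmark accuracy after finetuning.
    alpha:     weight on the intrinsic component (per Figure 7, the paper
               reports rankings stay stable across neighboring weightings).
    """
    intrinsic_part = np.mean(list(intrinsic.values()))
    return alpha * intrinsic_part + (1.0 - alpha) * extrinsic

# Example: a teacher with above-average data quality and student accuracy.
score = polyglot_score(
    {"diversity": 0.8, "length": 0.3, "fluency": 1.1},
    extrinsic=0.9,
)
```

The equal weighting here is an assumption for illustration; the stability result in Figure 7 suggests the exact choice of alpha matters less than the metrics themselves.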

If this is right

  • Matching the model family of the teacher to the student yields better downstream results than using the largest available teacher regardless of family.
  • For lower-resource languages, translating existing English prompts or having the teacher respond in the target language improves data quality and student performance.
  • Practitioners can screen candidate teachers by measuring prompt diversity, response length, and fluency on a small sample rather than running full student trainings.
  • Scale is not a reliable proxy for teacher quality in multilingual synthetic data generation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same data-quality signals could be used to filter or rewrite existing synthetic datasets before training.
  • The approach may generalize to other data-generation tasks such as preference tuning or reasoning chains if similar quality metrics are defined.
  • Developers of new teacher models could optimize directly for the three measured data traits rather than for general capability benchmarks.

Load-bearing premise

The Polyglot Score and the specific set of six languages, ten teacher models, and 240 student trainings accurately reflect how teacher effectiveness would behave in broader real-world multilingual settings.

What would settle it

Retrain the same student architectures on new synthetic data from the same teachers, but in additional languages or on different evaluation tasks. If the top teachers change, or the three data-quality features no longer account for most of the variance in student results, the central claim fails.

Figures

Figures reproduced from arXiv: 2604.11290 by Anna Korhonen, Ivan Vulić, Lester James V. Miranda.

Figure 1. Overview of the method for evaluating language models as multilingual teachers (Polyglot Score). Teachers are evaluated on three data-generation methods: Generate (produce a prompt-response pair from few-shot examples), Translate (translate prompts from English, then generate a response), and Respond (answer an existing prompt in the target language). The Polyglot Score incorporates both intrinsic data quality and extrinsic student performance.

Figure 2. PG-Score across different base models (averaged over Arabic, German, and Indonesian). Left: average PG-Score of each teacher model on students finetuned from three different base models, with the top three teachers highlighted per setting. Right: heatmap of Spearman rank correlation ρ of teacher-model rankings across base models.

Figure 4. Fit of a linear regression on the principal components of the intrinsic metrics to predict student performance. The PCs predict extrinsic student performance on multilingual benchmarks (R² = 0.664, RMSE = 0.440).

Figure 5. Effect of synthetic data scale on student model performance. Performance improves with more synthetic data, but gains diminish beyond 10k examples; the authors use 10k synthetic examples per language when computing PG-Score.

Figure 7. Effect of weighting intrinsic and extrinsic metrics in PG-Score. Model rankings remain relatively stable across neighboring weightings.

Figure 8. Relationship between a language's percentage in CommonCrawl and PG-Score. A suggestive positive trend (ρ = 0.886, p < 0.05) across the six languages tested.

Figure 9. Student model performance on a held-out language (Tagalog) across several synthetic data interventions, evaluated on FilBench.

Figure 10. Prompt template for the Generate data-generation method.

Figure 11. Prompt template for the Translate data-generation method.

Figure 12. Prompt template for the Respond data-generation method.

Figure 13. Text quality of synthesized texts evaluated with the multilingual rubric model M-Prometheus.
Original abstract

Synthesizing supervised finetuning (SFT) data from language models (LMs) to teach smaller models multilingual tasks has become increasingly common. However, teacher model selection is often ad hoc, typically defaulting to the largest available option, even though such models may have significant capability gaps in non-English languages. This practice can result in poor-quality synthetic data and suboptimal student downstream performance. In this work, we systematically characterize what makes an effective multilingual teacher. We measure intrinsic measures of data quality with extrinsic student model performance in a metric we call Polyglot Score; evaluating 10 LMs across 6 typologically diverse languages, generating over 1.4M SFT examples and training 240 student models. Among the models tested, Gemma 3 27B and Aya Expanse 32B emerge as consistently effective teachers across different student base model families. Further analyses reveal that model scale alone does not significantly predict teacher effectiveness; instead, data qualities such as prompt diversity, length, and response fluency capture over 93.3% of variance in intrinsic data quality and predict student performance. Finally, we provide practical recommendations, including matching the model families of teacher-student pairs and translating from or responding to existing prompts, which can yield improvements for less-resourced languages. We hope that our work advances data-centric research in multilingual synthetic data and LM development.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that by evaluating 10 LMs as teachers for multilingual SFT data generation across 6 typologically diverse languages (producing 1.4M examples and training 240 student models), Gemma 3 27B and Aya Expanse 32B emerge as consistently effective teachers independent of student base model family. Model scale alone does not predict effectiveness; instead, intrinsic data qualities (prompt diversity, length, response fluency) explain over 93.3% of variance in the introduced Polyglot Score and predict student performance. Practical recommendations include matching teacher-student model families and using translation from existing prompts.

Significance. If the central empirical findings hold under further validation, this work is significant for multilingual NLP and data-centric LM research. The scale of the experiments provides a rare systematic comparison that challenges the default to largest models for synthetic data and links specific data qualities to downstream gains. The Polyglot Score and recommendations could offer a practical framework for improving synthetic data quality in low-resource settings, advancing beyond ad-hoc teacher selection.

major comments (2)
  1. [Further analyses] Further analyses section: the claim that prompt diversity, length, and response fluency capture over 93.3% of variance in intrinsic data quality and predict student performance rests on regression performed on the same 6-language, 10-model, 240-student dataset without reported held-out languages, cross-validation, or out-of-sample testing; this risks the result being an artifact of the specific typological sample and SFT format rather than a general property of teacher effectiveness.
  2. [Polyglot Score] Polyglot Score definition and validation: the metric aggregates intrinsic qualities and is central to all model rankings and variance claims, yet the manuscript provides no external validation on additional languages, tasks, or real-world downstream applications beyond the experimental pipeline, leaving open whether the 93.3% figure and the superiority of Gemma 3 27B / Aya Expanse 32B generalize.
minor comments (2)
  1. [Abstract] Abstract: the phrasing 'capture over 93.3% of variance' should be accompanied in the main text by the precise regression specification (e.g., adjusted R², number of predictors, multicollinearity checks) and any statistical significance or error estimates.
  2. [Experimental setup] The manuscript should clarify how the 240 student trainings were distributed across the 10 teachers and 6 languages to allow assessment of balance and potential confounding in the effectiveness comparisons.
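The regression setup at issue, principal components of the intrinsic metrics predicting student performance as in the paper's Figure 4, can be sketched on toy data. The values below are synthetic stand-ins, not the paper's measurements; the point is the pipeline (center, SVD-based PCA, ordinary least squares, R²), not the numbers.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the paper's setup: rows are teacher/language runs,
# columns are intrinsic metrics (diversity, length, fluency); the target
# is student benchmark performance. All values here are synthetic.
X = rng.normal(size=(60, 3))
y = 0.6 * X[:, 0] + 0.3 * X[:, 2] + 0.1 * rng.normal(size=60)

# PCA via SVD on centered features.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
pcs = Xc @ Vt.T  # principal-component scores

# Ordinary least squares on the PCs plus an intercept.
A = np.column_stack([pcs, np.ones(len(y))])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
pred = A @ coef
r2 = 1.0 - np.sum((y - pred) ** 2) / np.sum((y - y.mean()) ** 2)
```

Because this fits and scores on the same sample, a high R² here says nothing about generalization, which is exactly the referee's first major objection.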

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful and constructive review of our manuscript. We appreciate the emphasis on the need for stronger validation of our regression analyses and the Polyglot Score. We address each major comment below, indicating where we will revise the manuscript to incorporate the feedback while preserving the integrity of our empirical findings.

Point-by-point responses
  1. Referee: Further analyses section: the claim that prompt diversity, length, and response fluency capture over 93.3% of variance in intrinsic data quality and predict student performance rests on regression performed on the same 6-language, 10-model, 240-student dataset without reported held-out languages, cross-validation, or out-of-sample testing; this risks the result being an artifact of the specific typological sample and SFT format rather than a general property of teacher effectiveness.

    Authors: We acknowledge that the reported regression was fit on the full dataset without explicit cross-validation or held-out evaluation. However, the underlying data already spans 6 typologically diverse languages and 10 teacher models with 240 downstream student evaluations, which provides substantial variation for observing the relationships. To directly address the concern, we will add leave-one-language-out cross-validation in the revised manuscript: the regression will be retrained on five languages and evaluated on the held-out language, with results reported for each fold. This will quantify whether the >93% variance explained and the predictive power for student performance hold across different language subsets rather than being an artifact of the full sample. We will also report the exact regression specification, including the number of observations and the three predictors, to improve transparency. revision: yes

  2. Referee: Polyglot Score definition and validation: the metric aggregates intrinsic qualities and is central to all model rankings and variance claims, yet the manuscript provides no external validation on additional languages, tasks, or real-world downstream applications beyond the experimental pipeline, leaving open whether the 93.3% figure and the superiority of Gemma 3 27B / Aya Expanse 32B generalize.

    Authors: The Polyglot Score is validated internally through its strong correlation with extrinsic student performance across 240 models and consistent teacher rankings that hold across multiple student base-model families. We agree that external validation on entirely new languages, tasks, or real-world applications would provide further evidence of generalizability. In the revision we will expand the discussion section to explicitly state this limitation and frame broader validation as important future work. We will also include additional details on the score's construction and its per-language correlations with downstream metrics. Because new large-scale experiments on unseen languages fall outside the scope of the current revision, we view these clarifications and the planned cross-validation as the appropriate response while preserving the scale and internal consistency of the existing results. revision: partial
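The leave-one-language-out check the authors promise in response 1 can be sketched as follows. The language codes and the toy data are illustrative assumptions, not the paper's actual language set or measurements.

```python
import numpy as np

def leave_one_language_out_r2(X, y, langs):
    """Fit OLS (with intercept) on all but one language, score R² on the
    held-out language; repeat for each language. Sketch of the authors'
    proposed cross-validation, not their implementation."""
    scores = {}
    for held_out in set(langs):
        train = np.array([l != held_out for l in langs])
        A = np.column_stack([X, np.ones(len(y))])
        coef, *_ = np.linalg.lstsq(A[train], y[train], rcond=None)
        pred = A[~train] @ coef
        ss_res = np.sum((y[~train] - pred) ** 2)
        ss_tot = np.sum((y[~train] - y[~train].mean()) ** 2)
        scores[held_out] = 1.0 - ss_res / ss_tot
    return scores

# Hypothetical six-language setup with ten runs per language.
rng = np.random.default_rng(1)
langs = [l for l in ["ar", "de", "id", "ja", "sw", "tl"] for _ in range(10)]
X = rng.normal(size=(60, 3))
y = 0.6 * X[:, 0] + 0.3 * X[:, 2] + 0.1 * rng.normal(size=60)
fold_r2 = leave_one_language_out_r2(X, y, langs)
```

Reporting the per-fold R² values, rather than a single full-sample figure, is what would address the referee's artifact concern.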

Circularity Check

0 steps flagged

No circularity; purely empirical measurements and regression on generated data

Full rationale

The paper performs a large-scale experimental evaluation: 10 teacher LMs generate 1.4M SFT examples in 6 languages, 240 student models are trained, intrinsic data qualities are measured, and a composite Polyglot Score is defined to correlate them with extrinsic performance. The reported regression (data qualities capturing 93.3% variance) is an analysis of the collected observations rather than a derivation that reduces to its inputs by construction. No equations, self-citations, uniqueness theorems, or ansatzes are invoked; all claims rest on direct experimental outcomes from the described pipeline. This is self-contained empirical work with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claims rest on empirical measurements and a newly defined metric rather than mathematical derivations; assumptions include representativeness of the language set and fairness of student training protocols.

axioms (2)
  • domain assumption The six typologically diverse languages chosen are representative for evaluating multilingual teacher effectiveness.
    Paper evaluates across 6 typologically diverse languages as stated in abstract.
  • domain assumption Student model training protocols are standardized and comparable across all teacher conditions.
    240 student models trained as part of the evaluation.
invented entities (1)
  • Polyglot Score · no independent evidence
    purpose: Composite metric evaluating teacher effectiveness via intrinsic data quality and extrinsic student performance.
    New metric introduced to characterize effective multilingual teachers.

pith-pipeline@v0.9.0 · 5549 in / 1385 out tokens · 59802 ms · 2026-05-10T15:27:19.978549+00:00 · methodology

discussion (0)


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Multilinguality at the Edge: Developing Language Models for the Global South

    cs.CL · 2026-04 · unverdicted · novelty 5.0

    A survey of 232 papers on the intersection of multilingual language modeling and edge deployment identifies the 'last mile' challenge for Global South communities and offers recommendations for more inclusive NLP.

Reference graph

Works this paper leans on

27 extracted references · 5 canonical work pages · cited by 1 Pith paper · 1 internal anchor
