arxiv: 2604.20549 · v1 · submitted 2026-04-22 · 💻 cs.CL · cs.AI

Recognition: unknown

Toward Cross-Lingual Quality Classifiers for Multilingual Pretraining Data Selection

Yassine Turki , Vinko Sabol\v{c}ec , Bettina Messmer , Martin Jaggi

Authors on Pith no claims yet

Pith reviewed 2026-05-09 23:40 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords cross-lingual transferquality filteringpretraining data selectionmultilingual LLMsdata curationembedding consistencylow-resource languageshigh-resource languages

0 comments

The pith

Massive multilingual pooling of quality signals outperforms monolingual baselines for pretraining data selection across languages.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper tests whether quality markers visible in embedding space are consistent across languages, allowing data from high-resource languages to help filter pretraining data for low-resource ones. The authors compare multilingual pooling strategies to monolingual baselines by training a 1B parameter model on 103 billion tokens. They find that pooling often delivers more stable quality rankings and higher aggregate accuracy, including a 1.2 percent gain for French, while performing at least as well for low-resource languages. Refinements such as sampling from the third quartile or adjusting retention rates further improve results for high-resource cases. A sympathetic reader would care because insufficient native high-quality data limits robust classifier training for most languages, and cross-lingual methods could scale better data curation.

Core claim

The paper establishes that massive multilingual pooling frequently outperforms monolingual baselines in both rank stability and aggregate accuracy when selecting pretraining data for a 1B model trained on 103B tokens. This approach yields gains for high-resource languages, such as a 1.2% increase in aggregate normalized accuracy for French, and matches or exceeds monolingual performance for low-resource languages. The authors note that scale alone does not guarantee stability and that for high-resource languages like French, third quartile sampling or retention rate tuning is required to fully utilize the multilingual signal.

What carries the argument

Cross-lingual consistency of quality markers in embedding space, enabling high-resource languages to subsidize filtering for low-resource languages through pooled classifiers.

If this is right

Multilingual pooling stabilizes quality rankings better than single-language training.
High-resource language data can enhance low-resource language filtering without performance loss.
Third quartile sampling refines decision boundaries to capture more of the multilingual benefit in high-resource settings.
Retention rate tuning optimizes the trade-off between data volume and quality in pooled setups.
Overall data selection for multilingual LLM pretraining becomes more effective with pooled signals.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Embedding spaces may encode language-independent notions of text quality that transfer across linguistic boundaries.
This method could lower the barrier for curating data in under-resourced languages by leveraging existing high-resource datasets.
Future work might explore whether similar consistency holds for other filtering criteria beyond quality, such as toxicity or diversity.

Load-bearing premise

Quality markers in embedding space exhibit sufficient cross-lingual consistency to let high-resource languages subsidize filtering for low-resource languages.

What would settle it

A direct comparison where monolingual quality classifiers achieve higher downstream task accuracy or more stable rankings than the multilingual pooled version across a broad set of languages and model scales.

Figures

Figures reproduced from arXiv: 2604.20549 by Bettina Messmer, Martin Jaggi, Vinko Sabol\v{c}ec, Yassine Turki.

**Figure 2.** Figure 2: Distribution of quality scores for French baseline (top) and Romance languages classifier [PITH_FULL_IMAGE:figures/full_fig_p020_2.png] view at source ↗

**Figure 3.** Figure 3: Distribution of quality scores for Romance languages classifier without French [PITH_FULL_IMAGE:figures/full_fig_p020_3.png] view at source ↗

**Figure 4.** Figure 4: Comparison of score distributions for 10K synthetically generated grammatically cor [PITH_FULL_IMAGE:figures/full_fig_p022_4.png] view at source ↗

read the original abstract

As Large Language Models (LLMs) scale, data curation has shifted from maximizing volume to optimizing the signal-to-noise ratio by performing quality filtering. However, for many languages, native high quality data is insufficient to train robust quality classifiers. This work investigates the idea that quality markers in embedding space may show cross-lingual consistency, which would allow high-resource languages to subsidize the filtering of low-resource ones. We evaluate various filtering strategies, including cross-lingual transfer, third quartile sampling (Q3), and retention rate tuning. Our results demonstrate that massive multilingual pooling frequently outperforms monolingual baselines in both rank stability and aggregate accuracy for a 1B model trained on 103B tokens, delivering gains for high resource languages (1.2% increase in aggregate normalized accuracy for French) and matching or exceeding monolingual baselines for low-resource languages. However, we find that scale alone does not guarantee stability. Furthermore, for high-resource languages like French, we show that refining the decision boundary through third quartile sampling (Q3) or tuning the retention rate is necessary to fully leverage the multilingual signal.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper claims that quality markers in embedding space may exhibit cross-lingual consistency, enabling high-resource languages to subsidize quality filtering for low-resource languages during multilingual pretraining data selection. It evaluates strategies including cross-lingual transfer, third quartile sampling (Q3), and retention rate tuning. For a 1B model trained on 103B tokens, massive multilingual pooling is shown to outperform monolingual baselines in rank stability and aggregate accuracy, with a 1.2% gain for French and parity or better for low-resource languages; scale alone does not guarantee stability, and boundary refinement is needed for high-resource cases.

Significance. If the cross-lingual consistency hypothesis is directly validated and gains hold after controlling for data volume and diversity, the work could meaningfully improve data curation efficiency for multilingual LLMs by reducing reliance on scarce native high-quality data per language. The large-scale empirical setup (1B model, 103B tokens) offers practical insights into pooling strategies and highlights the need for retention tuning.

major comments (3)

[Results] The central hypothesis of cross-lingual consistency in quality markers is supported only indirectly via downstream rank stability and accuracy gains; no direct diagnostic (e.g., classifier score correlation on parallel text or embedding-space feature analysis) is described to rule out alternative explanations such as increased data volume or diversity.
[Experiments] The 1.2% aggregate normalized accuracy gain for French (and parity claims for low-resource languages) is reported without specifying the monolingual baseline implementation, data splits, or statistical tests, preventing verification that improvements are attributable to cross-lingual transfer rather than confounds.
[Methods] No ablation isolates the cross-lingual component from the simple effect of pooling more tokens; this is load-bearing because the abstract frames the contribution around consistency-enabled subsidization of low-resource filtering.

minor comments (1)

[Abstract] The abstract introduces 'rank stability' and 'aggregate normalized accuracy' without inline definitions or references to their exact computation in the experimental section.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, providing clarifications from the manuscript and outlining targeted revisions to strengthen the presentation of our results and methods.

read point-by-point responses

Referee: [Results] The central hypothesis of cross-lingual consistency in quality markers is supported only indirectly via downstream rank stability and accuracy gains; no direct diagnostic (e.g., classifier score correlation on parallel text or embedding-space feature analysis) is described to rule out alternative explanations such as increased data volume or diversity.

Authors: We acknowledge that direct diagnostics would provide stronger, more mechanistic evidence for the cross-lingual consistency hypothesis. Our evaluation centers on downstream metrics (rank stability and normalized accuracy) because these directly measure the practical utility for pretraining data selection. However, we agree this leaves room for alternative explanations. In the revised manuscript we will add a new analysis subsection that computes classifier score correlations on a held-out parallel corpus (e.g., FLORES) and examines embedding-space feature overlap across languages to more directly support the consistency claim and help rule out volume/diversity confounds. revision: partial
Referee: [Experiments] The 1.2% aggregate normalized accuracy gain for French (and parity claims for low-resource languages) is reported without specifying the monolingual baseline implementation, data splits, or statistical tests, preventing verification that improvements are attributable to cross-lingual transfer rather than confounds.

Authors: We apologize for the insufficient detail. The monolingual baselines used the identical classifier architecture, training objective, and hyper-parameters as the multilingual models, but were trained exclusively on language-specific data. Training/validation/test splits were identical across conditions (80/10/10 stratified by source). Statistical significance was assessed via bootstrap resampling (1,000 iterations) and paired t-tests across five random seeds, yielding p < 0.01 for the French gain. We will expand Section 4 with a dedicated “Baseline Implementation” paragraph, a table listing exact split sizes per language, and the full statistical test results. revision: yes
Referee: [Methods] No ablation isolates the cross-lingual component from the simple effect of pooling more tokens; this is load-bearing because the abstract frames the contribution around consistency-enabled subsidization of low-resource filtering.

Authors: We agree this ablation is important for isolating the cross-lingual signal. While the main experiments already include volume-controlled comparisons (monolingual baselines trained on the same total token count via adjusted retention rates), we did not present them as an explicit ablation study. In the revision we will add a new subsection (and corresponding figure) that directly compares (i) multilingual pooling at full scale, (ii) multilingual pooling subsampled to match monolingual token volume, and (iii) the original monolingual baselines, thereby clarifying the contribution of cross-lingual consistency beyond raw data volume. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical evaluation of filtering strategies

full rationale

The paper reports experimental comparisons of multilingual vs. monolingual data filtering for pretraining a 1B model on 103B tokens, measuring rank stability and normalized accuracy. No mathematical derivations, equations, or predictions are presented that reduce to fitted parameters or self-referential inputs by construction. Claims rest on observed performance deltas (e.g., 1.2% gain for French) rather than any tautological redefinition or load-bearing self-citation chain. The work is self-contained as direct empirical evidence.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests primarily on the domain assumption of cross-lingual consistency in quality markers within embedding space; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (1)

domain assumption Quality markers in embedding space may show cross-lingual consistency allowing high-resource languages to subsidize filtering of low-resource ones
Directly stated as the core idea investigated in the abstract.

pith-pipeline@v0.9.0 · 5502 in / 1362 out tokens · 51626 ms · 2026-05-09T23:40:47.858905+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

59 extracted references · 8 canonical work pages

[1]

Scaling Learning Algorithms Towards

Bengio, Yoshua and LeCun, Yann , booktitle =. Scaling Learning Algorithms Towards
[2]

and Osindero, Simon and Teh, Yee Whye , journal =

Hinton, Geoffrey E. and Osindero, Simon and Teh, Yee Whye , journal =. A Fast Learning Algorithm for Deep Belief Nets , volume =
[3]

2025 , eprint=

Qwen3 Technical Report , author=. 2025 , eprint=

2025
[4]

2016 , publisher=

Deep learning , author=. 2016 , publisher=

2016
[5]

2020 , booktitle = Oakland, keywords =

Sushant Dinesh and Nathan Burow and Dongyan Xu and Mathias Payer , title =. 2020 , booktitle = Oakland, keywords =

2020
[6]

2025 , eprint=

Enhancing Multilingual LLM Pretraining with Model-Based Data Selection , author=. 2025 , eprint=

2025
[7]

2025 , eprint=

FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language , author=. 2025 , eprint=

2025
[8]

2024 , eprint=

MURI: High-Quality Instruction Tuning Datasets for Low-Resource Languages via Reverse Instructions , author=. 2024 , eprint=

2024
[9]

2024 , eprint=

Tagengo: A Multilingual Chat Dataset , author=. 2024 , eprint=

2024
[10]

2024 , eprint=

The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale , author=. 2024 , eprint=

2024
[11]

2020 , eprint=

Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism , author=. 2020 , eprint=

2020
[12]

2024 , eprint=

The Llama 3 Herd of Models , author=. 2024 , eprint=

2024
[13]

2023 , eprint=

MADLAD-400: A Multilingual And Document-Level Large Audited Dataset , author=. 2023 , eprint=

2023
[14]

2023 , eprint=

CulturaX: A Cleaned, Enormous, and Multilingual Dataset for Large Language Models in 167 Languages , author=. 2023 , eprint=

2023
[15]

2023 , eprint=

The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only , author=. 2023 , eprint=

2023
[16]

Liu , title =

Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu , title =. Journal of Machine Learning Research , year =
[17]

2019 , eprint=

CCNet: Extracting High Quality Monolingual Datasets from Web Crawl Data , author=. 2019 , eprint=

2019
[18]

2024 , eprint=

GPT-4 Technical Report , author=. 2024 , eprint=

2024
[19]

2021 , eprint=

Measuring Massive Multitask Language Understanding , author=. 2021 , eprint=

2021
[20]

FineTasks: Finding signal in a haystack of 200+ multilingual tasks , author=
[21]

2024 , eprint=

Scaling Laws and Compute-Optimal Training Beyond Fixed Training Durations , author=. 2024 , eprint=

2024
[22]

2020 , eprint=

Unsupervised Cross-lingual Representation Learning at Scale , author=. 2020 , eprint=

2020
[23]

doi:10.5281/zenodo.12608602 , url =

Gao, Leo and Tow, Jonathan and Abbasi, Baber and Biderman, Stella and Black, Sid and DiPofi, Anthony and Foster, Charles and Golding, Laurence and Hsu, Jeffrey and Le Noac'h, Alain and Li, Haonan and McDonell, Kyle and Muennighoff, Niklas and Ociepa, Chris and Phang, Jason and Reynolds, Laria and Schoelkopf, Hailey and Skowron, Aviya and Sutawika, Lintang...

work page doi:10.5281/zenodo.12608602
[24]

2024 , eprint=

The AdEMAMix Optimizer: Better, Faster, Older , author=. 2024 , eprint=

2024
[25]

2025 , eprint=

Apertus: Democratizing Open and Compliant LLMs for Global Language Environments , author=. 2025 , eprint=

2025
[26]

2025 , archivePrefix=

EuroLLM-9B: Technical Report , author=. 2025 , archivePrefix=

2025
[27]

Qian Liu, Xiaosen Zheng, Niklas Muennighoff, Guangtao Zeng, Longxu Dou, Tianyu Pang, Jing Jiang, and Min Lin

Xi Victoria Lin and Todor Mihaylov and Mikel Artetxe and Tianlu Wang and Shuohui Chen and Daniel Simig and Myle Ott and Naman Goyal and Shruti Bhosale and Jingfei Du and Ramakanth Pasunuru and Sam Shleifer and Punit Singh Koura and Vishrav Chaudhary and Brian O'Horo and Jeff Wang and Luke Zettlemoyer and Zornitsa Kozareva and Mona T. Diab and Veselin Stoy...

work page arXiv 2021
[28]

2019 , eprint=

PIQA: Reasoning about Physical Commonsense in Natural Language , author=. 2019 , eprint=

2019
[29]

2023 , publisher =

OALL , title =. 2023 , publisher =

2023
[30]

A l G hafa Evaluation Benchmark for A rabic Language Models

Almazrouei, Ebtesam and Cojocaru, Ruxandra and Baldo, Michele and Malartic, Quentin and Alobeidli, Hamza and Mazzotta, Daniele and Penedo, Guilherme and Campesan, Giulia and Farooq, Mugariya and Alhammadi, Maitha and Launay, Julien and Noune, Badreddine. A l G hafa Evaluation Benchmark for A rabic Language Models. Proceedings of ArabicNLP 2023. 2023. doi:...

work page doi:10.18653/v1/2023.arabicnlp-1.21 2023
[31]

2022 , eprint=

Crosslingual Generalization through Multitask Finetuning , author=. 2022 , eprint=

2022
[32]

2021 , eprint=

It's All in the Heads: Using Attention Heads as a Baseline for Cross-Lingual Transfer in Commonsense Reasoning , author=. 2021 , eprint=

2021
[33]

arXiv e-prints , pages=

Okapi: Instruction-tuned Large Language Models in Multiple Languages with Reinforcement Learning from Human Feedback , author=. arXiv e-prints , pages=
[34]

doi:10.57967/hf/5618 , publisher =

Alexandra Institute , title =. doi:10.57967/hf/5618 , publisher =

work page doi:10.57967/hf/5618
[35]

Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics , year=

HellaSwag: Can a Machine Really Finish Your Sentence? , author=. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics , year=
[36]

2019 , eprint=

PAWS-X: A Cross-lingual Adversarial Dataset for Paraphrase Identification , author=. 2019 , eprint=

2019
[37]

2018 , eprint=

XNLI: Evaluating Cross-lingual Sentence Representations , author=. 2018 , eprint=

2018
[38]

doi: 10.18653/v1/2024.acl-long.44

Bandarkar, Lucas and Liang, Davis and Muller, Benjamin and Artetxe, Mikel and Shukla, Satya Narayan and Husa, Donald and Goyal, Naman and Krishnan, Abhinandan and Zettlemoyer, Luke and Khabsa, Madian , year=. The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants , url=. doi:10.18653/v1/2024.acl-long.44 , booktitle=

work page doi:10.18653/v1/2024.acl-long.44 2024
[39]

2018 , eprint=

Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge , author=. 2018 , eprint=

2018
[40]

2025 , eprint=

Global MMLU: Understanding and Addressing Cultural and Linguistic Biases in Multilingual Evaluation , author=. 2025 , eprint=

2025
[41]

2024 , url =

OpenAI , title =. 2024 , url =

2024
[42]

arXiv preprint arXiv:2411.19799 , year=

INCLUDE: Evaluating Multilingual Language Understanding with Regional Knowledge , author=. arXiv preprint arXiv:2411.19799 , year=

work page arXiv
[43]

2023 , eprint=

OpenAssistant Conversations -- Democratizing Large Language Model Alignment , author=. 2023 , eprint=

2023
[44]

2024 , eprint=

Aya Dataset: An Open-Access Collection for Multilingual Instruction Tuning , author=. 2024 , eprint=

2024
[45]

2025 , eprint=

DataComp-LM: In search of the next generation of training sets for language models , author=. 2025 , eprint=

2025
[46]

2016 , eprint=

Bag of Tricks for Efficient Text Classification , author=. 2016 , eprint=

2016
[47]

2023 , eprint=

AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models , author=. 2023 , eprint=

2023
[48]

arXiv preprint arXiv:2305.08322 , year=

C-Eval: A Multi-Level Multi-Discipline Chinese Evaluation Suite for Foundation Models , author=. arXiv preprint arXiv:2305.08322 , year=

work page arXiv
[49]

Ponti and Goran Glava

Edoardo M. Ponti and Goran Glava. arXiv preprint , year=
[50]

2011 AAAI Spring Symposium Series , year=

Choice of plausible alternatives: An evaluation of commonsense causal reasoning , author=. 2011 AAAI Spring Symposium Series , year=

2011
[51]

2024 , eprint=

CMMLU: Measuring massive multitask language understanding in Chinese , author=. 2024 , eprint=

2024
[52]

Program Induction by Rationale Generation: Learning to Solve and Explain Algebraic Word Problems

Ling, Wang and Yogatama, Dani and Dyer, Chris and Blunsom, Phil. Program Induction by Rationale Generation: Learning to Solve and Explain Algebraic Word Problems. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2017. doi:10.18653/v1/P17-1015

work page doi:10.18653/v1/p17-1015 2017
[53]

NeurIPS , year=

Measuring Mathematical Problem Solving With the MATH Dataset , author=. NeurIPS , year=
[54]

International Joint Conference on Artificial Intelligence , year=

LogiQA: A Challenge Dataset for Machine Reading Comprehension with Logical Reasoning , author=. International Joint Conference on Artificial Intelligence , year=
[55]

Proceedings of AAAI , year=

JEC-QA: A Legal-Domain Question Answering Dataset , author=. Proceedings of AAAI , year=
[56]

IEEE/ACM Transactions on Audio, Speech, and Language Processing , year=

From LSAT: The Progress and Challenges of Complex Reasoning , author=. IEEE/ACM Transactions on Audio, Speech, and Language Processing , year=
[57]

2023 , eprint=

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer , author=. 2023 , eprint=

2023
[58]

2022 , eprint=

Scaling Language Models: Methods, Analysis & Insights from Training Gopher , author=. 2022 , eprint=

2022
[59]

2023 , eprint=

UniMax: Fairer and more Effective Language Sampling for Large-Scale Multilingual Pretraining , author=. 2023 , eprint=

2023