Is a Document Educational or Just Wikipedia-Style? -- Pitfalls of Classifier-Based Quality Filtering
Pith reviewed 2026-05-25 06:10 UTC · model grok-4.3
The pith
A simple Wikipedia-style reformatting operation can reverse a quality classifier's filtering decision on roughly 7 percent of documents.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Our analysis reveals that the FineWeb-Edu CQF model would reverse its filtering decision for approximately 7% of evaluated documents, thereby admitting content into the pre-training corpus that would otherwise have been excluded.
What carries the argument
The quality classifier's sensitivity to superficial stylistic reformatting that mimics Wikipedia presentation.
If this is right
- Low-quality documents can pass into pre-training data when presented in a Wikipedia-like format.
- Single-model classifier filters may need additional checks that ignore surface presentation.
- Heuristic-based filtering could remain useful as a complement to avoid format-dependent errors.
- Corpus construction pipelines become more brittle if they rely solely on one classifier without format-robustness tests.
Where Pith is reading between the lines
- Filters may be learning stylistic cues that are easy to manipulate rather than deeper indicators of educational value.
- Similar format-based attacks could affect other classifier-driven curation steps beyond quality filtering.
- Testing reformatting robustness should become a standard validation step for any new quality model.
Load-bearing premise
The quality classifier's decisions depend primarily on substantive content properties rather than superficial document formatting and stylistic presentation.
What would settle it
Run the classifier on the same documents before and after the Wikipedia-style reformatting and check whether the flipped decisions correlate with independent measures of actual content quality.
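A minimal sketch of the first half of that check, the before/after comparison, is given below. Everything named in it is an illustrative assumption rather than the paper's setup: `wikipedia_style` is a hypothetical stand-in for the reformatting operation, `toy_score` is a deliberately format-sensitive toy scorer (not the FineWeb-Edu classifier), and the threshold of 3.0 is chosen only for the example.

```python
from typing import Callable, Dict, Sequence


def wikipedia_style(text: str, title: str = "Untitled") -> str:
    """Hypothetical reformatting: same paragraphs, wrapped in wiki-like headings."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    body = "\n\n".join(paragraphs)
    return f"= {title} =\n\n== Overview ==\n\n{body}\n\n== References ==\n"


def toy_score(text: str) -> float:
    """Toy scorer, NOT the FineWeb-Edu model: rewards length and heading lines."""
    n_words = len(text.split())
    n_headings = sum(line.startswith("=") for line in text.splitlines())
    return min(5.0, 0.01 * n_words + 0.5 * n_headings)


def flip_report(
    docs: Sequence[str],
    score: Callable[[str], float],
    reformat: Callable[[str], str],
    threshold: float,
) -> Dict[str, float]:
    """Count filtering decisions that change when a document is reformatted."""
    newly_admitted = 0  # rejected before, admitted after
    newly_rejected = 0  # admitted before, rejected after
    for doc in docs:
        before = score(doc) >= threshold
        after = score(reformat(doc)) >= threshold
        if after and not before:
            newly_admitted += 1
        elif before and not after:
            newly_rejected += 1
    n = len(docs)
    flips = newly_admitted + newly_rejected
    return {
        "n": n,
        "newly_admitted": newly_admitted,
        "newly_rejected": newly_rejected,
        "flip_rate": flips / n if n else 0.0,
    }


# Synthetic documents of 10, 100, 200 and 400 words (placeholder content).
docs = [" ".join(["word"] * k) for k in (10, 100, 200, 400)]
result = flip_report(docs, toy_score, wikipedia_style, threshold=3.0)
print(result)
```

The second half of the check, whether the newly admitted documents are in fact low quality, needs an independent label per document and is not covered by this sketch.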
Original abstract
Classifier-based Quality Filtering has recently emerged as a fundamental technique in constructing pre-training corpora. The ability to deploy a single model that can replace or supplement a set of heuristics has proven effective across numerous Large Language Models. In this work, we expose a critical vulnerability in this approach by demonstrating how a straightforward Wikipedia-style reformatting operation can substantially alter a model's quality assessment and enable low-quality content to surpass filtering thresholds. Our analysis reveals that the FineWeb-Edu CQF model would reverse its filtering decision for approximately 7% of evaluated documents, thereby admitting content into the pre-training corpus that would otherwise have been excluded.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that classifier-based quality filtering (CQF) for pre-training corpora is vulnerable to superficial changes: a straightforward Wikipedia-style reformatting operation alters the FineWeb-Edu CQF model's quality assessment, reversing its filtering decision for approximately 7% of evaluated documents and thereby admitting content that would otherwise have been excluded.
Significance. If substantiated with rigorous controls, the result would be significant for the field because CQF has become a core technique for curating large-scale pre-training data (replacing or supplementing heuristics). Demonstrating sensitivity to formatting rather than content would imply that current CQF pipelines risk systematic inclusion of low-quality material and would motivate development of more robust, content-focused quality metrics.
major comments (2)
- [Abstract] The 7% reversal claim is presented without any description of the dataset size, sampling method, exact reformatting procedure, statistical controls, or confidence intervals, so the empirical support for the central observation cannot be evaluated.
- [Abstract] The claim that reformatting admits low-quality content requires that the operation affects only superficial formatting and style while leaving substantive educational merit unchanged, yet the manuscript supplies no independent quality metric (human ratings, comparison to real educational documents, or content analysis) to test this assumption.
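To illustrate why the first comment asks for the sample size and confidence intervals, the sketch below computes a Wilson score interval for an observed 7% flip rate at several sample sizes. The sample sizes are hypothetical, since the abstract does not report one, and the Wilson interval is one common choice for a binomial proportion, not necessarily the method used in the manuscript.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must lie between 0 and n")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


# Hypothetical sample sizes; each has exactly 7% flipped decisions.
for n in (100, 1_000, 10_000):
    lo, hi = wilson_interval(successes=(7 * n) // 100, n=n)
    print(f"n={n:>6}: 7% flip rate, 95% interval [{lo:.3f}, {hi:.3f}]")
```

The interval narrows as the sample grows, which is why the same headline figure of 7% is much stronger evidence on ten thousand documents than on one hundred.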
Simulated Author's Rebuttal
We thank the referee for these constructive comments on the abstract. We address each point below and will revise the manuscript accordingly to strengthen the presentation of our empirical results.
- Referee: [Abstract] The 7% reversal claim is presented without any description of the dataset size, sampling method, exact reformatting procedure, statistical controls, or confidence intervals, so the empirical support for the central observation cannot be evaluated.
  Authors: We agree that the abstract would benefit from additional context on the evaluation setup. The full manuscript describes the dataset (a large random sample drawn from FineWeb), the sampling procedure, the precise Wikipedia-style reformatting steps, and the statistical analysis with confidence intervals. We will revise the abstract to briefly state the evaluation scale and direct readers to the relevant methods section for the remaining details.
  Revision: partial
- Referee: [Abstract] The claim that reformatting admits low-quality content requires that the operation affects only superficial formatting and style while leaving substantive educational merit unchanged, yet the manuscript supplies no independent quality metric (human ratings, comparison to real educational documents, or content analysis) to test this assumption.
  Authors: The reformatting operation is constructed to alter only document structure and presentation while leaving the underlying text unchanged; the manuscript includes illustrative examples of this preservation. We acknowledge that the current version does not include a separate human rating study or direct comparison against independently labeled educational documents to quantify unchanged merit. We will add a qualitative content analysis of reversed documents in the revised manuscript to better support the claim that substantive content is unaffected.
  Revision: yes
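The authors' statement that the underlying text is left unchanged can at least be checked lexically. The sketch below is one hedged way to do so: it tests whether the original word sequence survives, in order, inside the reformatted document and reports which words were added. The function names and the example strings are hypothetical, and the check says nothing about educational merit, only about whether wording was altered.

```python
import re
from collections import Counter
from typing import Dict, List


def words(text: str) -> List[str]:
    """Lowercased word tokens; markup symbols such as '=' are ignored."""
    return re.findall(r"\w+", text.lower())


def preservation_report(original: str, reformatted: str) -> Dict[str, object]:
    """Check that every original word still appears, in order, after reformatting."""
    src, dst = words(original), words(reformatted)
    remaining = iter(dst)
    # `w in remaining` consumes the iterator, so this is an in-order subsequence test.
    preserved = all(w in remaining for w in src)
    added = Counter(dst) - Counter(src)  # words introduced by the reformatting
    return {
        "preserved_in_order": preserved,
        "added_word_count": sum(added.values()),
        "added_vocabulary": sorted(added),
    }


original = "Cats sleep a lot. They also purr."
reformatted = "== Overview ==\n\nCats sleep a lot.\n\n== Details ==\n\nThey also purr."
tampered = "== Overview ==\n\nCats sleep rarely."

ok_report = preservation_report(original, reformatted)
bad_report = preservation_report(original, tampered)
print(ok_report)
print(bad_report)
```

A reformatting that passes this check has only added material around the source text; judging whether the added headings change perceived quality still requires the independent ratings the referee asks for.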
Circularity Check
No circularity: empirical observation of classifier sensitivity
The paper reports an empirical experiment: applying a Wikipedia-style reformatting operation to documents and measuring the rate at which the FineWeb-Edu CQF model reverses its filtering decision (approximately 7%). No derivation, prediction, fitted parameter, or self-citation chain is present that reduces the central claim to its own inputs by construction. The result is a direct measurement of model behavior under a described transformation and does not rely on any of the enumerated circular patterns.