Is a Document Educational or Just Wikipedia-Style? -- Pitfalls of Classifier-Based Quality Filtering

Mateusz Klimaszewski; Piotr Andruszkiewicz

arxiv: 2605.23721 · v1 · pith:VNMF227Snew · submitted 2026-05-21 · 💻 cs.CL

Pith reviewed 2026-05-25 06:10 UTC · model grok-4.3

classification 💻 cs.CL

keywords: quality filtering · classifier-based filtering · pre-training corpora · FineWeb-Edu · Wikipedia-style reformatting · data curation · LLM training data

The pith

A simple Wikipedia-style reformatting operation can reverse a quality classifier's filtering decision on roughly 7 percent of documents.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that classifier-based quality filtering for LLM pre-training data is sensitive to document presentation rather than content alone. Applying a straightforward reformatting step that mimics Wikipedia style changes the model's quality score enough to flip the accept-or-reject outcome for a measurable share of documents. This allows material that would otherwise be filtered out to enter the training corpus. A reader would care because current pipelines rely on these single models to replace older heuristics, so any format-based loophole directly affects what ends up in the data used to train large models.

Core claim

Our analysis reveals that the FineWeb-Edu CQF model would reverse its filtering decision for approximately 7% of evaluated documents, thereby admitting content into the pre-training corpus that would otherwise have been excluded.

What carries the argument

The quality classifier's sensitivity to superficial stylistic reformatting that mimics Wikipedia presentation.

If this is right

Low-quality documents can pass into pre-training data when presented in a Wikipedia-like format.
Single-model classifier filters may need additional checks that ignore surface presentation.
Heuristic-based filtering could remain useful as a complement to avoid format-dependent errors.
Corpus construction pipelines become more brittle if they rely solely on one classifier without format-robustness tests.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Filters may be learning stylistic cues that are easy to manipulate rather than deeper indicators of educational value.
Similar format-based attacks could affect other classifier-driven curation steps beyond quality filtering.
Testing reformatting robustness should become a standard validation step for any new quality model.

Load-bearing premise

The quality classifier's decisions depend primarily on substantive content properties rather than superficial document formatting and stylistic presentation.

What would settle it

Run the classifier on the same documents before and after the Wikipedia-style reformatting and check whether the flipped decisions correlate with independent measures of actual content quality.
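
A minimal Python sketch of the first half of this check, counting decision flips from paired scores. The function and class names are illustrative, and the cutoff of 2.5 is a hypothetical value chosen for the example; this page does not state the threshold the paper uses. The three score pairs are the before/after values quoted in the captions of Figures 6 to 8, not a representative sample.

```python
from dataclasses import dataclass


@dataclass
class FlipStats:
    total: int
    admitted: int  # rejected before reformatting, kept after
    removed: int   # kept before reformatting, rejected after

    @property
    def flip_rate(self) -> float:
        """Share of documents whose keep/reject decision changed."""
        if self.total == 0:
            return 0.0
        return (self.admitted + self.removed) / self.total


def count_flips(score_pairs, threshold):
    """Count decision reversals for (score_before, score_after) pairs.

    A document is treated as kept when its score is >= threshold.
    """
    admitted = removed = 0
    for before, after in score_pairs:
        kept_before = before >= threshold
        kept_after = after >= threshold
        if kept_after and not kept_before:
            admitted += 1
        elif kept_before and not kept_after:
            removed += 1
    return FlipStats(len(score_pairs), admitted, removed)


# Before/after scores from the captions of Figures 6, 7 and 8.
pairs = [(4.17, 4.15), (0.52, 2.76), (-0.04, 1.79)]

# Hypothetical cutoff, used only for illustration.
stats = count_flips(pairs, threshold=2.5)
print(stats.admitted, stats.removed, round(stats.flip_rate, 3))
```

With this hypothetical cutoff, only the second pair crosses the threshold, so the toy rate of one in three says nothing about the paper's 7% figure. Correlating the flipped documents with an independent quality label would additionally require human or external ratings, which this page does not provide.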

Figures

Figures reproduced from arXiv: 2605.23721 by Mateusz Klimaszewski, Piotr Andruszkiewicz.

**Figure 2.** Wikipedia-style rephrasing impact on the …

**Figure 3.** Wikipedia-style rephrasing impact on the FineWeb-Edu CQF model across domains. In each of 26 …

**Figure 4.** Educational prompt listing. You are a Wikipedia-style rephraser. Your objective is to rephrase the following web page to imitate a Wikipedia article. Web page: ```{web_page}``` Follow the rules below during rephrasing: - Focus on containing all the facts from the document, even if they are not essential. - Do not include new facts, concepts and overall new content. - Keep the exact dates, locations, names …

**Figure 5.** Wikipedia-style rephrasing prompt listing.

**Figure 6.** Wikipedia-style rephrasing example #1. The transformation reduced CQF score by 0.02 (4.17 vs 4.15).

**Figure 7.** Wikipedia-style rephrasing example #2. The transformation increased CQF score by 2.24 (0.52 vs 2.76).

**Figure 8.** Wikipedia-style rephrasing example #3. The transformation increased CQF score by 1.83 (-0.04 vs 1.79).

Original abstract

Classifier-based Quality Filtering has recently emerged as a fundamental technique in constructing pre-training corpora. The ability to deploy a single model that can replace or supplement a set of heuristics has proven effective across numerous Large Language Models. In this work, we expose a critical vulnerability in this approach by demonstrating how a straightforward Wikipedia-style reformatting operation can substantially alter a model's quality assessment and enable low-quality content to surpass filtering thresholds. Our analysis reveals that the FineWeb-Edu CQF model would reverse its filtering decision for approximately 7% of evaluated documents, thereby admitting content into the pre-training corpus that would otherwise have been excluded.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The 7% reversal rate under Wikipedia-style reformatting flags a real sensitivity in CQF models, but the abstract supplies no method details or controls to confirm the content stayed low-quality.

The paper's core observation is that applying a Wikipedia-style reformatting to documents flips the FineWeb-Edu classifier's keep/reject decision for roughly 7% of cases. This is a specific empirical signal worth tracking for anyone who uses these models to build pre-training corpora. It directly challenges the assumption that the classifier is mostly responding to substantive educational merit rather than surface presentation. That kind of targeted test is new relative to the earlier CQF papers it cites and gives practitioners a concrete number to worry about. The work is also honest about the stakes: if the flip admits genuinely weak material, it affects downstream model training at scale. Credit for surfacing the issue cleanly from the abstract alone.

The main weakness is that the claim rests on an untested assumption. The abstract does not describe the exact reformatting steps, the size or source of the evaluated set, any statistical tests around the 7% figure, or an independent check that the operation left educational value unchanged. The stress-test concern lands here: without a separate quality metric or comparison to actual Wikipedia pages, it is possible the reformatting added real structure that improved the document on its own terms. That gap makes the interpretation that low-quality content is being admitted harder to evaluate. No circularity or invented entities appear in what is shown.

The paper is for people who curate or filter web-scale data for LLMs and who already know the basic CQF literature. A reader who needs a quick warning flag will get value; someone looking for a fully supported result will not.

It deserves a serious referee because the topic is central to current training pipelines and the observation, if replicated with proper controls, would be actionable. I would send it out for review rather than desk reject, with the expectation that the authors add the missing methodology and validation steps.

Referee Report

2 major / 0 minor

Summary. The paper claims that classifier-based quality filtering (CQF) for pre-training corpora is vulnerable to superficial changes: a straightforward Wikipedia-style reformatting operation alters the FineWeb-Edu CQF model's quality assessment, reversing its filtering decision for approximately 7% of evaluated documents and thereby admitting content that would otherwise have been excluded.

Significance. If substantiated with rigorous controls, the result would be significant for the field because CQF has become a core technique for curating large-scale pre-training data (replacing or supplementing heuristics). Demonstrating sensitivity to formatting rather than content would imply that current CQF pipelines risk systematic inclusion of low-quality material and would motivate development of more robust, content-focused quality metrics.

major comments (2)

[Abstract] Abstract: the 7% reversal claim is presented without any description of the dataset size, sampling method, exact reformatting procedure, statistical controls, or confidence intervals, so the empirical support for the central observation cannot be evaluated.
[Abstract] Abstract: the claim that reformatting admits low-quality content requires that the operation affects only superficial formatting/stylistics while leaving substantive educational merit unchanged, yet the manuscript supplies no independent quality metric (human ratings, comparison to real educational documents, or content analysis) to test this assumption.
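
To illustrate what the first comment asks for, the sketch below computes a Wilson score interval for a flip proportion. The counts (70 flips among 1,000 documents) are hypothetical and chosen only to match the 7% headline rate; the actual sample size is not given here, and nothing indicates that the authors use this particular interval.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion.

    z=1.96 corresponds to an approximate 95% confidence level.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    )
    return centre - half_width, centre + half_width


# Hypothetical counts: 70 reversed decisions out of 1,000 documents.
low, high = wilson_interval(70, 1000)
print(f"95% interval: {low:.3f} to {high:.3f}")
```

The width of the interval depends strongly on `n`, which is why the size of the evaluated set matters for judging the 7% figure.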

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for these constructive comments on the abstract. We address each point below and will revise the manuscript accordingly to strengthen the presentation of our empirical results.

Referee: [Abstract] Abstract: the 7% reversal claim is presented without any description of the dataset size, sampling method, exact reformatting procedure, statistical controls, or confidence intervals, so the empirical support for the central observation cannot be evaluated.

Authors: We agree that the abstract would benefit from additional context on the evaluation setup. The full manuscript describes the dataset (a large random sample drawn from FineWeb), the sampling procedure, the precise Wikipedia-style reformatting steps, and the statistical analysis with confidence intervals. We will revise the abstract to briefly state the evaluation scale and direct readers to the relevant methods section for the remaining details.

revision: partial

Referee: [Abstract] Abstract: the claim that reformatting admits low-quality content requires that the operation affects only superficial formatting/stylistics while leaving substantive educational merit unchanged, yet the manuscript supplies no independent quality metric (human ratings, comparison to real educational documents, or content analysis) to test this assumption.

Authors: The reformatting operation is constructed to alter only document structure and presentation while leaving the underlying text unchanged; the manuscript includes illustrative examples of this preservation. We acknowledge that the current version does not include a separate human rating study or direct comparison against independently labeled educational documents to quantify unchanged merit. We will add a qualitative content analysis of reversed documents in the revised manuscript to better support the claim that substantive content is unaffected.

revision: yes
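
One crude automatic proxy for such a content analysis is to compare the numeric tokens (dates, counts) in a document before and after rephrasing, in the spirit of the prompt rule in Figure 4 to keep exact dates. This is a sketch with hypothetical function names and made-up example strings, not the authors' method, and it cannot detect changes to non-numeric facts.

```python
import re

# Matches integers and numbers with internal separators, e.g. 1998, 1,200, 3.5
_NUMBER = re.compile(r"\d+(?:[.,]\d+)*")


def missing_numbers(original, rewritten):
    """Numeric tokens present in `original` but absent from `rewritten`."""
    kept = set(_NUMBER.findall(rewritten))
    return sorted(set(_NUMBER.findall(original)) - kept)


def added_numbers(original, rewritten):
    """Numeric tokens present in `rewritten` but absent from `original`."""
    return missing_numbers(rewritten, original)


# Made-up example strings, for illustration only.
original = "Founded in 1998, the club has 1,200 members."
rewritten = "The club was founded in 1998 and had 1,200 members as of 2020."

print(missing_numbers(original, rewritten))  # numbers lost in rephrasing
print(added_numbers(original, rewritten))    # numbers introduced by rephrasing
```

A non-empty result from either function flags a document for manual inspection; an empty result does not show that the content is unchanged.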

Circularity Check

0 steps flagged

No circularity: empirical observation of classifier sensitivity

The paper reports an empirical experiment: applying a Wikipedia-style reformatting operation to documents and measuring the rate at which the FineWeb-Edu CQF model reverses its filtering decision (approximately 7%). No derivation, prediction, fitted parameter, or self-citation chain is present that reduces the central claim to its own inputs by construction. The result is a direct measurement of model behavior under a described transformation and does not rely on any of the enumerated circular patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies insufficient detail to enumerate free parameters, axioms, or invented entities.

pith-pipeline@v0.9.0 · 5630 in / 944 out tokens · 29062 ms · 2026-05-25T06:10:02.588660+00:00 · methodology
