arxiv: 2605.23721 · v1 · pith:VNMF227Snew · submitted 2026-05-21 · 💻 cs.CL

Is a Document Educational or Just Wikipedia-Style? -- Pitfalls of Classifier-Based Quality Filtering

Pith reviewed 2026-05-25 06:10 UTC · model grok-4.3

classification 💻 cs.CL
keywords: quality filtering · classifier-based filtering · pre-training corpora · FineWeb-Edu · Wikipedia-style reformatting · data curation · LLM training data

The pith

A simple Wikipedia-style reformatting operation can reverse a quality classifier's filtering decision on roughly 7 percent of documents.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that classifier-based quality filtering for LLM pre-training data is sensitive to document presentation rather than content alone. Applying a straightforward reformatting step that mimics Wikipedia style changes the model's quality score enough to flip the accept-or-reject outcome for roughly 7 percent of evaluated documents. This allows material that would otherwise be filtered out to enter the training corpus. A reader would care because current pipelines rely on a single model to replace or supplement older heuristics, so any format-based loophole directly affects what ends up in the data used to train large models.

Core claim

Our analysis reveals that the FineWeb-Edu CQF model would reverse its filtering decision for approximately 7% of evaluated documents, thereby admitting content into the pre-training corpus that would otherwise have been excluded.
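
The reversal rate in the core claim can be made concrete with a small sketch. It is not the authors' code: the accept threshold of 3.0, the `ScoredPair` type, and the helper names are assumptions for illustration, and the last pair is invented. The first three pairs reuse the before/after scores quoted in the captions of Figures 6 to 8.

```python
from dataclasses import dataclass

# Assumed accept threshold; this page does not state the cut-off the paper used.
THRESHOLD = 3.0


@dataclass
class ScoredPair:
    before: float  # classifier score of the original document
    after: float   # classifier score after Wikipedia-style rephrasing


def accepted(score: float, threshold: float = THRESHOLD) -> bool:
    """A document is kept when its score reaches the threshold."""
    return score >= threshold


def flip_counts(pairs, threshold: float = THRESHOLD):
    """Return (newly_admitted, newly_excluded) counts."""
    admitted = sum(
        1 for p in pairs
        if not accepted(p.before, threshold) and accepted(p.after, threshold)
    )
    excluded = sum(
        1 for p in pairs
        if accepted(p.before, threshold) and not accepted(p.after, threshold)
    )
    return admitted, excluded


def flip_rate(pairs, threshold: float = THRESHOLD) -> float:
    """Share of documents whose accept/reject decision changed."""
    if not pairs:
        raise ValueError("need at least one scored pair")
    admitted, excluded = flip_counts(pairs, threshold)
    return (admitted + excluded) / len(pairs)


pairs = [
    ScoredPair(4.17, 4.15),   # scores quoted for Figure 6
    ScoredPair(0.52, 2.76),   # scores quoted for Figure 7
    ScoredPair(-0.04, 1.79),  # scores quoted for Figure 8
    ScoredPair(2.60, 3.40),   # hypothetical pair that crosses the threshold
]

print(flip_counts(pairs), flip_rate(pairs))
```

Under the assumed threshold, none of the three figure examples changes the decision even though two of them gain about two points. Only the invented fourth pair flips, which illustrates that the reported rate depends on how many documents sit close to the cut-off.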

What carries the argument

The quality classifier's sensitivity to superficial stylistic reformatting that mimics Wikipedia presentation.

If this is right

  • Low-quality documents can pass into pre-training data when presented in a Wikipedia-like format.
  • Single-model classifier filters may need additional checks that ignore surface presentation.
  • Heuristic-based filtering could remain useful as a complement to avoid format-dependent errors.
  • Corpus construction pipelines become more brittle if they rely solely on one classifier without format-robustness tests.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Filters may be learning stylistic cues that are easy to manipulate rather than deeper indicators of educational value.
  • Similar format-based attacks could affect other classifier-driven curation steps beyond quality filtering.
  • Testing reformatting robustness should become a standard validation step for any new quality model.

Load-bearing premise

The quality classifier's decisions depend primarily on substantive content properties rather than superficial document formatting and stylistic presentation.

What would settle it

Run the classifier on the same documents before and after the Wikipedia-style reformatting and check whether the flipped decisions correlate with independent measures of actual content quality.
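
A minimal sketch of that check, assuming each document carries a classifier score before and after rephrasing plus an independent quality judgment (for example a human rating). The record layout, the field names, the threshold of 3.0, and all values are hypothetical and do not come from the paper.

```python
def newly_admitted_low_quality_share(records, threshold=3.0):
    """Among documents that move from reject to accept after rephrasing,
    return the share that the independent judgment marks as low quality.
    Returns None when no document was newly admitted."""
    admitted = [
        r for r in records
        if r["before"] < threshold <= r["after"]
    ]
    if not admitted:
        return None
    low = sum(1 for r in admitted if not r["independently_good"])
    return low / len(admitted)


# Hypothetical records for illustration only.
records = [
    {"before": 2.5, "after": 3.2, "independently_good": False},
    {"before": 2.9, "after": 3.1, "independently_good": True},
    {"before": 1.0, "after": 3.5, "independently_good": False},
    {"before": 3.5, "after": 3.6, "independently_good": True},
]

share = newly_admitted_low_quality_share(records)
print(share)
```

A share near 1 would support the reading that rephrasing lets low-quality material through. A share near 0 would suggest the classifier was previously rejecting acceptable documents because of their presentation.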

Figures

Figures reproduced from arXiv: 2605.23721 by Mateusz Klimaszewski, Piotr Andruszkiewicz.

Figure 1: An educational classifier can be manipulated
Figure 2: Wikipedia-style rephrasing impact on the …
Figure 3: Wikipedia-style rephrasing impact on the FineWeb-Edu CQF model across domains. In each of 26 …
Figure 4: Educational prompt listing. You are a Wikipedia-style rephraser. Your objective is to rephrase the following web page to imitate a Wikipedia article. Web page: ```{web_page}``` Follow the rules below during rephrasing: - Focus on containing all the facts from the document, even if they are not essential. - Do not include new facts, concepts and overall new content. - Keep the exact dates, locations, names …
Figure 5: Wikipedia-style rephrasing prompt listing.
Figure 6: Wikipedia-style rephrasing example #1. The transformation reduced CQF score by 0.02 (4.17 vs 4.15).
Figure 7: Wikipedia-style rephrasing example #2. The transformation increased CQF score by 2.24 (0.52 vs 2.76).
Figure 8: Wikipedia-style rephrasing example #3. The transformation increased CQF score by 1.83 (-0.04 vs 1.79).
Original abstract

Classifier-based Quality Filtering has recently emerged as a fundamental technique in constructing pre-training corpora. The ability to deploy a single model that can replace or supplement a set of heuristics has proven effective across numerous Large Language Models. In this work, we expose a critical vulnerability in this approach by demonstrating how a straightforward Wikipedia-style reformatting operation can substantially alter a model's quality assessment and enable low-quality content to surpass filtering thresholds. Our analysis reveals that the FineWeb-Edu CQF model would reverse its filtering decision for approximately 7% of evaluated documents, thereby admitting content into the pre-training corpus that would otherwise have been excluded.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper claims that classifier-based quality filtering (CQF) for pre-training corpora is vulnerable to superficial changes: a straightforward Wikipedia-style reformatting operation alters the FineWeb-Edu CQF model's quality assessment, reversing its filtering decision for approximately 7% of evaluated documents and thereby admitting content that would otherwise have been excluded.

Significance. If substantiated with rigorous controls, the result would be significant for the field because CQF has become a core technique for curating large-scale pre-training data (replacing or supplementing heuristics). Demonstrating sensitivity to formatting rather than content would imply that current CQF pipelines risk systematic inclusion of low-quality material and would motivate development of more robust, content-focused quality metrics.

major comments (2)
  1. [Abstract] The 7% reversal claim is presented without any description of the dataset size, sampling method, exact reformatting procedure, statistical controls, or confidence intervals, so the empirical support for the central observation cannot be evaluated.
  2. [Abstract] The claim that reformatting admits low-quality content requires that the operation affects only superficial formatting/stylistics while leaving substantive educational merit unchanged, yet the manuscript supplies no independent quality metric (human ratings, comparison to real educational documents, or content analysis) to test this assumption.
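
To illustrate the kind of confidence interval the first comment asks for, here is a sketch of a Wilson score interval for a reversal proportion. The counts (70 reversals among 1,000 documents) are hypothetical, chosen only to match a rate of 7%; the paper's actual sample size is not given on this page.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion.
    z = 1.96 corresponds to an approximate 95% confidence level."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must lie between 0 and n")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half_width, centre + half_width


# Hypothetical counts: 70 reversed decisions out of 1,000 evaluated documents.
low, high = wilson_interval(70, 1000)
print(f"{low:.3f} to {high:.3f}")
```

The width of the interval shrinks as the sample grows, which is why the missing dataset size matters for judging how precise the 7% figure is.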

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for these constructive comments on the abstract. We address each point below and will revise the manuscript accordingly to strengthen the presentation of our empirical results.

  1. Referee: [Abstract] The 7% reversal claim is presented without any description of the dataset size, sampling method, exact reformatting procedure, statistical controls, or confidence intervals, so the empirical support for the central observation cannot be evaluated.

    Authors: We agree that the abstract would benefit from additional context on the evaluation setup. The full manuscript describes the dataset (a large random sample drawn from FineWeb), the sampling procedure, the precise Wikipedia-style reformatting steps, and the statistical analysis with confidence intervals. We will revise the abstract to briefly state the evaluation scale and direct readers to the relevant methods section for the remaining details.
    revision: partial

  2. Referee: [Abstract] The claim that reformatting admits low-quality content requires that the operation affects only superficial formatting/stylistics while leaving substantive educational merit unchanged, yet the manuscript supplies no independent quality metric (human ratings, comparison to real educational documents, or content analysis) to test this assumption.

    Authors: The reformatting operation is constructed to alter only document structure and presentation while preserving the facts stated in the underlying text; the manuscript includes illustrative examples of this preservation. We acknowledge that the current version does not include a separate human rating study or direct comparison against independently labeled educational documents to quantify unchanged merit. We will add a qualitative content analysis of reversed documents in the revised manuscript to better support the claim that substantive content is unaffected.
    revision: yes

Circularity Check

0 steps flagged

No circularity: empirical observation of classifier sensitivity


The paper reports an empirical experiment: applying a Wikipedia-style reformatting operation to documents and measuring the rate at which the FineWeb-Edu CQF model reverses its filtering decision (approximately 7%). No derivation, prediction, fitted parameter, or self-citation chain is present that reduces the central claim to its own inputs by construction. The result is a direct measurement of model behavior under a described transformation and does not rely on any of the enumerated circular patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies insufficient detail to enumerate free parameters, axioms, or invented entities.

pith-pipeline@v0.9.0 · 5630 in / 944 out tokens · 29062 ms · 2026-05-25T06:10:02.588660+00:00 · methodology

