WCXB: A Multi-Type Web Content Extraction Benchmark

Murrough Foley

arxiv: 2605.21097 · v1 · pith:JPRDV4QHnew · submitted 2026-05-20 · 💻 cs.CL

Pith reviewed 2026-05-21 04:54 UTC · model grok-4.3

classification 💻 cs.CL

keywords: web content extraction, benchmark dataset, boilerplate removal, page type classification, information retrieval, natural language processing, web data mining

The pith

Web extractors reach high accuracy on news articles but show large performance gaps on forums, products, and other page types.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper presents a new benchmark dataset for extracting main content from web pages. It includes pages from seven different categories, going beyond the news articles that previous benchmarks focused on. Evaluation of various extraction methods reveals that top performers do consistently well only on articles, while struggling more with structured content. This matters because accurate content extraction is key for many downstream tasks like search and AI training. The new dataset allows for more comprehensive testing of these systems.

Core claim

The authors introduce the Web Content Extraction Benchmark consisting of 2,008 pages from 1,613 domains across seven page types. They show through systematic evaluation that while leading systems achieve an F1 score of 0.93 on articles, scores range from 0.41 to 0.84 on other structured page types, exposing weaknesses that article-only benchmarks miss.
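
This page does not say how the paper computes F1. As a minimal sketch, assuming a bag-of-words comparison between extracted text and ground-truth text (one common choice for extraction benchmarks, not necessarily the paper's), a per-page score could look like this. The function name `token_f1` and the sample strings are illustrative only.

```python
from collections import Counter


def token_f1(extracted: str, reference: str) -> float:
    """Bag-of-words F1 between extracted text and ground-truth text."""
    pred = Counter(extracted.lower().split())
    gold = Counter(reference.lower().split())
    if not pred and not gold:
        return 1.0  # both empty: treat as a perfect match
    overlap = sum((pred & gold).values())  # multiset intersection of tokens
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(gold.values())
    return 2 * precision * recall / (precision + recall)


# Toy example: the noisy output keeps all main content but adds boilerplate tokens.
gold = "the quick brown fox jumps"
clean = "the quick brown fox jumps"
noisy = "home login the quick brown fox jumps share"
print(token_f1(clean, gold))
print(round(token_f1(noisy, gold), 3))
```

In this toy case the noisy extraction has full recall but lower precision, which is the typical failure mode when navigation or sharing widgets leak into the output.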

What carries the argument

The multi-type web page dataset with ground truth annotations produced by a five-stage pipeline including LLM assistance and human review.

If this is right

Existing evaluation practices underestimate the difficulty of extracting content from non-article web pages.
Development of extraction systems should prioritize handling of forums, product pages, and similar structures.
Applications relying on web content like retrieval-augmented generation may need to incorporate type-specific handling.
Future benchmarks should adopt similar multi-type coverage to better reflect real-world web diversity.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Improved extraction across page types could enhance the quality of large-scale web data used for training language models.
Search engines and content aggregators might see better results by adopting methods tuned to this benchmark's findings.
Researchers could extend this work by testing new neural architectures specifically on the structured page subsets.

Load-bearing premise

The ground truth labels created by the five-stage annotation process correctly identify the main content on all seven page types.

What would settle it

An independent annotation effort on the test set that produces substantially different main content labels for a significant portion of pages.

Original abstract

Web content extraction - isolating a page's main content from surrounding boilerplate - is a prerequisite for search indexing, retrieval-augmented generation, NLP dataset construction, and large language model training. Progress in this area has been constrained by the limitations of existing evaluation benchmarks, which are small (100-800 pages), restricted to news articles, or based on web pages from over a decade ago. We introduce the Web Content Extraction Benchmark (WCXB), a dataset of 2,008 web pages from 1,613 domains spanning seven structurally distinct page types: articles, forums, products, collections, listings, documentation, and service pages. The dataset includes a 1,497-page development set and a 511-page held-out test set with matched page type distributions. Ground truth annotations were produced through a five-stage pipeline: LLM-assisted drafting, automated verification, four-pass frontier model review, snippet and quality verification scripts, and human review. We evaluate 13 extraction systems - 11 heuristic and 2 neural - and find that while top systems converge on articles (F1 = 0.93), performance diverges sharply on structured page types (F1 = 0.41-0.84), revealing blind spots invisible to existing article-only benchmarks. The dataset is released under CC-BY-4.0 with HTML source files, ground truth annotations, page type labels, and baseline results.
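
The abstract states that the development and test sets have matched page type distributions but does not describe how the split was drawn. One way to obtain such a split is stratified sampling by page type, sketched below. The function `stratified_split`, the seed, and the toy corpus are assumptions for illustration; only the 511 / 2,008 test share comes from the abstract.

```python
import random
from collections import defaultdict


def stratified_split(pages, test_fraction, seed=0):
    """Split (page_id, page_type) pairs so each type keeps roughly the same share."""
    by_type = defaultdict(list)
    for page_id, page_type in pages:
        by_type[page_type].append(page_id)

    rng = random.Random(seed)
    dev, test = [], []
    for page_type in sorted(by_type):  # sorted for a reproducible order
        ids = by_type[page_type]
        rng.shuffle(ids)
        n_test = round(len(ids) * test_fraction)
        test.extend((i, page_type) for i in ids[:n_test])
        dev.extend((i, page_type) for i in ids[n_test:])
    return dev, test


# Toy corpus: these counts are made up and are not the WCXB distribution.
pages = [(f"a{i}", "article") for i in range(80)]
pages += [(f"f{i}", "forum") for i in range(20)]
dev, test = stratified_split(pages, test_fraction=511 / 2008)
print(len(dev), len(test))
```

Sampling within each type, rather than over the pooled corpus, keeps small categories from being under-represented in the held-out set by chance.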

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

WCXB adds a useful multi-type dataset and shows extractors struggle more outside articles, but annotation reliability on complex pages is the main open question.

The main thing to know is that this paper releases a new benchmark of 2,008 pages across seven page types with a held-out test split, and the evaluations indicate that current systems do fine on articles but vary much more on forums, products, listings, and the rest. That performance gap is the clearest signal in the work. The dataset comes with full HTML and annotations under CC-BY-4.0, which is the practical part that lets others test their own tools directly. They ran 13 systems, 11 heuristic and 2 neural, and the numbers line up with the idea that article-only benchmarks have been hiding weaknesses. That is a straightforward empirical contribution and worth having in the literature.

The annotation pipeline is described as five stages with LLM help, automated checks, model reviews, scripts, and human oversight. Without inter-annotator agreement numbers or per-type error rates in the abstract, though, it is hard to rule out that some of the spread on structured pages comes from label noise rather than extractor limits. The concern is real but not fatal; the dataset release still stands even if the baselines need tighter validation.

This is for people working on web scraping, RAG pipelines, or training data cleaning who need broader test coverage than the old news-only sets. It deserves a serious referee because the data itself is new and usable, even if the paper would benefit from more quantitative checks on the ground truth. I would send it to review and ask specifically for agreement statistics and any selection details on the 2,008 pages.

Referee Report

1 major / 2 minor

Summary. The paper introduces WCXB, a benchmark dataset of 2,008 web pages from 1,613 domains spanning seven page types (articles, forums, products, collections, listings, documentation, service pages), with a 1,497-page development set and 511-page held-out test set. Ground truth is created via a five-stage pipeline (LLM-assisted drafting, automated verification, four-pass frontier model review, snippet/quality scripts, human review). The authors evaluate 13 systems (11 heuristic, 2 neural) and report that top systems achieve F1=0.93 on articles but show sharp divergence (F1 0.41-0.84) on the other six types, arguing this exposes blind spots missed by prior article-only benchmarks. The dataset is released under CC-BY-4.0 with HTML sources, annotations, page-type labels, and baselines.
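
To illustrate why a per-type breakdown matters, the sketch below averages per-page F1 within each page type and compares that with a pooled mean. The helper `mean_f1_by_type` and all scores are hypothetical and are not results from the paper.

```python
from collections import defaultdict
from statistics import mean


def mean_f1_by_type(scores):
    """Average per-page F1 within each page type; scores is (page_type, f1) pairs."""
    grouped = defaultdict(list)
    for page_type, f1 in scores:
        grouped[page_type].append(f1)
    return {page_type: mean(values) for page_type, values in grouped.items()}


# Made-up per-page scores for a single extractor.
scores = [("article", 0.9), ("article", 1.0), ("forum", 0.5), ("forum", 0.3)]
per_type = mean_f1_by_type(scores)
pooled = mean(f1 for _, f1 in scores)
print(per_type)
print(pooled)
```

A pooled mean over an article-heavy corpus would sit close to the article score, so a weak forum score only becomes visible once results are grouped by type.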

Significance. If the ground-truth labels prove reliable, the work would be significant for the field: it provides the first large-scale, multi-type benchmark for web content extraction, directly addressing the narrow scope of prior datasets (100-800 pages, news-only, or decade-old). The explicit release of raw HTML, annotations, and matched splits supports reproducibility and downstream use in search, RAG, and LLM training. The empirical finding that performance converges on articles but diverges elsewhere is a concrete, falsifiable observation that could guide targeted improvements in both heuristic and neural extractors.

major comments (1)

[Annotation pipeline description (abstract and §3)] The central empirical claim—that performance divergence on structured page types reflects genuine system limitations rather than annotation artifacts—rests on the reliability of the five-stage ground-truth pipeline. No inter-annotator agreement scores, per-type error rates, or disagreement-resolution statistics are reported for the non-article categories (forums, listings, documentation). Without these quantitative checks, systematic bias on structurally complex pages could inflate the observed F1 spread (0.41-0.84).
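
For reference, one standard inter-annotator agreement statistic is Cohen's kappa, which corrects raw agreement for agreement expected by chance. The sketch below assumes two annotators who each label the same text blocks as main content or boilerplate; this labelling scheme and the sample labels are hypothetical and are not taken from the paper.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement from each annotator's marginal label frequencies.
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical block-level labels for one page: 1 = main content, 0 = boilerplate.
annotator_a = [1, 1, 1, 0, 0, 1, 0, 0, 1, 1]
annotator_b = [1, 1, 0, 0, 0, 1, 0, 1, 1, 1]
print(round(cohens_kappa(annotator_a, annotator_b), 3))
```

Reporting such a statistic separately for each page type would show whether agreement drops on the structurally complex categories.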

minor comments (2)

[§3.1] The abstract states 2,008 pages and 1,613 domains; the manuscript should explicitly confirm these counts and the exact page-type distribution in both dev and test splits (Table 1 or §3.1).
[§4] Baseline system descriptions would benefit from a short table summarizing the 11 heuristics and 2 neural models (e.g., key parameters or public implementations) to aid reproducibility.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their detailed review and for highlighting the importance of validating the annotation pipeline. We provide a point-by-point response to the major comment below.

Referee: The central empirical claim—that performance divergence on structured page types reflects genuine system limitations rather than annotation artifacts—rests on the reliability of the five-stage ground-truth pipeline. No inter-annotator agreement scores, per-type error rates, or disagreement-resolution statistics are reported for the non-article categories (forums, listings, documentation). Without these quantitative checks, systematic bias on structurally complex pages could inflate the observed F1 spread (0.41-0.84).

Authors: We acknowledge that reporting inter-annotator agreement or similar quantitative reliability metrics would provide additional assurance regarding the ground truth quality, particularly for the more structurally complex page types. Our annotation pipeline relies on a combination of LLM-assisted initial drafting, multiple automated verification steps, four-pass reviews by frontier models, and a final human review to minimize errors. However, the process was not designed with multiple independent human annotators, which precludes traditional IAA calculations. In response to this comment, we will revise Section 3 to include a more detailed description of the human review stage, including the specific instructions given to the annotator and examples of common corrections. Additionally, we will report the proportion of pages per type that required significant edits during the human review as an indirect measure of annotation reliability. We believe these additions will help substantiate the robustness of our ground truth without overclaiming the availability of IAA scores.

revision: partial
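
The indirect measure proposed in the rebuttal, the share of pages per type that needed significant edits during human review, could be computed along the following lines. This is a sketch under stated assumptions: the rebuttal does not define "significant", so the word-level similarity from Python's `difflib.SequenceMatcher` and the 0.9 threshold are hypothetical choices, as are the function name and the sample records.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def significant_edit_rate(records, threshold=0.9):
    """Share of pages per type whose reviewed text differs notably from the draft.

    records: iterable of (page_type, draft_text, reviewed_text).
    threshold: hypothetical cut-off; similarity below it counts as a significant edit.
    """
    totals = defaultdict(int)
    edited = defaultdict(int)
    for page_type, draft, reviewed in records:
        totals[page_type] += 1
        # Compare word sequences rather than characters.
        similarity = SequenceMatcher(None, draft.split(), reviewed.split()).ratio()
        if similarity < threshold:
            edited[page_type] += 1
    return {page_type: edited[page_type] / totals[page_type] for page_type in totals}


# Made-up drafts and reviewed annotations.
records = [
    ("article", "alpha beta gamma delta", "alpha beta gamma delta"),
    ("article", "one two three four", "one two three four"),
    ("forum", "post one reply two", "post one reply two"),
    ("forum", "post one sidebar ads footer", "post one reply two"),
]
rates = significant_edit_rate(records)
print(rates)
```

A markedly higher rate on structured page types would suggest that the pre-review stages are less reliable there, although it would not replace agreement between independent annotators.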

Circularity Check

0 steps flagged

No circularity: empirical benchmark with independent ground truth

This is an empirical dataset release and system evaluation paper with no mathematical derivations, parameter fitting, or predictive modeling. Ground truth is produced via an external five-stage annotation pipeline and then used to measure F1 scores on held-out test pages; the reported performance differences are direct measurements rather than quantities derived from the paper's own inputs or self-citations. No steps reduce by construction to the inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper adds a new evaluation resource without introducing fitted parameters or new entities; its claims rest on the domain assumption that the described annotation pipeline yields reliable ground truth.

axioms (1)

domain assumption: The five-stage annotation pipeline produces accurate ground truth for main content on all page types.
This assumption underpins every reported F1 score and the claim that prior benchmarks missed structured-page weaknesses.

pith-pipeline@v0.9.0 · 5771 in / 1296 out tokens · 40939 ms · 2026-05-21T04:54:49.793058+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Passage: "Ground truth annotations were produced through a five-stage pipeline: LLM-assisted drafting, automated verification, four-pass frontier model review, snippet and quality verification scripts, and human review."

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Passage: "while top systems converge on articles (F1 = 0.93), performance diverges sharply on structured page types (F1 = 0.41-0.84)"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

23 extracted references · 23 canonical work pages

[1] C. Kohlschütter, P. Fankhauser, and W. Nejdl. Boilerplate detection using shallow text features. In WSDM, 2010.
[2] Mozilla. Readability, 2010. https://github.com/mozilla/readability
[3] J. Pomikálek. Removing Boilerplate and Duplicate Content from Web Corpora. PhD thesis, Masaryk University, 2011.
[4] A. Barbaresi. Trafilatura: A web scraping library and command-line tool for text discovery and retrieval. In ACL, 2021.
[5] M. Liu et al. Dripper: Token-efficient main HTML extraction with a lightweight LM. arXiv:2511.23119, 2025.
[6] Jina AI. ReaderLM-v2: HTML to markdown with a small language model. arXiv:2503.01151, 2025.
[7] M. Baroni, F. Chantree, A. Kilgarriff, and S. Sharoff. CleanEval: A competition for cleaning web pages. In LREC, 2008.
[8] M. E. Peters and D. Lecocq. Content extraction using diverse feature sets. In WWW, 2013.
[9] ScrapingHub. Article extraction benchmark, 2019. https://github.com/scrapinghub/article-extraction-benchmark
[10] J. Leonhardt, A. Anand, and M. Khosla. Boilerplate removal using a neural sequence labeling model. In WWW Companion, 2020.
[11] J. Bevendorff, S. Gupta, J. Kiesel, and B. Stein. An empirical comparison of web content extraction algorithms. In SIGIR, 2023.
[12] N. McCurdy. dom-content-extraction, 2024. https://github.com/nickmccurdy/dom-content-extraction
[13] X. Qi and B. D. Davison. Web page classification: Features and algorithms. ACM Computing Surveys, 2009.
[14] A. Broder. A taxonomy of web search. SIGIR Forum, 2002.
[15] J. Bevendorff et al. Elastic ChatNoir: Search engine for the ClueWeb and the Common Crawl. In ECIR, 2018.
[16] M. Foley. rs-trafilatura: Page-type-aware web content extraction, 2026. https://crates.io/crates/rs-trafilatura
[17] T. Vogels, O. E. Ganea, and C. Eickhoff. Web2Text: Deep structured boilerplate removal. In ECIR, 2018.
[18] Newspaper4k Contributors. Newspaper4k: Article scraping and curation, 2024. https://github.com/AndyTheFactory/newspaper4k
[19] D. Grangier, T. Huynh, and M. Loesgen. Goose: Open source article extractor, 2013. https://github.com/goose3/goose3
[20] dom-smoothie Contributors. dom-smoothie: Fast content extraction, 2024. https://github.com/nichochar/dom-smoothie
[21] OpenDataLab. magic-html: Generalised HTML content extraction, 2024. https://github.com/opendatalab/magic-html
[22] J. Bevendorff, M. Potthast, and B. Stein. Resiliparse: A collection of robust and fast processing tools for parsing and analyzing web archive data, 2018–2024. Webis Group. https://resiliparse.chatnoir.eu/
[23] J. Li, J. P. Gardner, D. Kang, F. Shi, K. Singh, C.-L. Li, H. Shandilya, D. L. W. Hall, O. Tuzel, P. Liang, L. Schmidt, H. Pouransari, and F. Faghri. Beyond a single extractor: Re-thinking HTML-to-text extraction for LLM pre-training. In Findings of EACL, 2026.
