WCXB: A Multi-Type Web Content Extraction Benchmark
Pith reviewed 2026-05-21 04:54 UTC · model grok-4.3
The pith
Web extractors reach high accuracy on news articles but show large performance gaps on forums, products, and other page types.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors introduce the Web Content Extraction Benchmark consisting of 2,008 pages from 1,613 domains across seven page types. They show through systematic evaluation that while leading systems achieve an F1 score of 0.93 on articles, scores range from 0.41 to 0.84 on other structured page types, exposing weaknesses that article-only benchmarks miss.
What carries the argument
The multi-type web page dataset with ground truth annotations produced by a five-stage pipeline including LLM assistance and human review.
If this is right
- Existing evaluation practices underestimate the difficulty of extracting content from non-article web pages.
- Development of extraction systems should prioritize handling of forums, product pages, and similar structures.
- Applications that rely on web content, such as retrieval-augmented generation, may need to incorporate type-specific handling.
- Future benchmarks should adopt similar multi-type coverage to better reflect real-world web diversity.
Where Pith is reading between the lines
- Improved extraction across page types could enhance the quality of large-scale web data used for training language models.
- Search engines and content aggregators might see better results by adopting methods tuned to this benchmark's findings.
- Researchers could extend this work by testing new neural architectures specifically on the structured page subsets.
Load-bearing premise
The ground truth labels created by the five-stage annotation process correctly identify the main content on all seven page types.
What would settle it
An independent annotation effort on the test set that produces substantially different main content labels for a significant portion of pages.
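As a rough illustration of how such a check might be scored, the sketch below compares two sets of main-content labels and reports the share of pages on which they diverge. Everything here is an assumption for illustration: the token-set Jaccard measure, the 0.8 threshold, the function names, and the toy pages are hypothetical and do not come from the paper.

```python
def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard overlap between two annotations (assumed measure)."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a and not set_b:
        return 1.0  # two empty annotations agree trivially
    return len(set_a & set_b) / len(set_a | set_b)


def disagreement_rate(original, independent, threshold=0.8):
    """Return (fraction of shared pages below `threshold`, sorted flagged page ids)."""
    shared = original.keys() & independent.keys()
    if not shared:
        raise ValueError("the two annotation sets share no pages")
    flagged = sorted(
        page for page in shared
        if jaccard(original[page], independent[page]) < threshold
    )
    return len(flagged) / len(shared), flagged


# Hypothetical toy annotations keyed by page id, not benchmark data.
original = {
    "p1": "main content text",
    "p2": "product title price description",
}
independent = {
    "p1": "main content text",
    "p2": "product title related items footer",
}
rate, flagged = disagreement_rate(original, independent)
print(rate, flagged)
```

What counts as "substantially different" or "a significant portion" is left open by the text above, so both the threshold and the acceptable rate would need to be fixed before running such a comparison.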
Original abstract
Web content extraction - isolating a page's main content from surrounding boilerplate - is a prerequisite for search indexing, retrieval-augmented generation, NLP dataset construction, and large language model training. Progress in this area has been constrained by the limitations of existing evaluation benchmarks, which are small (100-800 pages), restricted to news articles, or based on web pages from over a decade ago. We introduce the Web Content Extraction Benchmark (WCXB), a dataset of 2,008 web pages from 1,613 domains spanning seven structurally distinct page types: articles, forums, products, collections, listings, documentation, and service pages. The dataset includes a 1,497-page development set and a 511-page held-out test set with matched page type distributions. Ground truth annotations were produced through a five-stage pipeline: LLM-assisted drafting, automated verification, four-pass frontier model review, snippet and quality verification scripts, and human review. We evaluate 13 extraction systems - 11 heuristic and 2 neural - and find that while top systems converge on articles (F1 = 0.93), performance diverges sharply on structured page types (F1 = 0.41-0.84), revealing blind spots invisible to existing article-only benchmarks. The dataset is released under CC-BY-4.0 with HTML source files, ground truth annotations, page type labels, and baseline results.
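To make the reported per-type scores more concrete, here is a minimal sketch of how an F1 score per page type could be computed. It assumes a bag-of-words token F1 with lowercase, whitespace tokenization; the abstract does not specify the exact scoring procedure, so this may differ from what the authors used. The function names and sample records are hypothetical.

```python
from collections import Counter, defaultdict


def token_f1(predicted: str, reference: str) -> float:
    """Bag-of-words F1 between extracted text and ground truth (assumed metric)."""
    pred_tokens = Counter(predicted.lower().split())
    ref_tokens = Counter(reference.lower().split())
    if not pred_tokens and not ref_tokens:
        return 1.0  # both empty: treat as perfect agreement
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((pred_tokens & ref_tokens).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_tokens.values())
    recall = overlap / sum(ref_tokens.values())
    return 2 * precision * recall / (precision + recall)


def mean_f1_by_type(records):
    """Average token F1 per page type; records are (page_type, predicted, reference)."""
    scores = defaultdict(list)
    for page_type, predicted, reference in records:
        scores[page_type].append(token_f1(predicted, reference))
    return {page_type: sum(v) / len(v) for page_type, v in scores.items()}


# Hypothetical toy records, not taken from the benchmark.
records = [
    ("article", "the quick brown fox", "the quick brown fox"),
    ("forum", "reply one login sidebar", "reply one reply two"),
]
by_type = mean_f1_by_type(records)
print(by_type)
```

Reporting the mean separately for each page type, rather than one pooled average, is what lets a gap between articles and structured pages show up at all.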
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces WCXB, a benchmark dataset of 2,008 web pages from 1,613 domains spanning seven page types (articles, forums, products, collections, listings, documentation, service pages), with a 1,497-page development set and 511-page held-out test set. Ground truth is created via a five-stage pipeline (LLM-assisted drafting, automated verification, four-pass frontier model review, snippet/quality scripts, human review). The authors evaluate 13 systems (11 heuristic, 2 neural) and report that top systems achieve F1=0.93 on articles but show sharp divergence (F1 0.41-0.84) on the other six types, arguing this exposes blind spots missed by prior article-only benchmarks. The dataset is released under CC-BY-4.0 with HTML sources, annotations, page-type labels, and baselines.
Significance. If the ground-truth labels prove reliable, the work would be significant for the field: it provides the first large-scale, multi-type benchmark for web content extraction, directly addressing the narrow scope of prior datasets (100-800 pages, news-only, or decade-old). The explicit release of raw HTML, annotations, and matched splits supports reproducibility and downstream use in search, RAG, and LLM training. The empirical finding that performance converges on articles but diverges elsewhere is a concrete, falsifiable observation that could guide targeted improvements in both heuristic and neural extractors.
major comments (1)
- [Annotation pipeline description (abstract and §3)] The central empirical claim—that performance divergence on structured page types reflects genuine system limitations rather than annotation artifacts—rests on the reliability of the five-stage ground-truth pipeline. No inter-annotator agreement scores, per-type error rates, or disagreement-resolution statistics are reported for the non-article categories (forums, listings, documentation). Without these quantitative checks, systematic bias on structurally complex pages could inflate the observed F1 spread (0.41-0.84).
minor comments (2)
- [§3.1] The abstract states 2,008 pages and 1,613 domains; the manuscript should explicitly confirm these counts and the exact page-type distribution in both dev and test splits (Table 1 or §3.1).
- [§4] Baseline system descriptions would benefit from a short table summarizing the 11 heuristics and 2 neural models (e.g., key parameters or public implementations) to aid reproducibility.
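The first minor comment asks for the exact page-type distribution in both splits. A minimal sketch of such a check is shown below: it computes each split's type proportions and the largest gap between them. The label lists are hypothetical toy data (the per-type counts are not given here), and the function names are illustrative only.

```python
from collections import Counter


def type_proportions(labels):
    """Map each page type to its share of the split."""
    if not labels:
        raise ValueError("split has no pages")
    counts = Counter(labels)
    total = sum(counts.values())
    return {page_type: n / total for page_type, n in counts.items()}


def max_proportion_gap(dev_labels, test_labels):
    """Largest absolute difference in type share between the two splits."""
    dev = type_proportions(dev_labels)
    test = type_proportions(test_labels)
    all_types = dev.keys() | test.keys()
    return max(abs(dev.get(t, 0.0) - test.get(t, 0.0)) for t in all_types)


# Hypothetical toy splits, not the real 1,497 / 511 page sets.
dev_labels = ["article"] * 6 + ["forum"] * 3 + ["product"] * 1
test_labels = ["article"] * 3 + ["forum"] * 1 + ["product"] * 1
gap = max_proportion_gap(dev_labels, test_labels)
print(round(gap, 3))
```

A gap near zero for every type would be consistent with the claim of matched page type distributions.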
Simulated Author's Rebuttal
We thank the referee for their detailed review and for highlighting the importance of validating the annotation pipeline. We provide a point-by-point response to the major comment below.
- Referee: The central empirical claim, that performance divergence on structured page types reflects genuine system limitations rather than annotation artifacts, rests on the reliability of the five-stage ground-truth pipeline. No inter-annotator agreement scores, per-type error rates, or disagreement-resolution statistics are reported for the non-article categories (forums, listings, documentation). Without these quantitative checks, systematic bias on structurally complex pages could inflate the observed F1 spread (0.41-0.84).
Authors: We acknowledge that reporting inter-annotator agreement or similar quantitative reliability metrics would provide additional assurance regarding the ground truth quality, particularly for the more structurally complex page types. Our annotation pipeline relies on a combination of LLM-assisted initial drafting, multiple automated verification steps, four-pass reviews by frontier models, and a final human review to minimize errors. However, the process was not designed with multiple independent human annotators, which precludes traditional IAA calculations. In response to this comment, we will revise Section 3 to include a more detailed description of the human review stage, including the specific instructions given to the annotator and examples of common corrections. Additionally, we will report the proportion of pages per type that required significant edits during the human review as an indirect measure of annotation reliability. We believe these additions will help substantiate the robustness of our ground truth without overclaiming the availability of IAA scores.
Revision: partial
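The indirect measure proposed in the rebuttal, the proportion of pages per type that required significant edits during human review, could be computed along the following lines. This is only a sketch: the rebuttal does not define "significant", so the code assumes it means a token-sequence similarity below a hypothetical threshold of 0.9 between the pre-review draft and the final annotation. The function name and sample pages are illustrative.

```python
from collections import Counter
from difflib import SequenceMatcher


def significant_edit_rate(pages, threshold=0.9):
    """Share of pages per type whose draft/final similarity falls below `threshold`.

    `pages` is an iterable of (page_type, draft_text, final_text) tuples.
    """
    totals, edited = Counter(), Counter()
    for page_type, draft, final in pages:
        totals[page_type] += 1
        # Compare token sequences so small whitespace changes do not count.
        similarity = SequenceMatcher(None, draft.split(), final.split()).ratio()
        if similarity < threshold:
            edited[page_type] += 1
    return {page_type: edited[page_type] / totals[page_type] for page_type in totals}


# Hypothetical toy pages, not taken from the benchmark.
pages = [
    ("article", "a b c d", "a b c d"),
    ("forum", "post one nav menu", "post one post two"),
    ("forum", "thread title first reply", "thread title first reply"),
]
rates = significant_edit_rate(pages)
print(rates)
```

A markedly higher rate on forums or listings than on articles would suggest that the automated stages are less reliable on those types, which is the concern the referee raises.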
Circularity Check
No circularity: empirical benchmark with independent ground truth
This is an empirical dataset release and system evaluation paper with no mathematical derivations, parameter fitting, or predictive modeling. Ground truth is produced via an external five-stage annotation pipeline and then used to measure F1 scores on held-out test pages; the reported performance differences are direct measurements rather than quantities derived from the paper's own inputs or self-citations. No steps reduce by construction to the inputs.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The five-stage annotation pipeline produces accurate ground truth for main content on all page types.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem.
  Passage: "Ground truth annotations were produced through a five-stage pipeline: LLM-assisted drafting, automated verification, four-pass frontier model review, snippet and quality verification scripts, and human review."
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean, reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem.
  Passage: "while top systems converge on articles (F1 = 0.93), performance diverges sharply on structured page types (F1 = 0.41-0.84)"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.