pith. machine review for the scientific record.

arxiv: 2602.15189 · v2 · submitted 2026-02-16 · 💻 cs.IR · cs.AI · cs.CL

Recognition: no theorem link

ScrapeGraphAI-100k: Dataset for Schema-Constrained LLM Generation

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 21:33 UTC · model grok-4.3

classification 💻 cs.IR · cs.AI · cs.CL
keywords schema-constrained generation · structured output · JSON schema · LLM distillation · web extraction · telemetry dataset · fine-tuning · conformance evaluation

The pith

A dataset of 93,695 real schema-constrained extraction events from ScrapeGraphAI usage lets a 1.7B model track its GPT-5-nano teacher's output distribution while still trailing a larger reference on schema compliance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents ScrapeGraphAI-100k, a corpus of nearly 94,000 paired examples drawn from actual tool usage rather than synthetic generation. Each example supplies Markdown page content, a user prompt, a target JSON schema, the LLM response, and structural validity labels computed with jsonschema-rs. The authors characterize the diversity across more than 18,000 unique schemas and note clear performance drops once schema complexity passes certain thresholds. In a distillation experiment, fine-tuning a 1.7B student on the data produces outputs whose distribution closely matches the GPT-5-nano teacher, though the student still underperforms a 30B-A3B model on strict schema adherence.
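Concretely, the structural validity label is a mechanical check: parse the response, then validate it against the supplied schema. A minimal sketch, assuming illustrative field names (the release's exact keys may differ) and a toy validator standing in for jsonschema-rs that covers only `type`, `properties`, and `required` for flat object schemas:

```python
import json

# Toy stand-in for a JSON Schema validator (the paper uses jsonschema-rs).
# Covers only "type", "properties", and "required" -- just enough to
# illustrate how a structural conformance label is derived for one entry.
TYPES = {"object": dict, "array": list, "string": str,
         "number": (int, float), "boolean": bool}

def conforms(instance, schema):
    expected = TYPES.get(schema.get("type", "object"))
    if expected is not None and not isinstance(instance, expected):
        return False
    if isinstance(instance, dict):
        if any(k not in instance for k in schema.get("required", [])):
            return False
        return all(conforms(instance[k], sub)
                   for k, sub in schema.get("properties", {}).items()
                   if k in instance)
    return True

# Hypothetical entry; real field names in the release may differ.
entry = {
    "content": "# Acme Widget\n\nPrice: $19.99",   # Markdown page content
    "prompt": "Extract the product name and price.",
    "schema": {
        "type": "object",
        "properties": {"name": {"type": "string"},
                       "price": {"type": "number"}},
        "required": ["name", "price"],
    },
    "response": '{"name": "Acme Widget", "price": 19.99}',
}

def structural_label(entry):
    """True iff the LLM response parses and conforms to the schema."""
    try:
        instance = json.loads(entry["response"])
    except json.JSONDecodeError:
        return False
    return conforms(instance, entry["schema"])

print(structural_label(entry))  # True
```

Note that a response can be perfectly parseable JSON and still earn a negative label; the label encodes schema conformance, not mere syntactic validity, and says nothing about whether the extracted values are correct.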

Core claim

The central claim is that grounding schema-constrained generation training in large-scale, real-world telemetry data enables small models to approximate the structured output behavior of larger teachers in ways that earlier synthetic or text-only corpora could not achieve.

What carries the argument

The ScrapeGraphAI-100k dataset of 93,695 deduplicated, schema-balanced extraction events, each containing Markdown content, prompt, schema, response, and structural conformance labels.

If this is right

  • Fine-tuning on real practitioner data can produce small models whose structured outputs closely track those of a larger teacher model.
  • The dataset supports benchmarking of schema-constrained generation that goes beyond what synthetic corpora allow.
  • Performance degrades sharply once schema complexity exceeds identifiable thresholds.
  • Scaling data collection from live tool usage provides a viable path for improving structured extraction capabilities.
  • Semantic correctness remains out of scope, so the dataset focuses strictly on structural conformance to the supplied JSON schema.
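How "schema complexity" is scored matters for where those thresholds fall. A hedged sketch of two of the per-schema measures the figures report, nesting depth and key count; the paper's exact definitions may differ:

```python
# Illustrative recursions for two schema-complexity measures named in the
# figures (nesting depth and key count). Treating "items" of an array as
# one extra level is a design choice here, not necessarily the paper's.
def schema_depth(schema: dict) -> int:
    children = list(schema.get("properties", {}).values())
    items = schema.get("items")
    if isinstance(items, dict):
        children.append(items)
    return 1 + max((schema_depth(c) for c in children), default=0)

def key_count(schema: dict) -> int:
    props = schema.get("properties", {})
    total = len(props) + sum(key_count(c) for c in props.values())
    items = schema.get("items")
    return total + (key_count(items) if isinstance(items, dict) else 0)

nested = {"type": "object", "properties": {
    "title": {"type": "string"},
    "offers": {"type": "array", "items": {
        "type": "object",
        "properties": {"price": {"type": "number"}}}}}}
print(schema_depth(nested), key_count(nested))  # 4 3
```

Under measures like these, the paper's finding is that validation rates hold up for shallow, small-key-count schemas and drop sharply past identifiable depth and key-count thresholds (Figure 2).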

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Collecting telemetry from production tools may reduce the distribution shift that synthetic datasets introduce for structured generation tasks.
  • The same approach could be extended to other tool-use scenarios such as function calling or API orchestration.
  • If the structural labels prove reliable, the corpus could serve as a test bed for measuring how schema complexity interacts with model scale.
  • Future versions that add raw HTML or semantic verification would further strengthen the resource for end-to-end extraction pipelines.

Load-bearing premise

The opt-in telemetry events collected from ScrapeGraphAI users are representative of diverse real-world schema-constrained tasks without selection bias from the tool's specific user base or usage patterns.

What would settle it

Retraining the 1.7B student on the dataset and finding that its output distribution no longer tracks the GPT-5-nano teacher, or that its schema-compliance rate fails to approach the reported level, would falsify the distillation result.
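Operationally, that test reduces to two numbers per model on a held-out split: a schema-compliance rate and a distance between output distributions. A sketch under stated assumptions, with response-length histograms standing in crudely for the full output distribution and `is_valid` standing in for a jsonschema-rs-style validator:

```python
from collections import Counter

# Two quantities the falsification test would compare on a held-out split.
# `is_valid(response, schema)` is a stand-in for a jsonschema-rs style
# validator; length histograms are a crude proxy for output distributions.
def compliance_rate(responses, schemas, is_valid):
    return sum(map(is_valid, responses, schemas)) / len(responses)

def length_histogram(responses, bucket=50):
    counts = Counter(len(r) // bucket for r in responses)
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}

def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0))
                     for k in p.keys() | q.keys())
```

If the retrained student's compliance rate failed to approach the reported level, or the distance between student and teacher distributions stayed large, the distillation claim would not replicate.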

Figures

Figures reproduced from arXiv: 2602.15189 by Francesco Zuppichini, Lorenzo Padoan, Marco Vinciguerra, William Brach.

Figure 1. Schema complexity distributions in ScrapeGraphAI-100k (depth, key count, elements, cyclomatic complexity, compos…)
Figure 2. Schema validation rate versus complexity (depth, key count, composite score); the dashed line marks the corpus mean.
Figure 3. Response size distribution for 93,695 extraction…
Figure 4. Overall BLEU vs. model size on the evaluation set.
Figure 5. Correlation matrix of dataset metrics, showing…
Figure 6. Content language distribution (top 15 + Other) for…
Original abstract

Producing output that conforms to a specified JSON schema underlies tool use, structured extraction, and knowledge base construction in modern large language models. Despite this centrality, public datasets for the task remain small, synthetic, or text-only, and rarely pair real page content with the prompts and schemas used in practice. We introduce ScrapeGraphAI-100k, 93,695 schema-constrained extraction events collected via opt-in ScrapeGraphAI telemetry in Q2–Q3 2025, deduplicated and balanced by schema from 9M raw events. The corpus spans 18,000+ unique schemas across 15 named languages plus a long-tail Other category, with English and Traditional Chinese covering 88% of detected content. Each instance pairs Markdown-converted page content with a prompt, schema, LLM response, and per-example jsonschema-rs structural conformance labels (semantic correctness is out of scope, and raw HTML is deferred beyond v1.0). We characterize structural diversity across the corpus and identify sharp failure thresholds as schema complexity grows. As a case study, a 1.7B student fine-tuned on this data closely tracks the output distribution of its GPT-5-nano teacher, though it still trails a 30B-A3B reference (3.3B active parameters) on schema compliance. We offer this distillation result as preliminary evidence that grounding schema-constrained generation in real practitioner workloads at scale enables training and benchmarking that prior synthetic or text-only corpora could not support.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces ScrapeGraphAI-100k, a dataset of 93,695 schema-constrained extraction events collected via opt-in ScrapeGraphAI telemetry (Q2–Q3 2025), deduplicated and balanced by schema from 9M raw events. It spans 18,000+ unique schemas, pairs Markdown page content with prompts/schemas/LLM responses, and supplies jsonschema-rs structural conformance labels (semantic correctness out of scope). A case study shows a 1.7B student model fine-tuned on the data closely tracks its GPT-5-nano teacher’s output distribution while trailing a 30B-A3B reference on schema compliance, offered as preliminary evidence for real-world grounding in schema-constrained generation.

Significance. If the dataset is representative, it would provide a large-scale, real-practitioner resource for training and benchmarking schema-constrained LLM generation that existing synthetic or text-only corpora cannot match. The distillation case study supplies preliminary evidence that grounding in actual workloads can enable effective student-teacher tracking, a strength given the parameter-free nature of the data release itself.

major comments (3)
  1. [Corpus collection] Corpus collection section: The claim that the 93k deduplicated events support general schema-constrained training and benchmarking rests on the unexamined assumption that opt-in ScrapeGraphAI telemetry is representative; no analysis of user demographics, domain skew, opt-in participation bias, or correlation with schema complexity/task type is provided, directly affecting generalizability as highlighted by the weakest assumption.
  2. [Case study] Case study section: The central claim that the 1.7B student 'closely tracks' the GPT-5-nano teacher’s output distribution is supported only by qualitative description; no quantitative metrics (exact match rates, schema compliance percentages, or distributional distances) or baselines are reported, weakening the distillation evidence relative to the 30B-A3B reference.
  3. [Structural labels] Structural labels description: While jsonschema-rs conformance labels are supplied, the manuscript provides no quantitative validation, error analysis, or assessment of label reliability, which is load-bearing because structural conformance is the only labeled signal and semantic correctness is explicitly out of scope.
minor comments (2)
  1. [Abstract] Abstract: The phrase '30B-A3B reference (3.3B active parameters)' should be expanded on first use to clarify the architecture for readers outside the specific model family.
  2. [Corpus characterization] Corpus characterization: The reported language distribution (English and Traditional Chinese at 88%) would benefit from an explicit table or figure showing the full 15-language breakdown plus the detection method used.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on representativeness, evaluation metrics, and label validation. We address each major comment below and outline revisions to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Corpus collection] Corpus collection section: The claim that the 93k deduplicated events support general schema-constrained training and benchmarking rests on the unexamined assumption that opt-in ScrapeGraphAI telemetry is representative; no analysis of user demographics, domain skew, opt-in participation bias, or correlation with schema complexity/task type is provided, directly affecting generalizability as highlighted by the weakest assumption.

    Authors: We agree this is a valid concern for claims of generalizability. The dataset reflects real opt-in usage of ScrapeGraphAI and is not positioned as a demographically balanced sample. In revision we will add an explicit Limitations section discussing potential biases (e.g., English/Traditional Chinese dominance at 88%, opt-in participation effects, and schema complexity skew), clarify that the corpus is intended to capture practitioner workloads rather than universal coverage, and expand the description of the deduplication and schema-balancing procedure to better bound its scope. revision: yes

  2. Referee: [Case study] Case study section: The central claim that the 1.7B student 'closely tracks' the GPT-5-nano teacher’s output distribution is supported only by qualitative description; no quantitative metrics (exact match rates, schema compliance percentages, or distributional distances) or baselines are reported, weakening the distillation evidence relative to the 30B-A3B reference.

    Authors: We acknowledge the evaluation is currently qualitative. In the revised manuscript we will add quantitative results: exact match rates on schema fields, schema compliance percentages computed via the supplied jsonschema-rs labels, and distributional metrics (e.g., structural KL divergence or tree-edit distance) comparing the 1.7B student against both the GPT-5-nano teacher and the 30B-A3B reference. These will be reported in a new table alongside the existing qualitative observations. revision: yes

  3. Referee: [Structural labels] Structural labels description: While jsonschema-rs conformance labels are supplied, the manuscript provides no quantitative validation, error analysis, or assessment of label reliability, which is load-bearing because structural conformance is the only labeled signal and semantic correctness is explicitly out of scope.

    Authors: The labels are produced by the deterministic jsonschema-rs validator, so structural errors are well-defined. We agree an explicit reliability assessment strengthens the release. In revision we will add a subsection detailing the label-generation pipeline, report corpus-wide conformance statistics (e.g., pass rates by schema complexity), and include a brief error analysis of common structural failure modes (type mismatches, missing required fields, etc.) to document label quality. revision: yes
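The failure-mode tally proposed in that response could be sketched as follows for flat object schemas; this is a toy categorizer, not the authors' pipeline (jsonschema-rs reports richer error paths):

```python
import json
from collections import Counter

# Toy failure-mode binning for flat object schemas -- an illustration of
# the proposed error analysis, not the authors' pipeline. Each response
# is assigned its first structural error.
TYPES = {"string": str, "number": (int, float), "boolean": bool,
         "array": list, "object": dict}

def failure_mode(response, schema):
    try:
        obj = json.loads(response)
    except json.JSONDecodeError:
        return "unparseable"
    if not isinstance(obj, dict):
        return "not_an_object"
    for key in schema.get("required", []):
        if key not in obj:
            return "missing_required"
    for key, sub in schema.get("properties", {}).items():
        expected = TYPES.get(sub.get("type"))
        if key in obj and expected and not isinstance(obj[key], expected):
            return "type_mismatch"
    return "valid"

schema = {"type": "object",
          "properties": {"name": {"type": "string"},
                         "price": {"type": "number"}},
          "required": ["name", "price"]}
responses = ['{"name": "A", "price": 1.0}', '{"name": "A"}',
             '{"name": "A", "price": "1.0"}', "not json"]
print(Counter(failure_mode(r, schema) for r in responses))
```

Aggregating such bins by schema-complexity bucket would yield exactly the pass-rate and failure-mode tables the rebuttal commits to.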

Circularity Check

0 steps flagged

No circularity in derivation chain

full rationale

The paper introduces a new dataset collected from opt-in telemetry and presents a preliminary fine-tuning case study as empirical evidence. No mathematical derivations, equations, fitted parameters renamed as predictions, or load-bearing self-citations are present. The claims about dataset characteristics and model tracking are descriptive and observational rather than reducing to inputs by construction. This is a standard dataset release paper with no self-referential loops.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No explicit free parameters, axioms, or invented entities are stated in the abstract; dataset construction implicitly involves choices such as deduplication thresholds and schema balancing criteria that are not detailed.

pith-pipeline@v0.9.0 · 5585 in / 1116 out tokens · 30332 ms · 2026-05-15T21:33:15.535819+00:00 · methodology

discussion (0)


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Web2BigTable: A Bi-Level Multi-Agent LLM System for Internet-Scale Information Search and Extraction

    cs.AI · 2026-04 · unverdicted · novelty 5.0

    Web2BigTable introduces a bi-level multi-agent system that achieves new state-of-the-art results on wide-coverage and deep web-to-table search benchmarks through orchestration, coordination, and closed-loop reflection.

Reference graph

Works this paper leans on

27 extracted references · 27 canonical work pages · cited by 1 Pith paper · 8 internal anchors

  1. [1]

    Janek Bevendorff, Sanket Gupta, Johannes Kiesel, and Benno Stein. 2023. An Empirical Comparison of Web Content Extraction Algorithms. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval (Taipei, Taiwan) (SIGIR ’23). Association for Computing Machinery, New York, NY, USA, 2594–2603. doi:10.1145/35...

  2. [2]

    Common Crawl Foundation. 2026. Common Crawl Dataset. https://commoncrawl.org/ Accessed: 2026-01-06.

  3. [3]

    Domenico Dato, Sean MacAvaney, Franco Maria Nardini, Raffaele Perego, and Nicola Tonellotto. 2022. The Istella22 Dataset: Bridging Traditional and Neural Learning to Rank Evaluation. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval (Madrid, Spain) (SIGIR ’22). Association for Computing Machi...

  4. [4]

    Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. 2023. QLoRA: Efficient Finetuning of Quantized LLMs. arXiv:2305.14314.

  5. [5]

    Qiang Hao, Rui Cai, Yanwei Pang, and Lei Zhang. 2011. From One Tree to a Forest: A Unified Solution for Structured Web Data Extraction. In Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval (Beijing, China) (SIGIR ’11). ACM, New York, NY, USA, 775–784. doi:10.1145/2009916.2010020.

  6. [6]

    Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2021. LoRA: Low-Rank Adaptation of Large Language Models. arXiv:2106.09685 [cs.CL]. https://arxiv.org/abs/2106.09685

  7. [7]

    Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2016. Bag of Tricks for Efficient Text Classification. arXiv preprint arXiv:1607.01759 (2016).

  8. [8]

    Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac Hatfield-Dodds, Nova DasSarma, Eli Tran-Johnson, Scott Johnston, Sheer El-Showk, Andy Jones, Nelson Elhage, Tristan Hume, Anna Chen, Yuntao Bai, Sam Bowman, Stanislav Fort, Deep Ganguli, Danny Hernandez, Josh Jacobson, Jackson Kernion, Shauna Kr...

  9. [9]

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. 2023. Efficient Memory Management for Large Language Model Serving with PagedAttention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles.

  10. [10]

    LangChain AI. 2023. LangChain Benchmarks. https://langchain-ai.github.io/langchain-benchmarks/ Accessed: 2026-01-06.

  11. [11]

    Mengjie Liu, Jiahui Peng, Pei Chu, Jiantao Qiu, Ren Ma, He Zhu, Rui Min, Lindong Lu, Wenchang Ning, Linfeng Hou, Kaiwen Liu, Yuan Qu, Zhenxiang Li, Chao Xu, Zhongying Tu, Wentao Zhang, and Conghui He. 2025. Dripper: Token-Efficient Main HTML Extraction with a Lightweight LM. arXiv:2511.23119 [cs.CL]. https://arxiv.org/abs/2511.23119

  12. [12]

    Zihan Liu, Yan Xu, Tiezheng Yu, Wenliang Dai, Ziwei Ji, Samuel Cahyawijaya, Andrea Madotto, and Pascale Fung. 2020. CrossNER: Evaluating Cross-Domain Named Entity Recognition. arXiv:2012.04373 [cs.CL]. https://arxiv.org/abs/2012.04373

  13. [13]

    Colin Lockard, Prashant Shiralkar, and Xin Luna Dong. 2019. OpenCeres: When Open Information Extraction Meets the Semi-Structured Web. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Jill Burstein, Christy Doran, and Thamar S...

  14. [14]

    Potsawee Manakul, Adian Liusie, and Mark J. F. Gales. 2023. SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models. arXiv:2303.08896 [cs.CL]. https://arxiv.org/abs/2303.08896

  15. [15]

    Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, and Li Deng. 2016. MS MARCO: A Human Generated MAchine Reading COmprehension Dataset. arXiv preprint arXiv:1611.09268 (2016). https://arxiv.org/abs/1611.09268

  16. [16]

    Isaac Ong, Amjad Almahairi, Vincent Wu, Wei-Lin Chiang, Tianhao Wu, Joseph E. Gonzalez, M Waleed Kadous, and Ion Stoica. 2025. RouteLLM: Learning to Route LLMs with Preference Data. arXiv:2406.18665 [cs.LG]. https://arxiv.org/abs/2406.18665

  17. [17]

    Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Pierre Isabelle, Eugene Charniak, and Dekang Lin (Eds.). Association for Computational Linguistics, Philadelphia, Pennsylvania, USA, 31...

  18. [18]

    Guilherme Penedo, Hynek Kydlíček, Loubna Ben allal, Anton Lozhkov, Margaret Mitchell, Colin Raffel, Leandro Von Werra, and Thomas Wolf. 2024. The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale. In The Thirty-eighth Conference on Neural Information Processing Systems Datasets and Benchmarks Track. https://openreview.net/forum?id=n6SCkn2QaG

  19. [19]

    Marco Perini, Lorenzo Padoan, and Marco Vinciguerra. 2024. Scrapegraph-ai. https://github.com/VinciGit00/Scrapegraph-ai

  20. [20]

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21, 140 (2020), 1–67. http://jmlr.org/papers/v21/20-074.html

  21. [21]

    Nathan Ranchin et al. 2025. JSONSchemaBench: A Rigorous Benchmark of Structured Outputs for Language Models. arXiv preprint arXiv:2501.10868 (2025). https://arxiv.org/abs/2501.10868

  22. [22]

    Josselin Somerville Roberts, Tony Lee, Chi Heem Wong, Michihiro Yasunaga, Yifan Mai, and Percy Liang. 2024. Image2Struct: Benchmarking Structure Extraction for Vision-Language Models. arXiv:2410.22456 [cs.CV]. https://arxiv.org/abs/2410.22456

  23. [23]

    Jiejun Tan, Zhicheng Dou, Wen Wang, Mang Wang, Weipeng Chen, and Ji-Rong Wen. 2025. HtmlRAG: HTML is Better Than Plain Text for Modeling Retrieved Knowledge in RAG Systems. In Proceedings of the ACM on Web Conference 2025 (WWW ’25). ACM, 1733–1746. doi:10.1145/3696410.3714546.

  24. [24]

    Jialong Tang, Hongyu Lin, Zhuoqun Li, Yaojie Lu, Xianpei Han, and Le Sun. 2023. Harvesting Event Schemas from Large Language Models. arXiv:2305.07280 [cs.CL]. https://arxiv.org/abs/2305.07280

  25. [25]

    Qwen Team. 2025. Qwen3 Technical Report. arXiv:2505.09388 [cs.CL]. https://arxiv.org/abs/2505.09388

  26. [26]

    Darren Yow-Bang Wang, Zhengyuan Shen, Soumya Smruti Mishra, Zhichao Xu, Yifei Teng, and Haibo Ding. 2025. SLOT: Structuring the Output of Large Language Models. arXiv:2505.04016 [cs.CL]. https://arxiv.org/abs/2505.04016

  27. [27]

    Lingling Xu, Haoran Xie, Si-Zhao Joe Qin, Xiaohui Tao, and Fu Lee Wang. 2023. Parameter-Efficient Fine-Tuning Methods for Pretrained Language Models: A Critical Review and Assessment. arXiv:2312.12148 [cs.CL]. https://arxiv.org/abs/2312.12148