Pith · machine review for the scientific record

arXiv: 2604.04948 · v1 · submitted 2026-03-30 · 💻 cs.IR · cs.AI · cs.LG

From PDF to RAG-Ready: Evaluating Document Conversion Frameworks for Domain-Specific Question Answering

Alexandre Sousa, Brígida Mónica Faria, Henrique Lopes Cardoso, José Duarte, José Guilherme Marques dos Santos, José Luís Reis, José Paulo Marques dos Santos, Luís Paulo Reis, Pedro Pimenta, Ricardo Yang, Rui Humberto Pereira

Pith reviewed 2026-05-14 02:01 UTC · model grok-4.3

classification 💻 cs.IR · cs.AI · cs.LG
keywords RAG · PDF conversion · document preprocessing · question answering · metadata enrichment · hierarchy-aware chunking · GraphRAG · LLM evaluation

The pith

Metadata enrichment and hierarchy-aware chunking improve RAG accuracy more than the choice of PDF conversion framework.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper systematically tests four PDF-to-Markdown tools across 19 pipeline variants on 36 Portuguese administrative documents and a 50-question benchmark. It finds that adding metadata and using hierarchy-aware chunking lifts accuracy to 94.1 percent, well above the 86.9 percent baseline from a naive loader and close to the 97.1 percent from manually curated Markdown. The conversion tool itself matters less than these preprocessing choices. Font-based hierarchy detection beats LLM-based methods, while a basic GraphRAG setup scores only 82 percent.

Core claim

The central claim is that data-preparation quality is the dominant factor in RAG performance for domain-specific question answering: metadata enrichment and hierarchy-aware chunking contribute more to accuracy than the specific PDF conversion framework. The strongest evidence is Docling with hierarchical splitting reaching 94.1 percent, versus lower scores for other tool combinations.

What carries the argument

Hierarchy-aware chunking paired with metadata enrichment, which uses font information to rebuild document structure and produce better retrieval units.
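The page does not reproduce the paper's exact algorithm, but the mechanism can be sketched: treat distinct font sizes above the body size as heading levels, keep a breadcrumb of section titles, and attach that path to each chunk as retrievable metadata. A minimal sketch under stated assumptions (the `spans` input format and the `body_size` threshold are illustrative, not the paper's implementation):

```python
def rebuild_hierarchy(spans, body_size=10.0):
    """Assign heading levels from font sizes (larger font = higher level) and
    emit chunks tagged with their section path."""
    heading_sizes = sorted({s for _, s in spans if s > body_size}, reverse=True)
    level = {size: i + 1 for i, size in enumerate(heading_sizes)}  # 1 = top level
    path, chunks, buffer = [], [], []

    def flush():
        if buffer:
            # Prefix each chunk with its breadcrumb as metadata for retrieval.
            chunks.append({
                "section_path": " > ".join(title for _, title in path),
                "text": " ".join(buffer),
            })
            buffer.clear()

    for text, size in spans:
        if size in level:          # heading span: close current chunk, update path
            flush()
            lvl = level[size]
            while path and path[-1][0] >= lvl:
                path.pop()
            path.append((lvl, text))
        else:                      # body span: accumulate into the current chunk
            buffer.append(text)
    flush()
    return chunks
```

The point of the sketch is that the structural signal comes from layout (font size), not from an LLM pass, which is the contrast the paper draws.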

If this is right

  • Font-based hierarchy rebuilding outperforms LLM-based structure detection.
  • Metadata enrichment and hierarchical splitting raise accuracy substantially over basic loaders.
  • Naive GraphRAG without ontological guidance underperforms standard RAG.
  • Manual curation sets an upper bound at 97.1 percent, leaving room for automated gains.
  • Including image descriptions during conversion aids performance on documents with visuals.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar preprocessing emphasis could improve RAG results in other languages or document types.
  • Teams should allocate more effort to chunking and enrichment than to switching conversion tools.
  • Ontology-guided graph construction may be needed to make GraphRAG competitive.
  • Expanding the benchmark or adding human raters would test the stability of the LLM-judge results.
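On the GraphRAG point, "ontological guidance" could be as simple as validating extracted triples against a schema of permitted relations per entity type before they enter the graph. A hypothetical sketch (the ontology contents, entity types, and triple format are invented for illustration, not taken from the paper):

```python
# Hypothetical mini-ontology: which relations each subject type may emit.
ONTOLOGY = {
    "Organization": {"REGULATES", "ISSUES"},
    "Document": {"CITES"},
}

def filter_triples(triples, entity_types):
    """Drop extracted (subject, relation, object) triples whose subject's
    type does not license the relation under the ontology."""
    return [(s, r, o) for s, r, o in triples
            if r in ONTOLOGY.get(entity_types.get(s, ""), set())]
```

A filter of this shape would discard the spurious edges that naive open-ended extraction tends to produce, which is one concrete reading of why ungated graph construction underperformed here.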

Load-bearing premise

That LLM-as-judge scores on 50 questions reliably measure true downstream question-answering quality without human validation or error bars.

What would settle it

A side-by-side human evaluation of answers from the best automated pipeline and the naive baseline on the same 50 questions.
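If such a human study were run, agreement between the LLM judge and human raters on the 50 binary correctness labels could be summarized with Cohen's kappa. A self-contained sketch (the rater labels in the test are illustrative, not the paper's data):

```python
def cohens_kappa(judge, human):
    """Chance-corrected agreement between two binary raters (1 = answer judged correct)."""
    assert len(judge) == len(human) and judge
    n = len(judge)
    observed = sum(j == h for j, h in zip(judge, human)) / n
    # Expected agreement if the raters were independent with the same marginals.
    p_j, p_h = sum(judge) / n, sum(human) / n
    expected = p_j * p_h + (1 - p_j) * (1 - p_h)
    return 1.0 if expected == 1 else (observed - expected) / (1 - expected)
```

A kappa well above chance on the contested configurations would support the LLM-as-judge premise the review flags below; a low kappa would undercut every reported accuracy gap.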

Figures

Figures reproduced from arXiv: 2604.04948 by Alexandre Sousa, Brígida Mónica Faria, Henrique Lopes Cardoso, José Duarte, José Guilherme Marques dos Santos, José Luís Reis, José Paulo Marques dos Santos, Luís Paulo Reis, Pedro Pimenta, Ricardo Yang, Rui Humberto Pereira.

Figure 1. ETL pipeline workflow. Raw PDFs from the Bronze layer are extracted and transformed into intermediate Markdown (Silver layer), then cleaned and finalized into RAG-ready Markdown with extracted assets (Gold layer). The pipeline supports several configurable transformation options designed to address known issues in framework outputs: HTML table cleaning (converting HTML tables to Markdown tables), LaTeX fo…
Figure 3. Knowledge graph data model. TextChunk nodes store text content with source metadata and embedding vectors. Entity nodes store a unique identifier, name, and semantic type. MENTIONS relationships link text chunks to the entities extracted from them. RELATED relationships capture semantic connections between entities. A semantic deduplication pipeline was subsequently applied to address entity duplication…
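The data model in Figure 3 maps onto two node types and two edge sets. A minimal in-memory sketch, with a crude name-based merge standing in for the paper's semantic deduplication pipeline (field names and the merge rule are assumptions for illustration):

```python
from dataclasses import dataclass, field

@dataclass
class TextChunk:
    chunk_id: str
    text: str
    source: str                          # originating document metadata
    embedding: list = field(default_factory=list)

@dataclass
class Entity:
    entity_id: str
    name: str
    semantic_type: str                   # e.g. "Organization"

@dataclass
class KnowledgeGraph:
    chunks: dict = field(default_factory=dict)
    entities: dict = field(default_factory=dict)
    mentions: set = field(default_factory=set)   # (chunk_id, entity_id)
    related: set = field(default_factory=set)    # (entity_id, entity_id)

    def dedupe_by_name(self):
        """Merge entities sharing a normalized name and rewrite their edges.
        A crude stand-in: the paper's pipeline deduplicates semantically."""
        canon = {}
        for eid, ent in list(self.entities.items()):
            key = ent.name.strip().lower()
            if key in canon:
                keep = canon[key]
                self.mentions = {(c, keep if e == eid else e)
                                 for c, e in self.mentions}
                self.related = {(keep if a == eid else a, keep if b == eid else b)
                                for a, b in self.related}
                del self.entities[eid]
            else:
                canon[key] = eid
```

Entity duplication of exactly this kind (surface-form variants of one real-world entity) is the failure mode the figure's deduplication step exists to address.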
Original abstract

Retrieval-Augmented Generation (RAG) systems depend critically on the quality of document preprocessing, yet no prior study has evaluated PDF processing frameworks by their impact on downstream question-answering accuracy. We address this gap through a systematic comparison of four open-source PDF-to-Markdown conversion frameworks, Docling, MinerU, Marker, and DeepSeek OCR, across 19 pipeline configurations for extracting text and other contents from PDFs, varying the conversion tool, cleaning transformations, splitting strategy, and metadata enrichment. Evaluation was performed using a manually curated 50-question benchmark over a corpus of 36 Portuguese administrative documents (1,706 pages, ~492K words), with LLM-as-judge scoring averaged over 10 runs. Two baselines bounded the results: naïve PDFLoader (86.9%) and manually curated Markdown (97.1%). Docling with hierarchical splitting and image descriptions achieved the highest automated accuracy (94.1%). Metadata enrichment and hierarchy-aware chunking contributed more to accuracy than the conversion framework choice alone. Font-based hierarchy rebuilding consistently outperformed LLM-based approaches. An exploratory GraphRAG implementation scored only 82%, underperforming basic RAG, suggesting that naïve knowledge graph construction without ontological guidance does not yet justify its added complexity. These findings demonstrate that data preparation quality is the dominant factor in RAG system performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript evaluates four open-source PDF-to-Markdown conversion frameworks (Docling, MinerU, Marker, DeepSeek OCR) across 19 pipeline configurations varying conversion tool, cleaning, splitting strategy, and metadata enrichment. On a manually curated 50-question benchmark over 36 Portuguese administrative documents (1,706 pages), LLM-as-judge scoring averaged over 10 runs shows Docling with hierarchical splitting and image descriptions reaching 94.1% accuracy, above a naive PDFLoader baseline (86.9%) but below manually curated Markdown (97.1%). The authors conclude that metadata enrichment and hierarchy-aware chunking contribute more to accuracy than conversion framework choice alone, while an exploratory GraphRAG scores only 82%.

Significance. If the accuracy attributions hold, the work supplies a useful empirical benchmark for RAG preprocessing in domain-specific settings, underscoring that data-preparation choices dominate over tool selection. Strengths include explicit baselines, multiple runs, a held-out question set, and a manually curated gold standard that ground the comparisons.

major comments (2)
  1. [Evaluation] Evaluation section: the claim that metadata enrichment and hierarchy-aware chunking contributed more to accuracy than conversion framework choice is based solely on LLM-as-judge scores averaged over 10 runs on a fixed 50-question set. No human correlation, inter-annotator agreement, per-question variance, confidence intervals, or significance tests are reported, so observed gaps (e.g., Docling hierarchical+images at 94.1%) could arise from judge preference for particular markdown or chunk formats rather than genuine QA quality.
  2. [Results] Results section: without statistical tests or error bars on the 10-run averages, it is impossible to determine whether differences across the 19 configurations are reliable or could be explained by judge variability alone.
minor comments (1)
  1. [Abstract] Abstract: 'naïve' is rendered with an escaped quote; standardize to 'naive' for readability.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive comments on our evaluation methodology. We address the major points below and will incorporate statistical enhancements in the revision to improve rigor.

Point-by-point responses
  1. Referee: [Evaluation] Evaluation section: the claim that metadata enrichment and hierarchy-aware chunking contributed more to accuracy than conversion framework choice is based solely on LLM-as-judge scores averaged over 10 runs on a fixed 50-question set. No human correlation, inter-annotator agreement, per-question variance, confidence intervals, or significance tests are reported, so observed gaps (e.g., Docling hierarchical+images at 94.1%) could arise from judge preference for particular markdown or chunk formats rather than genuine QA quality.

    Authors: We acknowledge that the attribution of greater impact to metadata enrichment and hierarchy-aware chunking rests on patterns observed across the 19 configurations using LLM-as-judge scores. These patterns emerge from controlled variations where hierarchy and metadata were toggled independently of the conversion tool, with consistent gains (e.g., hierarchical splitting outperforming naive chunking across Docling, MinerU, and Marker). To address the concern, we will add per-configuration standard deviations as error bars, report per-question variance in an appendix, include confidence intervals, and apply paired statistical tests (e.g., t-tests with multiple-comparison correction) to the key differences. We will also expand the limitations section to note that LLM-as-judge may introduce format biases and that human correlation studies remain valuable future work. revision: partial

  2. Referee: [Results] Results section: without statistical tests or error bars on the 10-run averages, it is impossible to determine whether differences across the 19 configurations are reliable or could be explained by judge variability alone.

    Authors: We agree that the absence of error bars and formal statistical tests limits interpretability of the 10-run averages. In the revised manuscript we will add mean ± standard deviation error bars to all tables and figures, and include appropriate tests (ANOVA followed by post-hoc pairwise comparisons with correction) to establish which differences between the 19 pipelines are statistically significant. This will clarify whether gaps such as the 94.1% vs. 86.9% baseline are robust to judge variability. revision: yes
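The paired comparison promised here can be illustrated with a paired t-statistic over per-run accuracies of two pipelines. A stdlib-only sketch (the run scores in the test are synthetic, not the paper's results):

```python
import math

def paired_t(a, b):
    """Paired t-statistic for per-run accuracy differences between two pipelines.
    Each list holds one accuracy per evaluation run, runs aligned by index."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    return mean / math.sqrt(var / n)
```

With 10 runs per configuration the statistic has 9 degrees of freedom; pairing by run removes shared judge variability, which is exactly the confound the referee raises.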

standing simulated objections not resolved
  • A full human evaluation study with inter-annotator agreement to correlate against LLM-as-judge scores, which exceeds the scope and resources of the current revision.

Circularity Check

0 steps flagged

No circularity: purely empirical benchmark with external baselines

Full rationale

The paper conducts a controlled empirical comparison of four PDF-to-Markdown tools across 19 configurations on a fixed 50-question set, reporting LLM-as-judge accuracies against two external baselines (naïve PDFLoader at 86.9% and manual Markdown at 97.1%). No derivations, equations, or first-principles results exist that could reduce to fitted parameters or self-referential definitions. Conclusions about metadata enrichment and hierarchy-aware chunking are drawn directly from observed accuracy deltas, not from any self-citation chain, uniqueness theorem, or ansatz smuggled via prior work. The methodology is self-contained against the held-out question set and does not rename known results or import load-bearing premises from the authors' own citations.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The evaluation rests on the standard assumption that LLM-as-judge scores correlate with human judgments of answer quality and that the 50-question set is representative of real user needs in the domain.

axioms (1)
  • domain assumption LLM-as-judge produces reliable accuracy estimates for RAG outputs
    Used to score all 19 configurations and baselines without reported human validation

pith-pipeline@v0.9.0 · 5599 in / 1172 out tokens · 58659 ms · 2026-05-14T02:01:10.447492+00:00 · methodology



Reference graph

Works this paper leans on

25 extracted references · 25 canonical work pages · 4 internal anchors

  1. [1]

    Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks

    Lewis, P.; Perez, E.; Piktus, A.; Petroni, F.; Karpukhin, V.; Goyal, N.; Küttler, H.; Lewis, M.; Yih, W. -t.; Rocktäschel, T. Retrieval-augmented generation for knowledge-intensive NLP tasks. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS 2020), Online Conference, 2020; pp. 9459-9474, doi:10.48550/arXiv.2005.11401

  2. [2]

    Retrieval-Augmented Generation for Large Language Models: A Survey

    Gao, Y.; Xiong, Y.; Gao, X.; Jia, K.; Pan, J.; Bi, Y.; Dai, Y.; Sun, J.; Wang, M.; Wang, H. Retrieval-Augmented Generation for Large Language Models: A Survey. arXiv 2023, arXiv:2312.10997, doi:10.48550/arXiv.2312.10997

  3. [3]

    A review on Retrieval-Augmented Generation: Architectures, research challenges, and emerging frontiers

    Sharma, P.; Bhattarai, S. A review on Retrieval-Augmented Generation: Architectures, research challenges, and emerging frontiers. Journal of Future Artificial Intelligence and Technologies 2026, 2, 616-628, doi:10.62411/faith.3048-3719-297

  4. [4]

    A RAG data pipeline transforming heterogeneous data into AI-ready format for autonomous building performance discovery

    Li, H.; Comesana, A.; Weyandt, C.; Hong, T. A RAG data pipeline transforming heterogeneous data into AI-ready format for autonomous building performance discovery. Advances in Applied Energy 2026, 21, 100261, doi:10.1016/j.adapen.2025.100261

  5. [5]

    Maximizing RAG efficiency: A comparative analysis of RAG methods

    Şakar, T.; Emekci, H. Maximizing RAG efficiency: A comparative analysis of RAG methods. Natural Language Processing 2025, 31, 1-25, doi:10.1017/nlp.2024.53

  6. [6]

    Mitigating Hallucination by Integrating Knowledge Graphs into LLM Inference – a Systematic Literature Review

    Wagner, R.; Kitzelmann, E.; Boersch, I. Mitigating Hallucination by Integrating Knowledge Graphs into LLM Inference – a Systematic Literature Review. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics, Vienna, Austria, July, 2025; pp. 795-805, doi:10.18653/v1/2025.acl-srw.53

  7. [7]

    Docling Technical Report

    Auer, C.; Lysak, M.; Nassar, A.; Dolfi, M.; Livathinos, N.; Vagenas, P.; Berrospi Ramis, C.; Omenetti, M.; Lindlbauer, F.; Dinkla, K.; et al. Docling Technical Report. arXiv 2024, arXiv:2408.09869, doi:10.48550/arXiv.2408.09869

  8. [8]

    Docling: An Efficient Open-Source Toolkit for AI-driven Document Conversion

    Livathinos, N.; Auer, C.; Lysak, M.; Nassar, A.; Dolfi, M.; Vagenas, P.; Berrospi Ramis, C.; Omenetti, M.; Dinkla, K.; Kim, Y.; et al. Docling: An Efficient Open-Source Toolkit for AI-driven Document Conversion. arXiv 2025, arXiv:2501.17887, doi:10.48550/arXiv.2501.17887

  9. [9]

    OmniDocBench: Benchmarking Diverse PDF Document Parsing with Comprehensive Annotations

    Ouyang, L.; Qu, Y.; Zhou, H.; Zhu, J.; Zhang, R.; Lin, Q.; Wang, B.; Zhao, Z.; Jiang, M.; Zhao, X.; et al. OmniDocBench: Benchmarking Diverse PDF Document Parsing with Comprehensive Annotations. In Proceedings of the 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, June 2025; pp. 24838-24848, doi:10.110...

  10. [10]

    MinerU: An Open-Source Solution for Precise Document Content Extraction

    Wang, B.; Xu, C.; Zhao, X.; Ouyang, L.; Wu, F.; Zhao, Z.; Xu, R.; Liu, K.; Qu, Y.; Shang, F.; et al. MinerU: An open-source solution for precise document content extraction. arXiv 2024, arXiv:2409.18839, doi:10.48550/arXiv.2409.18839

  11. [11]

    marker-pdf 0.3.2: Convert PDF to markdown with high speed and accuracy

    Paruchuri, V.; Kwon, S.; Menta, T.R. marker-pdf 0.3.2: Convert PDF to markdown with high speed and accuracy. Available online: https://github.com/datalab-to/marker (accessed on

  12. [12]

    DeepSeek-OCR: Contexts Optical Compression

    Wei, H.; Sun, Y.; Li, Y. DeepSeek-OCR: Contexts Optical Compression. arXiv 2025, arXiv:2510.18234, doi:10.48550/arXiv.2510.18234

  13. [13]

    Benchmarking Vision-Language Models for French PDF-to-Markdown Conversion

    Rigal, B.; Dupriez, V.; Mignon, A.; Le Hy, R.; Mery, N. Benchmarking Vision-Language Models for French PDF-to-Markdown Conversion. arXiv 2026, arXiv:2602.11960, doi:10.48550/arXiv.2602.11960

  14. [14]

    READoc: A Unified Benchmark for Realistic Document Structured Extraction

    Li, Z.; Abulaiti, A.; Lu, Y.; Chen, X.; Zheng, J.; Lin, H.; Han, X.; Sun, L. READoc: A Unified Benchmark for Realistic Document Structured Extraction. arXiv 2024, arXiv:2409.05137, doi:10.48550/arXiv.2409.05137

  15. [15]

    An approach based on Open Research Knowledge Graph for Knowledge Acquisition from scientific papers

    Jiomekong, A.; Tiwari, S. An approach based on Open Research Knowledge Graph for Knowledge Acquisition from scientific papers. arXiv 2023, arXiv:2308.12981, doi:10.48550/arXiv.2308.12981

  16. [16]

    Docs2KG: Unified knowledge graph construction from heterogeneous documents assisted by Large Language Models

    Sun, Q.; Luo, Y.; Zhang, W.; Li, S.; Li, J.; Niu, K.; Kong, X.; Liu, W. Docs2KG: Unified knowledge graph construction from heterogeneous documents assisted by Large Language Models. arXiv 2024, arXiv:2406.02962, doi:10.48550/arXiv.2406.02962

  17. [17]

    Medallion Architecture

    Steelman Jr., R.L. Medallion Architecture. In Mastering Snowflake DataOps with DataOps.live: An End-to-End Guide to Modern Data Management; Apress: Berkeley, CA, 2025; pp. 247-264, doi:10.1007/979-8-8688-1754-0_23

  18. [18]

    Ontology-grounded knowledge graphs for mitigating hallucinations in large language models for clinical question answering

    Ali, M.; Taha, Z.; Morsey, M.M. Ontology-grounded knowledge graphs for mitigating hallucinations in large language models for clinical question answering. Journal of Biomedical Informatics 2026, 175, 104993, doi:10.1016/j.jbi.2026.104993

  19. [19]

    Hallucination-resistant multimodal content generation through knowledge graph-based reinforcement learning

    Zeng, L.; Lin, X.; Yu, S. Hallucination-resistant multimodal content generation through knowledge graph-based reinforcement learning. Information Fusion 2026, 127, 103783, doi:10.1016/j.inffus.2025.103783

  20. [20]

    LOSS-J Project Repository — Data Lakehouse Architecture

    The LOSS-J Project - Locate Organize Summarize Suggest and Justify. LOSS-J Project Repository — Data Lakehouse Architecture. Available online: https://github.com/sousaalexandre/loss-j/blob/main/docs/data-lakehouse-architecture.pdf (accessed on

  21. [21]

    PyPDFLoader integration

    LangChain. PyPDFLoader integration. Available online: https://docs.langchain.com/oss/python/integrations/document_loaders/pypdfloader (accessed on

  22. [22]

    RAGAs: Automated Evaluation of Retrieval Augmented Generation

    Es, S.; James, J.; Espinosa Anke, L.; Schockaert, S. RAGAs: Automated Evaluation of Retrieval Augmented Generation. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations, St. Julians, Malta, March, 2024; pp. 150-158, doi:10.18653/v1/2024.eacl-demo.16

  23. [23]

    Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

    Zheng, L.; Chiang, W.-L.; Sheng, Y.; Zhuang, S.; Wu, Z.; Zhuang, Y.; Lin, Z.; Li, Z.; Li, D.; Xing, E.P.; et al. Judging LLM-as- a-Judge with MT-Bench and Chatbot Arena. arXiv 2023, arXiv:2306.05685, doi:10.48550/arXiv.2306.05685

  24. [24]

    gpt-oss-120b

    OpenAI. gpt-oss-120b. Available online: https://huggingface.co/openai/gpt-oss-120b (accessed on

  25. [25]

    Core-based Hierarchies for Efficient GraphRAG

    Hossain, J.; Erdem Sarıyüce, A. Core-based Hierarchies for Efficient GraphRAG. arXiv 2026, arXiv:2603.05207, doi:10.48550/arXiv.2603.05207