From PDF to RAG-Ready: Evaluating Document Conversion Frameworks for Domain-Specific Question Answering
Pith reviewed 2026-05-14 02:01 UTC · model grok-4.3
The pith
Metadata enrichment and hierarchy-aware chunking improve RAG accuracy more than the choice of PDF conversion framework.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that data preparation quality is the dominant factor in RAG performance for domain-specific question answering, with metadata enrichment and hierarchy-aware chunking contributing more to accuracy than the specific PDF conversion framework, as shown by Docling with hierarchical splitting reaching 94.1% versus lower scores for other tool combinations.
What carries the argument
Hierarchy-aware chunking paired with metadata enrichment, which uses font information to rebuild document structure and produce better retrieval units.
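A minimal sketch of what font-based hierarchy rebuilding could look like, assuming the converter exposes text blocks with font sizes; the block structure, the body-size threshold, and all names here are illustrative assumptions, not the paper's implementation. Larger fonts map to higher heading levels, and body text is grouped under the nearest heading trail so each chunk carries its section context as metadata.

```python
from dataclasses import dataclass

@dataclass
class Block:
    text: str
    font_size: float

def rebuild_hierarchy(blocks: list[Block], body_size: float = 10.0) -> list[dict]:
    # Distinct font sizes above body text, largest first -> heading levels 1..n
    heading_sizes = sorted({b.font_size for b in blocks if b.font_size > body_size},
                           reverse=True)
    level_of = {size: i + 1 for i, size in enumerate(heading_sizes)}

    path: list[str] = []   # current heading trail, e.g. ["Chapter 2", "2.1 Scope"]
    chunks: list[dict] = []
    for b in blocks:
        level = level_of.get(b.font_size)
        if level is not None:
            del path[level - 1:]          # close any deeper open sections
            path.append(b.text.strip())   # path now has exactly `level` entries
        else:
            # Hierarchy-aware chunk: body text enriched with its section path
            chunks.append({"text": b.text, "section_path": " > ".join(path)})
    return chunks
```

The design choice this illustrates is that the retrieval unit is not raw text but text plus its reconstructed position in the document tree, which is what hierarchy-aware splitters exploit.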
If this is right
- Font-based hierarchy rebuilding outperforms LLM-based structure detection.
- Metadata enrichment and hierarchical splitting raise accuracy substantially over basic loaders.
- Naive GraphRAG without ontological guidance underperforms standard RAG.
- Manual curation sets an upper bound at 97.1%, leaving room for automated gains.
- Including image descriptions during conversion aids performance on documents with visuals.
Where Pith is reading between the lines
- Similar preprocessing emphasis could improve RAG results in other languages or document types.
- Teams should allocate more effort to chunking and enrichment than to switching conversion tools.
- Ontology-guided graph construction may be needed to make GraphRAG competitive.
- Expanding the benchmark or adding human raters would test the stability of the LLM-judge results.
Load-bearing premise
That LLM-as-judge scores on 50 questions reliably measure true downstream question-answering quality without human validation or error bars.
What would settle it
A side-by-side human evaluation of answers from the best automated pipeline and the naive baseline on the same 50 questions.
Original abstract
Retrieval-Augmented Generation (RAG) systems depend critically on the quality of document preprocessing, yet no prior study has evaluated PDF processing frameworks by their impact on downstream question-answering accuracy. We address this gap through a systematic comparison of four open-source PDF-to-Markdown conversion frameworks, Docling, MinerU, Marker, and DeepSeek OCR, across 19 pipeline configurations for extracting text and other contents from PDFs, varying the conversion tool, cleaning transformations, splitting strategy, and metadata enrichment. Evaluation was performed using a manually curated 50-question benchmark over a corpus of 36 Portuguese administrative documents (1,706 pages, ~492K words), with LLM-as-judge scoring averaged over 10 runs. Two baselines bounded the results: naïve PDFLoader (86.9%) and manually curated Markdown (97.1%). Docling with hierarchical splitting and image descriptions achieved the highest automated accuracy (94.1%). Metadata enrichment and hierarchy-aware chunking contributed more to accuracy than the conversion framework choice alone. Font-based hierarchy rebuilding consistently outperformed LLM-based approaches. An exploratory GraphRAG implementation scored only 82%, underperforming basic RAG, suggesting that naïve knowledge graph construction without ontological guidance does not yet justify its added complexity. These findings demonstrate that data preparation quality is the dominant factor in RAG system performance.
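The scoring protocol the abstract describes (a fixed 50-question benchmark, an LLM judge, and accuracies averaged over 10 runs per configuration) can be summarized in a minimal sketch. The callables `answer_with_pipeline` and `judge` are hypothetical placeholders for the authors' pipeline and judge prompt, not their code.

```python
import statistics
from typing import Callable

def evaluate_pipeline(
    pipeline_id: str,
    benchmark: list[dict],  # [{"question": str, "reference": str}, ...]
    answer_with_pipeline: Callable[[str, str], str],
    judge: Callable[[str, str, str], float],  # 1.0 = correct, 0.0 = incorrect
    runs: int = 10,
) -> tuple[float, float]:
    """Score one pipeline configuration, averaged over independent runs."""
    run_scores = []
    for _ in range(runs):
        correct = sum(
            judge(q["question"],
                  answer_with_pipeline(pipeline_id, q["question"]),
                  q["reference"])
            for q in benchmark
        )
        run_scores.append(100.0 * correct / len(benchmark))
    # Mean accuracy in percent plus its run-to-run standard deviation
    return statistics.mean(run_scores), statistics.stdev(run_scores)
```

The run-to-run standard deviation returned here is exactly the quantity the referee report below notes is missing from the paper's tables.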
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript evaluates four open-source PDF-to-Markdown conversion frameworks (Docling, MinerU, Marker, DeepSeek OCR) across 19 pipeline configurations varying conversion tool, cleaning, splitting strategy, and metadata enrichment. On a manually curated 50-question benchmark over 36 Portuguese administrative documents (1,706 pages), LLM-as-judge scoring averaged over 10 runs shows Docling with hierarchical splitting and image descriptions reaching 94.1% accuracy, above a naive PDFLoader baseline (86.9%) but below manually curated Markdown (97.1%). The authors conclude that metadata enrichment and hierarchy-aware chunking contribute more to accuracy than conversion framework choice alone, while an exploratory GraphRAG scores only 82%.
Significance. If the accuracy attributions hold, the work supplies a useful empirical benchmark for RAG preprocessing in domain-specific settings, underscoring that data-preparation choices dominate over tool selection. Strengths include explicit baselines, multiple runs, a held-out question set, and a manually curated gold standard that ground the comparisons.
major comments (2)
- [Evaluation] Evaluation section: the claim that metadata enrichment and hierarchy-aware chunking contributed more to accuracy than conversion framework choice is based solely on LLM-as-judge scores averaged over 10 runs on a fixed 50-question set. No human correlation, inter-annotator agreement, per-question variance, confidence intervals, or significance tests are reported, so observed gaps (e.g., Docling hierarchical+images at 94.1%) could arise from judge preference for particular markdown or chunk formats rather than genuine QA quality.
- [Results] Results section: without statistical tests or error bars on the 10-run averages, it is impossible to determine whether differences across the 19 configurations are reliable or could be explained by judge variability alone.
minor comments (1)
- [Abstract] Abstract: 'naïve' is rendered with an escaped quote; standardize to 'naive' for readability.
Simulated Author's Rebuttal
We thank the referee for the constructive comments on our evaluation methodology. We address the major points below and will incorporate statistical enhancements in the revision to improve rigor.
Point-by-point responses
- Referee: [Evaluation] Evaluation section: the claim that metadata enrichment and hierarchy-aware chunking contributed more to accuracy than conversion framework choice is based solely on LLM-as-judge scores averaged over 10 runs on a fixed 50-question set. No human correlation, inter-annotator agreement, per-question variance, confidence intervals, or significance tests are reported, so observed gaps (e.g., Docling hierarchical+images at 94.1%) could arise from judge preference for particular markdown or chunk formats rather than genuine QA quality.
Authors: We acknowledge that the attribution of greater impact to metadata enrichment and hierarchy-aware chunking rests on patterns observed across the 19 configurations using LLM-as-judge scores. These patterns emerge from controlled variations where hierarchy and metadata were toggled independently of the conversion tool, with consistent gains (e.g., hierarchical splitting outperforming naive chunking across Docling, MinerU, and Marker). To address the concern, we will add per-configuration standard deviations as error bars, report per-question variance in an appendix, include confidence intervals, and apply paired statistical tests (e.g., t-tests with multiple-comparison correction) to the key differences. We will also expand the limitations section to note that LLM-as-judge may introduce format biases and that human correlation studies remain valuable future work. (revision: partial)
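One concrete reading of the promised analysis: paired t-tests on per-run accuracies with Holm correction across comparisons. The accuracy arrays and configuration names below are illustrative placeholders, not the paper's measurements, and pairing assumes runs share the same question set.

```python
import numpy as np
from scipy.stats import ttest_rel
from statsmodels.stats.multitest import multipletests

# Placeholder per-run accuracies (10 runs each); synthetic values only.
docling_hier = np.array([94.3, 93.8, 94.5, 94.0, 94.2, 93.9, 94.4, 94.1, 93.7, 94.1])
pdfloader    = np.array([87.1, 86.5, 87.0, 86.8, 87.2, 86.6, 87.1, 86.9, 86.7, 87.1])
marker_flat  = np.array([91.2, 90.8, 91.5, 91.0, 91.3, 90.9, 91.4, 91.1, 90.7, 91.1])

comparisons = {
    "docling_hier vs pdfloader":   ttest_rel(docling_hier, pdfloader).pvalue,
    "docling_hier vs marker_flat": ttest_rel(docling_hier, marker_flat).pvalue,
}
# Holm correction controls the family-wise error rate across comparisons.
reject, p_corr, _, _ = multipletests(list(comparisons.values()),
                                     alpha=0.05, method="holm")
for name, p, r in zip(comparisons, p_corr, reject):
    print(f"{name}: corrected p = {p:.4g}, significant = {bool(r)}")
```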
- Referee: [Results] Results section: without statistical tests or error bars on the 10-run averages, it is impossible to determine whether differences across the 19 configurations are reliable or could be explained by judge variability alone.
Authors: We agree that the absence of error bars and formal statistical tests limits interpretability of the 10-run averages. In the revised manuscript we will add mean ± standard deviation error bars to all tables and figures, and include appropriate tests (ANOVA followed by post-hoc pairwise comparisons with correction) to establish which differences between the 19 pipelines are statistically significant. This will clarify whether gaps such as the 94.1% vs. 86.9% baseline are robust to judge variability. (revision: yes)
- Declined as out of scope: a full human evaluation study with inter-annotator agreement to correlate against LLM-as-judge scores, which exceeds the scope and resources of the current revision.
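A minimal sketch of the ANOVA-plus-post-hoc analysis promised in the second response, assuming one vector of 10 per-run accuracies per configuration; the randomly generated values stand in for the 19 real pipelines.

```python
import numpy as np
from scipy.stats import f_oneway, tukey_hsd

# Synthetic placeholder data: 10 per-run accuracies for each of 19 configs.
rng = np.random.default_rng(0)
configs = {f"config_{i:02d}": rng.normal(88 + 0.4 * i, 0.5, size=10)
           for i in range(19)}

groups = list(configs.values())
f_stat, p_value = f_oneway(*groups)  # one-way ANOVA across all configurations
print(f"ANOVA: F = {f_stat:.2f}, p = {p_value:.3g}")

if p_value < 0.05:
    # Post-hoc pairwise comparisons with built-in multiple-comparison control
    print(tukey_hsd(*groups))

# Mean +/- standard deviation per configuration, as promised for tables/figures
for name, scores in configs.items():
    print(f"{name}: {scores.mean():.1f} +/- {scores.std(ddof=1):.1f}")
```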
Circularity Check
No circularity: purely empirical benchmark with external baselines
Full rationale
The paper conducts a controlled empirical comparison of four PDF-to-Markdown tools across 19 configurations on a fixed 50-question set, reporting LLM-as-judge accuracies against two external baselines (naïve PDFLoader at 86.9% and manual Markdown at 97.1%). No derivations, equations, or first-principles results exist that could reduce to fitted parameters or self-referential definitions. Conclusions about metadata enrichment and hierarchy-aware chunking are drawn directly from observed accuracy deltas, not from any self-citation chain, uniqueness theorem, or ansatz smuggled via prior work. The methodology is self-contained, evaluating against the held-out question set, and does not rename known results or import load-bearing premises from the authors' own citations.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: LLM-as-judge produces reliable accuracy estimates for RAG outputs
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · match: unclear · claim: "Metadata enrichment and hierarchy-aware chunking contributed more to accuracy than the conversion framework choice alone."
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · match: unclear · claim: "Docling with hierarchical splitting and image descriptions achieved the highest automated accuracy (94.1%)."
Reference graph
Works this paper leans on
- [1] Lewis, P.; Perez, E.; Piktus, A.; Petroni, F.; Karpukhin, V.; Goyal, N.; Küttler, H.; Lewis, M.; Yih, W.-t.; Rocktäschel, T. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS 2020), Online, 2020; pp. 9459-9474, doi:10.48550/arXiv.2005.11401.
- [2] Gao, Y.; Xiong, Y.; Gao, X.; Jia, K.; Pan, J.; Bi, Y.; Dai, Y.; Sun, J.; Wang, M.; Wang, H. Retrieval-Augmented Generation for Large Language Models: A Survey. arXiv 2023, arXiv:2312.10997, doi:10.48550/arXiv.2312.10997.
- [3] Sharma, P.; Bhattarai, S. A review on Retrieval-Augmented Generation: Architectures, research challenges, and emerging frontiers. Journal of Future Artificial Intelligence and Technologies 2026, 2, 616-628, doi:10.62411/faith.3048-3719-297.
- [4] Li, H.; Comesana, A.; Weyandt, C.; Hong, T. A RAG data pipeline transforming heterogeneous data into AI-ready format for autonomous building performance discovery. Advances in Applied Energy 2026, 21, 100261, doi:10.1016/j.adapen.2025.100261.
- [5] Şakar, T.; Emekci, H. Maximizing RAG efficiency: A comparative analysis of RAG methods. Natural Language Processing 2025, 31, 1-25, doi:10.1017/nlp.2024.53.
- [6] Wagner, R.; Kitzelmann, E.; Boersch, I. Mitigating Hallucination by Integrating Knowledge Graphs into LLM Inference – a Systematic Literature Review. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics, Vienna, Austria, July 2025; pp. 795-805, doi:10.18653/v1/2025.acl-srw.53.
- [7] Auer, C.; Lysak, M.; Nassar, A.; Dolfi, M.; Livathinos, N.; Vagenas, P.; Berrospi Ramis, C.; Omenetti, M.; Lindlbauer, F.; Dinkla, K.; et al. Docling Technical Report. arXiv 2024, arXiv:2408.09869, doi:10.48550/arXiv.2408.09869.
- [8] Livathinos, N.; Auer, C.; Lysak, M.; Nassar, A.; Dolfi, M.; Vagenas, P.; Berrospi Ramis, C.; Omenetti, M.; Dinkla, K.; Kim, Y.; et al. Docling: An Efficient Open-Source Toolkit for AI-driven Document Conversion. arXiv 2025, arXiv:2501.17887, doi:10.48550/arXiv.2501.17887.
- [9] Ouyang, L.; Qu, Y.; Zhou, H.; Zhu, J.; Zhang, R.; Lin, Q.; Wang, B.; Zhao, Z.; Jiang, M.; Zhao, X.; et al. OmniDocBench: Benchmarking Diverse PDF Document Parsing with Comprehensive Annotations. In Proceedings of the 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville (TN), USA, June 2025; pp. 24838-24848, doi:10.110...
- [10] Wang, B.; Xu, C.; Zhao, X.; Ouyang, L.; Wu, F.; Zhao, Z.; Xu, R.; Liu, K.; Qu, Y.; Shang, F.; et al. MinerU: An open-source solution for precise document content extraction. arXiv 2024, arXiv:2409.18839, doi:10.48550/arXiv.2409.18839.
- [11] Paruchuri, V.; Kwon, S.; Menta, T.R. marker-pdf 0.3.2: Convert PDF to markdown with high speed and accuracy. Available online: https://github.com/datalab-to/marker (accessed on ...).
- [12] Wei, H.; Sun, Y.; Li, Y. DeepSeek-OCR: Contexts Optical Compression. arXiv 2025, arXiv:2510.18234, doi:10.48550/arXiv.2510.18234.
- [13] Rigal, B.; Dupriez, V.; Mignon, A.; Le Hy, R.; Mery, N. Benchmarking Vision-Language Models for French PDF-to-Markdown Conversion. arXiv 2026, arXiv:2602.11960, doi:10.48550/arXiv.2602.11960.
- [14] Li, Z.; Abulaiti, A.; Lu, Y.; Chen, X.; Zheng, J.; Lin, H.; Han, X.; Sun, L. READoc: A Unified Benchmark for Realistic Document Structured Extraction. arXiv 2024, arXiv:2409.05137, doi:10.48550/arXiv.2409.05137.
- [15] Jiomekong, A.; Tiwari, S. An approach based on Open Research Knowledge Graph for Knowledge Acquisition from scientific papers. arXiv 2023, arXiv:2308.12981, doi:10.48550/arXiv.2308.12981.
- [16] Sun, Q.; Luo, Y.; Zhang, W.; Li, S.; Li, J.; Niu, K.; Kong, X.; Liu, W. Docs2KG: Unified knowledge graph construction from heterogeneous documents assisted by Large Language Models. arXiv 2024, arXiv:2406.02962, doi:10.48550/arXiv.2406.02962.
- [17] Steelman Jr., R.L. Medallion Architecture. In Mastering Snowflake DataOps with DataOps.live: An End-to-End Guide to Modern Data Management; Apress: Berkeley, CA, 2025; pp. 247-264, doi:10.1007/979-8-8688-1754-0_23.
- [18] Ali, M.; Taha, Z.; Morsey, M.M. Ontology-grounded knowledge graphs for mitigating hallucinations in large language models for clinical question answering. Journal of Biomedical Informatics 2026, 175, 104993, doi:10.1016/j.jbi.2026.104993.
- [19] Zeng, L.; Lin, X.; Yu, S. Hallucination-resistant multimodal content generation through knowledge graph-based reinforcement learning. Information Fusion 2026, 127, 103783, doi:10.1016/j.inffus.2025.103783.
- [20] The LOSS-J Project (Locate Organize Summarize Suggest and Justify). LOSS-J Project Repository — Data Lakehouse Architecture. Available online: https://github.com/sousaalexandre/loss-j/blob/main/docs/data-lakehouse-architecture.pdf (accessed on ...).
- [21] LangChain. PyPDFLoader integration. Available online: https://docs.langchain.com/oss/python/integrations/document_loaders/pypdfloader (accessed on ...).
- [22] Es, S.; James, J.; Espinosa Anke, L.; Schockaert, S. RAGAs: Automated Evaluation of Retrieval Augmented Generation. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations, St. Julians, Malta, March 2024; pp. 150-158, doi:10.18653/v1/2024.eacl-demo.16.
- [23] Zheng, L.; Chiang, W.-L.; Sheng, Y.; Zhuang, S.; Wu, Z.; Zhuang, Y.; Lin, Z.; Li, Z.; Li, D.; Xing, E.P.; et al. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. arXiv 2023, arXiv:2306.05685, doi:10.48550/arXiv.2306.05685.
- [24] OpenAI. gpt-oss-120b. Available online: https://huggingface.co/openai/gpt-oss-120b (accessed on ...).
- [25] Hossain, J.; Erdem Sarıyüce, A. Core-based Hierarchies for Efficient GraphRAG. arXiv 2026, arXiv:2603.05207, doi:10.48550/arXiv.2603.05207.