Pith · machine review for the scientific record

arXiv: 2604.04948 · v1 · submitted 2026-03-30 · 💻 cs.IR · cs.AI · cs.LG

From PDF to RAG-Ready: Evaluating Document Conversion Frameworks for Domain-Specific Question Answering

Alexandre Sousa, Brígida Mónica Faria, Henrique Lopes Cardoso, José Duarte, José Guilherme Marques dos Santos, José Luís Reis, José Paulo Marques dos Santos, Luís Paulo Reis, Pedro Pimenta, Ricardo Yang, Rui Humberto Pereira

Pith reviewed 2026-05-14 02:01 UTC · model grok-4.3

classification 💻 cs.IR · cs.AI · cs.LG
keywords RAG · PDF conversion · document preprocessing · question answering · metadata enrichment · hierarchy-aware chunking · GraphRAG · LLM evaluation

The pith

Metadata enrichment and hierarchy-aware chunking improve RAG accuracy more than the choice of PDF conversion framework.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper systematically tests four PDF-to-Markdown tools across 19 pipeline variants on 36 Portuguese administrative documents and a 50-question benchmark. It finds that adding metadata and using hierarchy-aware chunking lifts accuracy to 94.1 percent, well above the 86.9 percent baseline from a naive loader and close to the 97.1 percent from manually curated Markdown. The conversion tool itself matters less than these preprocessing choices. Font-based hierarchy detection beats LLM-based methods, while a basic GraphRAG setup scores only 82 percent.

Core claim

The central claim is that data-preparation quality is the dominant factor in RAG performance for domain-specific question answering: metadata enrichment and hierarchy-aware chunking contribute more to accuracy than the specific PDF conversion framework. The strongest evidence is Docling with hierarchical splitting reaching 94.1 percent, versus lower scores for other tool combinations.

What carries the argument

Hierarchy-aware chunking paired with metadata enrichment, which uses font information to rebuild document structure and produce better retrieval units.
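The page does not reproduce the paper's exact algorithm, but the mechanism can be sketched: treat distinct font sizes above the body size as heading levels, keep a breadcrumb of section titles, and attach that path to each chunk as retrievable metadata. A minimal sketch under stated assumptions (the `spans` input format and the `body_size` threshold are illustrative, not the paper's implementation):

```python
def rebuild_hierarchy(spans, body_size=10.0):
    """Assign heading levels from font sizes (larger font = higher level) and
    emit chunks tagged with their section path."""
    heading_sizes = sorted({s for _, s in spans if s > body_size}, reverse=True)
    level = {size: i + 1 for i, size in enumerate(heading_sizes)}  # 1 = top level
    path, chunks, buffer = [], [], []

    def flush():
        if buffer:
            # Prefix each chunk with its breadcrumb as metadata for retrieval.
            chunks.append({
                "section_path": " > ".join(title for _, title in path),
                "text": " ".join(buffer),
            })
            buffer.clear()

    for text, size in spans:
        if size in level:          # heading span: close current chunk, update path
            flush()
            lvl = level[size]
            while path and path[-1][0] >= lvl:
                path.pop()
            path.append((lvl, text))
        else:                      # body span: accumulate into the current chunk
            buffer.append(text)
    flush()
    return chunks
```

The point of the sketch is that the structural signal comes from layout (font size), not from an LLM pass, which is the contrast the paper draws.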

If this is right

  • Font-based hierarchy rebuilding outperforms LLM-based structure detection.
  • Metadata enrichment and hierarchical splitting raise accuracy substantially over basic loaders.
  • Naive GraphRAG without ontological guidance underperforms standard RAG.
  • Manual curation sets an upper bound at 97.1 percent, leaving room for automated gains.
  • Including image descriptions during conversion aids performance on documents with visuals.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar preprocessing emphasis could improve RAG results in other languages or document types.
  • Teams should allocate more effort to chunking and enrichment than to switching conversion tools.
  • Ontology-guided graph construction may be needed to make GraphRAG competitive.
  • Expanding the benchmark or adding human raters would test the stability of the LLM-judge results.
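On the GraphRAG point, "ontological guidance" could be as simple as validating extracted triples against a schema of permitted relations per entity type before they enter the graph. A hypothetical sketch (the ontology contents, entity types, and triple format are invented for illustration, not taken from the paper):

```python
# Hypothetical mini-ontology: which relations each subject type may emit.
ONTOLOGY = {
    "Organization": {"REGULATES", "ISSUES"},
    "Document": {"CITES"},
}

def filter_triples(triples, entity_types):
    """Drop extracted (subject, relation, object) triples whose subject's
    type does not license the relation under the ontology."""
    return [(s, r, o) for s, r, o in triples
            if r in ONTOLOGY.get(entity_types.get(s, ""), set())]
```

A filter of this shape would discard the spurious edges that naive open-ended extraction tends to produce, which is one concrete reading of why ungated graph construction underperformed here.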

Load-bearing premise

That LLM-as-judge scores on 50 questions reliably measure true downstream question-answering quality without human validation or error bars.

What would settle it

A side-by-side human evaluation of answers from the best automated pipeline and the naive baseline on the same 50 questions.
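If such a human study were run, agreement between the LLM judge and human raters on the 50 binary correctness labels could be summarized with Cohen's kappa. A self-contained sketch (the rater labels in the test are illustrative, not the paper's data):

```python
def cohens_kappa(judge, human):
    """Chance-corrected agreement between two binary raters (1 = answer judged correct)."""
    assert len(judge) == len(human) and judge
    n = len(judge)
    observed = sum(j == h for j, h in zip(judge, human)) / n
    # Expected agreement if the raters were independent with the same marginals.
    p_j, p_h = sum(judge) / n, sum(human) / n
    expected = p_j * p_h + (1 - p_j) * (1 - p_h)
    return 1.0 if expected == 1 else (observed - expected) / (1 - expected)
```

A kappa well above chance on the contested configurations would support the LLM-as-judge premise the review flags below; a low kappa would undercut every reported accuracy gap.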

Figures

Figures reproduced from arXiv: 2604.04948 by Alexandre Sousa, Brígida Mónica Faria, Henrique Lopes Cardoso, José Duarte, José Guilherme Marques dos Santos, José Luís Reis, José Paulo Marques dos Santos, Luís Paulo Reis, Pedro Pimenta, Ricardo Yang, Rui Humberto Pereira.

Figure 1. ETL pipeline workflow. Raw PDFs from the Bronze layer are extracted and transformed into intermediate Markdown (Silver layer), then cleaned and finalized into RAG-ready Markdown with extracted assets (Gold layer). The pipeline supports several configurable transformation options designed to address known issues in framework outputs: HTML table cleaning (converting HTML tables to Markdown tables), LaTeX fo…
Figure 3. Knowledge graph data model. TextChunk nodes store text content with source metadata and embedding vectors. Entity nodes store a unique identifier, name, and semantic type. MENTIONS relationships link text chunks to the entities extracted from them. RELATED relationships capture semantic connections between entities. A semantic deduplication pipeline was subsequently applied to address entity duplication…
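The data model in Figure 3 maps onto two node types and two edge sets. A minimal in-memory sketch, with a crude name-based merge standing in for the paper's semantic deduplication pipeline (field names and the merge rule are assumptions for illustration):

```python
from dataclasses import dataclass, field

@dataclass
class TextChunk:
    chunk_id: str
    text: str
    source: str                          # originating document metadata
    embedding: list = field(default_factory=list)

@dataclass
class Entity:
    entity_id: str
    name: str
    semantic_type: str                   # e.g. "Organization"

@dataclass
class KnowledgeGraph:
    chunks: dict = field(default_factory=dict)
    entities: dict = field(default_factory=dict)
    mentions: set = field(default_factory=set)   # (chunk_id, entity_id)
    related: set = field(default_factory=set)    # (entity_id, entity_id)

    def dedupe_by_name(self):
        """Merge entities sharing a normalized name and rewrite their edges.
        A crude stand-in: the paper's pipeline deduplicates semantically."""
        canon = {}
        for eid, ent in list(self.entities.items()):
            key = ent.name.strip().lower()
            if key in canon:
                keep = canon[key]
                self.mentions = {(c, keep if e == eid else e)
                                 for c, e in self.mentions}
                self.related = {(keep if a == eid else a, keep if b == eid else b)
                                for a, b in self.related}
                del self.entities[eid]
            else:
                canon[key] = eid
```

Entity duplication of exactly this kind (surface-form variants of one real-world entity) is the failure mode the figure's deduplication step exists to address.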
Original abstract

Retrieval-Augmented Generation (RAG) systems depend critically on the quality of document preprocessing, yet no prior study has evaluated PDF processing frameworks by their impact on downstream question-answering accuracy. We address this gap through a systematic comparison of four open-source PDF-to-Markdown conversion frameworks, Docling, MinerU, Marker, and DeepSeek OCR, across 19 pipeline configurations for extracting text and other contents from PDFs, varying the conversion tool, cleaning transformations, splitting strategy, and metadata enrichment. Evaluation was performed using a manually curated 50-question benchmark over a corpus of 36 Portuguese administrative documents (1,706 pages, ~492K words), with LLM-as-judge scoring averaged over 10 runs. Two baselines bounded the results: naïve PDFLoader (86.9%) and manually curated Markdown (97.1%). Docling with hierarchical splitting and image descriptions achieved the highest automated accuracy (94.1%). Metadata enrichment and hierarchy-aware chunking contributed more to accuracy than the conversion framework choice alone. Font-based hierarchy rebuilding consistently outperformed LLM-based approaches. An exploratory GraphRAG implementation scored only 82%, underperforming basic RAG, suggesting that naïve knowledge graph construction without ontological guidance does not yet justify its added complexity. These findings demonstrate that data preparation quality is the dominant factor in RAG system performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript evaluates four open-source PDF-to-Markdown conversion frameworks (Docling, MinerU, Marker, DeepSeek OCR) across 19 pipeline configurations varying conversion tool, cleaning, splitting strategy, and metadata enrichment. On a manually curated 50-question benchmark over 36 Portuguese administrative documents (1,706 pages), LLM-as-judge scoring averaged over 10 runs shows Docling with hierarchical splitting and image descriptions reaching 94.1% accuracy, above a naive PDFLoader baseline (86.9%) but below manually curated Markdown (97.1%). The authors conclude that metadata enrichment and hierarchy-aware chunking contribute more to accuracy than conversion framework choice alone, while an exploratory GraphRAG scores only 82%.

Significance. If the accuracy attributions hold, the work supplies a useful empirical benchmark for RAG preprocessing in domain-specific settings, underscoring that data-preparation choices dominate over tool selection. Strengths include explicit baselines, multiple runs, a held-out question set, and a manually curated gold standard that ground the comparisons.

major comments (2)
  1. [Evaluation] Evaluation section: the claim that metadata enrichment and hierarchy-aware chunking contributed more to accuracy than conversion framework choice is based solely on LLM-as-judge scores averaged over 10 runs on a fixed 50-question set. No human correlation, inter-annotator agreement, per-question variance, confidence intervals, or significance tests are reported, so observed gaps (e.g., Docling hierarchical+images at 94.1%) could arise from judge preference for particular markdown or chunk formats rather than genuine QA quality.
  2. [Results] Results section: without statistical tests or error bars on the 10-run averages, it is impossible to determine whether differences across the 19 configurations are reliable or could be explained by judge variability alone.
minor comments (1)
  1. [Abstract] Abstract: 'naïve' is rendered with an escaped quote; standardize to 'naive' for readability.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive comments on our evaluation methodology. We address the major points below and will incorporate statistical enhancements in the revision to improve rigor.

Point-by-point responses
  1. Referee: [Evaluation] Evaluation section: the claim that metadata enrichment and hierarchy-aware chunking contributed more to accuracy than conversion framework choice is based solely on LLM-as-judge scores averaged over 10 runs on a fixed 50-question set. No human correlation, inter-annotator agreement, per-question variance, confidence intervals, or significance tests are reported, so observed gaps (e.g., Docling hierarchical+images at 94.1%) could arise from judge preference for particular markdown or chunk formats rather than genuine QA quality.

    Authors: We acknowledge that the attribution of greater impact to metadata enrichment and hierarchy-aware chunking rests on patterns observed across the 19 configurations using LLM-as-judge scores. These patterns emerge from controlled variations where hierarchy and metadata were toggled independently of the conversion tool, with consistent gains (e.g., hierarchical splitting outperforming naive chunking across Docling, MinerU, and Marker). To address the concern, we will add per-configuration standard deviations as error bars, report per-question variance in an appendix, include confidence intervals, and apply paired statistical tests (e.g., t-tests with multiple-comparison correction) to the key differences. We will also expand the limitations section to note that LLM-as-judge may introduce format biases and that human correlation studies remain valuable future work. revision: partial

  2. Referee: [Results] Results section: without statistical tests or error bars on the 10-run averages, it is impossible to determine whether differences across the 19 configurations are reliable or could be explained by judge variability alone.

    Authors: We agree that the absence of error bars and formal statistical tests limits interpretability of the 10-run averages. In the revised manuscript we will add mean ± standard deviation error bars to all tables and figures, and include appropriate tests (ANOVA followed by post-hoc pairwise comparisons with correction) to establish which differences between the 19 pipelines are statistically significant. This will clarify whether gaps such as the 94.1% vs. 86.9% baseline are robust to judge variability. revision: yes
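The paired comparison promised here can be illustrated with a paired t-statistic over per-run accuracies of two pipelines. A stdlib-only sketch (the run scores in the test are synthetic, not the paper's results):

```python
import math

def paired_t(a, b):
    """Paired t-statistic for per-run accuracy differences between two pipelines.
    Each list holds one accuracy per evaluation run, runs aligned by index."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    return mean / math.sqrt(var / n)
```

With 10 runs per configuration the statistic has 9 degrees of freedom; pairing by run removes shared judge variability, which is exactly the confound the referee raises.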

standing simulated objections not resolved
  • A full human evaluation study with inter-annotator agreement to correlate against LLM-as-judge scores, which exceeds the scope and resources of the current revision.

Circularity Check

0 steps flagged

No circularity: purely empirical benchmark with external baselines

Full rationale

The paper conducts a controlled empirical comparison of four PDF-to-Markdown tools across 19 configurations on a fixed 50-question set, reporting LLM-as-judge accuracies against two external baselines (naïve PDFLoader at 86.9% and manual Markdown at 97.1%). No derivations, equations, or first-principles results exist that could reduce to fitted parameters or self-referential definitions. Conclusions about metadata enrichment and hierarchy-aware chunking are drawn directly from observed accuracy deltas, not from any self-citation chain, uniqueness theorem, or ansatz smuggled via prior work. The methodology is self-contained against the held-out question set and does not rename known results or import load-bearing premises from the authors' own citations.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The evaluation rests on the standard assumption that LLM-as-judge scores correlate with human judgments of answer quality and that the 50-question set is representative of real user needs in the domain.

axioms (1)
  • domain assumption LLM-as-judge produces reliable accuracy estimates for RAG outputs
    Used to score all 19 configurations and baselines without reported human validation

pith-pipeline@v0.9.0 · 5599 in / 1172 out tokens · 58659 ms · 2026-05-14T02:01:10.447492+00:00 · methodology



Reference graph

Works this paper leans on

25 extracted references · 25 canonical work pages · 4 internal anchors

  1. [1]

    Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks

    Lewis, P.; Perez, E.; Piktus, A.; Petroni, F.; Karpukhin, V.; Goyal, N.; Küttler, H.; Lewis, M.; Yih, W. -t.; Rocktäschel, T. Retrieval-augmented generation for knowledge-intensive NLP tasks. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS 2020), Online Conference, 2020; pp. 9459-9474, doi:10.48550/arXiv.2005.11401

  2. [2]

    Retrieval-Augmented Generation for Large Language Models: A Survey

    Gao, Y.; Xiong, Y.; Gao, X.; Jia, K.; Pan, J.; Bi, Y.; Dai, Y.; Sun, J.; Wang, M.; Wang, H. Retrieval-Augmented Generation for Large Language Models: A Survey. arXiv 2023, arXiv:2312.10997, doi:10.48550/arXiv.2312.10997

  3. [3]

    A review on Retrieval-Augmented Generation: Architectures, research challenges, and emerging frontiers

    Sharma, P.; Bhattarai, S. A review on Retrieval-Augmented Generation: Architectures, research challenges, and emerging frontiers. Journal of Future Artificial Intelligence and Technologies 2026, 2, 616-628, doi:10.62411/faith.3048-3719-297

  4. [4]

    A RAG data pipeline transforming heterogeneous data into AI-ready format for autonomous building performance discovery

    Li, H.; Comesana, A.; Weyandt, C.; Hong, T. A RAG data pipeline transforming heterogeneous data into AI-ready format for autonomous building performance discovery. Advances in Applied Energy 2026, 21, 100261, doi:10.1016/j.adapen.2025.100261

  5. [5]

    Maximizing RAG efficiency: A comparative analysis of RAG methods

    Şakar, T.; Emekci, H. Maximizing RAG efficiency: A comparative analysis of RAG methods. Natural Language Processing 2025, 31, 1-25, doi:10.1017/nlp.2024.53

  6. [6]

    Mitigating Hallucination by Integrating Knowledge Graphs into LLM Inference – a Systematic Literature Review

    Wagner, R.; Kitzelmann, E.; Boersch, I. Mitigating Hallucination by Integrating Knowledge Graphs into LLM Inference – a Systematic Literature Review. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics, Vienna, Austria, July, 2025; pp. 795-805, doi:10.18653/v1/2025.acl-srw.53

  7. [7]

    Docling Technical Report

    Auer, C.; Lysak, M.; Nassar, A.; Dolfi, M.; Livathinos, N.; Vagenas, P.; Berrospi Ramis, C.; Omenetti, M.; Lindlbauer, F.; Dinkla, K.; et al. Docling Technical Report. arXiv 2024, arXiv:2408.09869, doi:10.48550/arXiv.2408.09869

  8. [8]

    Docling: An Efficient Open-Source Toolkit for AI-driven Document Conversion

    Livathinos, N.; Auer, C.; Lysak, M.; Nassar, A.; Dolfi, M.; Vagenas, P.; Berrospi Ramis, C.; Omenetti, M.; Dinkla, K.; Kim, Y.; et al. Docling: An Efficient Open-Source Toolkit for AI-driven Document Conversion. arXiv 2025, arXiv:2501.17887, doi:10.48550/arXiv.2501.17887

  9. [9]

    OmniDocBench: Benchmarking Diverse PDF Document Parsing with Comprehensive Annotations

    Ouyang, L.; Qu, Y.; Zhou, H.; Zhu, J.; Zhang, R.; Lin, Q.; Wang, B.; Zhao, Z.; Jiang, M.; Zhao, X.; et al. OmniDocBench: Benchmarking Diverse PDF Document Parsing with Comprehensive Annotations. In Proceedings of the 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, June 2025; pp. 24838-24848, doi:10.110...

  10. [10]

    MinerU: An Open-Source Solution for Precise Document Content Extraction

    Wang, B.; Xu, C.; Zhao, X.; Ouyang, L.; Wu, F.; Zhao, Z.; Xu, R.; Liu, K.; Qu, Y.; Shang, F.; et al. MinerU: An open-source solution for precise document content extraction. arXiv 2024, arXiv:2409.18839, doi:10.48550/arXiv.2409.18839

  11. [11]

    marker-pdf 0.3.2: Convert PDF to markdown with high speed and accuracy

    Paruchuri, V.; Kwon, S.; Menta, T.R. marker-pdf 0.3.2: Convert PDF to markdown with high speed and accuracy. Available online: https://github.com/datalab-to/marker (accessed on

  12. [12]

    DeepSeek-OCR: Contexts Optical Compression

    Wei, H.; Sun, Y.; Li, Y. DeepSeek-OCR: Contexts Optical Compression. arXiv 2025, arXiv:2510.18234, doi:10.48550/arXiv.2510.18234

  13. [13]

    Benchmarking Vision-Language Models for French PDF-to-Markdown Conversion

    Rigal, B.; Dupriez, V.; Mignon, A.; Le Hy, R.; Mery, N. Benchmarking Vision-Language Models for French PDF-to-Markdown Conversion. arXiv 2026, arXiv:2602.11960, doi:10.48550/arXiv.2602.11960

  14. [14]

    READoc: A Unified Benchmark for Realistic Document Structured Extraction

    Li, Z.; Abulaiti, A.; Lu, Y.; Chen, X.; Zheng, J.; Lin, H.; Han, X.; Sun, L. READoc: A Unified Benchmark for Realistic Document Structured Extraction. arXiv 2024, arXiv:2409.05137, doi:10.48550/arXiv.2409.05137

  15. [15]

    An approach based on Open Research Knowledge Graph for Knowledge Acquisition from scientific papers

    Jiomekong, A.; Tiwari, S. An approach based on Open Research Knowledge Graph for Knowledge Acquisition from scientific papers. arXiv 2023, arXiv:2308.12981, doi:10.48550/arXiv.2308.12981

  16. [16]

    Docs2KG: Unified knowledge graph construction from heterogeneous documents assisted by Large Language Models

    Sun, Q.; Luo, Y.; Zhang, W.; Li, S.; Li, J.; Niu, K.; Kong, X.; Liu, W. Docs2KG: Unified knowledge graph construction from heterogeneous documents assisted by Large Language Models. arXiv 2024, arXiv:2406.02962, doi:10.48550/arXiv.2406.02962

  17. [17]

    Medallion Architecture

    Steelman Jr., R.L. Medallion Architecture. In Mastering Snowflake DataOps with DataOps.live: An End-to-End Guide to Modern Data Management; Apress: Berkeley, CA, 2025; pp. 247-264, doi:10.1007/979-8-8688-1754-0_23

  18. [18]

    Ontology-grounded knowledge graphs for mitigating hallucinations in large language models for clinical question answering

    Ali, M.; Taha, Z.; Morsey, M.M. Ontology-grounded knowledge graphs for mitigating hallucinations in large language models for clinical question answering. Journal of Biomedical Informatics 2026, 175, 104993, doi:10.1016/j.jbi.2026.104993

  19. [19]

    Hallucination-resistant multimodal content generation through knowledge graph-based reinforcement learning

    Zeng, L.; Lin, X.; Yu, S. Hallucination-resistant multimodal content generation through knowledge graph-based reinforcement learning. Information Fusion 2026, 127, 103783, doi:10.1016/j.inffus.2025.103783

  20. [20]

    LOSS-J Project Repository — Data Lakehouse Architecture

    The LOSS-J Project - Locate Organize Summarize Suggest and Justify. LOSS-J Project Repository — Data Lakehouse Architecture. Available online: https://github.com/sousaalexandre/loss-j/blob/main/docs/data-lakehouse-architecture.pdf (accessed on

  21. [21]

    PyPDFLoader integration

    LangChain. PyPDFLoader integration. Available online: https://docs.langchain.com/oss/python/integrations/document_loaders/pypdfloader (accessed on

  22. [22]

    RAGAs: Automated Evaluation of Retrieval Augmented Generation

    Es, S.; James, J.; Espinosa Anke, L.; Schockaert, S. RAGAs: Automated Evaluation of Retrieval Augmented Generation. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations, St. Julians, Malta, March, 2024; pp. 150-158, doi:10.18653/v1/2024.eacl-demo.16

  23. [23]

    Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

    Zheng, L.; Chiang, W.-L.; Sheng, Y.; Zhuang, S.; Wu, Z.; Zhuang, Y.; Lin, Z.; Li, Z.; Li, D.; Xing, E.P.; et al. Judging LLM-as- a-Judge with MT-Bench and Chatbot Arena. arXiv 2023, arXiv:2306.05685, doi:10.48550/arXiv.2306.05685

  24. [24]

    gpt-oss-120b

    OpenAI. gpt-oss-120b. Available online: https://huggingface.co/openai/gpt-oss-120b (accessed on

  25. [25]

    Core-based Hierarchies for Efficient GraphRAG

    Hossain, J.; Erdem Sarıyüce, A. Core-based Hierarchies for Efficient GraphRAG. arXiv 2026, arXiv:2603.05207, doi:10.48550/arXiv.2603.05207