Benchmarking Complex Multimodal Document Processing Pipelines: A Unified Evaluation Framework for Enterprise AI
Pith reviewed 2026-05-07 11:35 UTC · model grok-4.3
The pith
Enterprise document AI pipelines show weak stage correlations and low answer completeness despite high factual accuracy.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
EnterpriseDocBench measures parsing fidelity, indexing efficiency, retrieval relevance, and generation groundedness on the same corpus. Hybrid retrieval reaches nDCG@5 of 0.92, narrowly ahead of BM25 at 0.91 and well above dense embeddings at 0.83. Cross-stage correlations are weak, with r values of 0.14, 0.17, and 0.02; factual accuracy on stated claims is 85.5 percent, yet answer completeness averages only 0.40; and hallucination is higher for both short and very long documents than for medium-length ones.
What carries the argument
EnterpriseDocBench, the end-to-end benchmark that applies consistent metrics for parsing fidelity, indexing efficiency, retrieval relevance, and generation groundedness to the same set of documents.
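The weak cross-stage linkages reported here come from aligning per-document scores from each stage and measuring Pearson r between them. A minimal sketch of that analysis, with made-up score lists standing in for real measurements (variable names and values are illustrative, not from the paper):

```python
from scipy.stats import pearsonr

# Hypothetical per-document scores, aligned by document (same order in each list).
parsing_fidelity    = [0.91, 0.72, 0.88, 0.65, 0.94, 0.81]  # parse-stage quality
retrieval_ndcg5     = [0.80, 0.85, 0.78, 0.90, 0.70, 0.88]  # nDCG@5 over that document's queries
generation_grounded = [0.75, 0.60, 0.82, 0.70, 0.88, 0.66]  # generation groundedness

pairs = {
    "parsing -> retrieval":    (parsing_fidelity, retrieval_ndcg5),
    "parsing -> generation":   (parsing_fidelity, generation_grounded),
    "retrieval -> generation": (retrieval_ndcg5, generation_grounded),
}
for name, (x, y) in pairs.items():
    r, p = pearsonr(x, y)  # the paper reports r = 0.14, 0.17, 0.02 on its corpus
    print(f"{name}: r={r:+.2f} (p={p:.2f})")
```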
If this is right
- Optimizing any single stage in isolation is unlikely to produce large gains in overall pipeline performance.
- Systems must add mechanisms that improve completeness rather than relying on factual accuracy alone.
- Hallucination control strategies need to account for non-monotonic effects of document length.
- Reference architectures such as ColPali and agentic routing require full integration into the benchmark before their end-to-end value can be measured.
Where Pith is reading between the lines
- The low completeness numbers suggest that future pipelines may benefit from explicit planning steps that decide what information to surface.
- Releasing the framework and scripts openly could allow testing on proprietary enterprise collections to check whether the weak correlations persist outside public data.
- The surprise finding that short documents hallucinate more than medium ones could motivate length-specific preprocessing rules in production systems.
Load-bearing premise
The chosen automated proxy metrics for each stage and the public document corpus sufficiently represent real enterprise document processing challenges and user needs.
What would settle it
Finding strong positive correlations between parsing, retrieval, and generation metrics, or completeness scores above 0.7, on the same corpus with comparable pipelines would undermine the claims of weak stage linkages and of the accuracy-completeness gap.
Original abstract
Most enterprise document AI today is a pipeline. Parse, index, retrieve, generate. Each of those stages has been studied to death on its own -- what's still hard is evaluating the system as a whole. We built EnterpriseDocBench to take a swing at it: parsing fidelity, indexing efficiency, retrieval relevance, and generation groundedness, all on the same corpus. The corpus is built from public, permissively licensed documents across six enterprise domains (five represented in the current pilot). We ran three pipelines through it -- BM25, dense embedding, and a hybrid -- all with the same GPT-5 generator. The headline numbers: hybrid retrieval narrowly beats BM25 (nDCG@5 of 0.92 vs. 0.91), and both beat dense embedding (0.83). Hallucination doesn't grow monotonically with document length -- short documents and very long ones both hallucinate more than medium ones (28.1% and 23.8% vs. 9.2%). Cross-stage correlations are very weak: parsing->retrieval r=0.14, parsing->generation r=0.17, retrieval->generation 0.02. If quality were cascading the way most of us assume, those numbers would be much higher; they aren't. Design caveats are real (parsing fixed, generator shared, automated proxy metrics) and we don't oversell the result. One result that genuinely surprised us: factual accuracy on stated claims is 85.5%, but answer completeness averages 0.40. The system is right when it answers -- it just leaves things out. That gap matters more for real deployments than the headline accuracy number does. We also describe three reference architectures (ColPali, ColQwen2, agentic complexity-based routing) which are not yet integrated end-to-end. Framework, metrics, baselines, and collection scripts will be released open-source on acceptance.
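The abstract compares BM25, dense, and hybrid retrieval without stating the fusion rule. One common recipe is a weighted sum of min-max-normalized sparse and dense scores; the sketch below assumes that recipe and hypothetical score dictionaries, and is not necessarily how the paper's hybrid pipeline is built:

```python
def minmax(scores):
    """Scale a dict of doc_id -> score into [0, 1]; a constant list maps to 0."""
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0
    return {d: (s - lo) / span for d, s in scores.items()}

def hybrid_scores(bm25, dense, alpha=0.5):
    """Weighted fusion of normalized BM25 and dense scores.

    `alpha` weights the sparse side; documents missing from one ranker score 0 there.
    This is one common fusion recipe, offered only as an illustration.
    """
    b, d = minmax(bm25), minmax(dense)
    docs = set(b) | set(d)
    return {doc: alpha * b.get(doc, 0.0) + (1 - alpha) * d.get(doc, 0.0) for doc in docs}

# Toy example: the two rankers disagree on ordering; the fusion re-ranks.
bm25_run  = {"doc1": 12.3, "doc2": 9.8, "doc3": 4.1}
dense_run = {"doc2": 0.81, "doc3": 0.77, "doc4": 0.64}
ranked = sorted(hybrid_scores(bm25_run, dense_run).items(), key=lambda x: -x[1])
print(ranked[:5])
```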
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces EnterpriseDocBench, a unified evaluation framework for end-to-end multimodal document processing pipelines (parse, index, retrieve, generate). It uses a public, permissively licensed corpus across six enterprise domains (five in the pilot) to benchmark three retrieval pipelines (BM25, dense embedding, hybrid) with a fixed GPT-5 generator. Headline results include hybrid nDCG@5 of 0.92 narrowly exceeding BM25 (0.91) and dense (0.83), factual accuracy of 85.5% versus average completeness of 0.40, non-monotonic hallucination rates with document length, and weak cross-stage correlations (parsing-retrieval r=0.14, parsing-generation r=0.17, retrieval-generation r=0.02). The work flags design caveats around fixed parsing, shared generator, and automated proxy metrics, and commits to open-sourcing the framework, metrics, baselines, and collection scripts.
Significance. If the proxy metrics are shown to track human judgments of enterprise document tasks, the weak inter-stage correlations would meaningfully challenge the common assumption of cascading quality in pipelines, while the accuracy-completeness gap would highlight a practical deployment priority. The public benchmark and planned open-source release constitute a concrete contribution by enabling reproducible end-to-end evaluation, an area previously dominated by isolated stage studies. These elements provide a foundation for future work even if the current corpus and metrics require further validation.
major comments (3)
- [Abstract] The claim that hybrid retrieval 'narrowly beats' BM25 (nDCG@5 0.92 vs. 0.91) is presented without error bars, confidence intervals, or statistical significance tests; this is load-bearing for the headline retrieval comparison and prevents assessment of whether the 0.01 difference is reliable.
- [Evaluation Framework and Metrics sections] All central claims (weak cross-stage correlations, factual accuracy 85.5% vs. completeness 0.40, non-monotonic hallucination) rest on automated proxy metrics for parsing fidelity, retrieval relevance, and generation groundedness, yet no human correlation, inter-annotator agreement, or comparison to proprietary enterprise data is reported; without this, the proxies cannot be confirmed as valid stand-ins for real enterprise challenges.
- [Corpus Construction section] The public multi-domain corpus is central to generalizability claims, but details on document selection criteria, total size, domain distribution, multimodal handling, and how it represents enterprise document processing are insufficient to evaluate whether results transfer beyond the pilot.
minor comments (2)
- [Abstract] The parenthetical '(five represented in the current pilot)' for the six domains is unclear without naming the domains or specifying the pilot scope.
- [Future Work paragraph] The three reference architectures (ColPali, ColQwen2, agentic complexity-based routing) are listed as not yet integrated; a short note on the technical or scope reasons for exclusion would clarify the current evaluation boundaries.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment below and describe the revisions we will make to improve statistical reporting, metric discussion, and corpus documentation.
Point-by-point responses
-
Referee: [Abstract] The claim that hybrid retrieval 'narrowly beats' BM25 (nDCG@5 0.92 vs. 0.91) is presented without error bars, confidence intervals, or statistical significance tests; this is load-bearing for the headline retrieval comparison and prevents assessment of whether the 0.01 difference is reliable.
Authors: We agree that the small observed difference requires statistical support for reliable interpretation. In the revised manuscript we will add bootstrap confidence intervals for all nDCG@5 scores and report the results of paired significance tests (e.g., Wilcoxon signed-rank) between the three retrieval pipelines. These additions will allow readers to assess whether the 0.01 margin is statistically meaningful (a sketch of this analysis appears after these responses). revision: yes
-
Referee: [Evaluation Framework and Metrics sections] All central claims (weak cross-stage correlations, factual accuracy 85.5% vs. completeness 0.40, non-monotonic hallucination) rest on automated proxy metrics for parsing fidelity, retrieval relevance, and generation groundedness, yet no human correlation, inter-annotator agreement, or comparison to proprietary enterprise data is reported; without this, the proxies cannot be confirmed as valid stand-ins for real enterprise challenges.
Authors: The referee correctly highlights a limitation of the current evaluation. While the paper already cites literature on the correlation of similar automated metrics with human judgments, we did not perform a dedicated human validation study on EnterpriseDocBench. In revision we will expand the Limitations and Metrics sections to discuss the strength of existing proxy validations in the literature and to state explicitly that our headline numbers should be treated as indicative pending direct human correlation. A small-scale human annotation pilot is under consideration for a follow-up release but is not feasible within the current revision timeline. revision: partial
-
Referee: [Corpus Construction section] The public multi-domain corpus is central to generalizability claims, but details on document selection criteria, total size, domain distribution, multimodal handling, and how it represents enterprise document processing are insufficient to evaluate whether results transfer beyond the pilot.
Authors: We acknowledge that the current description of the corpus is high-level. The revised manuscript will include an expanded Corpus Construction section with a table reporting total documents and pages per domain, explicit selection criteria (length, presence of tables/figures, licensing), and statistics on multimodal elements. We will also clarify how the six domains were chosen to reflect common enterprise document types such as reports, contracts, and manuals. revision: yes
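For the statistical additions promised in the first response, here is a sketch of how per-query nDCG@5 scores could be bootstrapped and compared with a paired Wilcoxon signed-rank test. The simulated scores and the 200-query size are placeholders, not the paper's data:

```python
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(0)

# Simulated per-query nDCG@5 scores for two pipelines, paired by query.
ndcg_hybrid = rng.beta(9, 1, size=200)
ndcg_bm25 = np.clip(ndcg_hybrid - rng.normal(0.01, 0.05, size=200), 0, 1)

def bootstrap_ci(scores, n_boot=5000, alpha=0.05):
    """Percentile bootstrap CI for the mean of a per-query score array."""
    means = [rng.choice(scores, size=len(scores), replace=True).mean() for _ in range(n_boot)]
    return np.quantile(means, [alpha / 2, 1 - alpha / 2])

print("hybrid mean nDCG@5, 95% CI:", bootstrap_ci(ndcg_hybrid))
print("bm25   mean nDCG@5, 95% CI:", bootstrap_ci(ndcg_bm25))

# Paired Wilcoxon signed-rank test on per-query differences.
stat, p = wilcoxon(ndcg_hybrid, ndcg_bm25)
print(f"Wilcoxon: statistic={stat:.1f}, p={p:.3f}")
```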
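For the expanded corpus documentation promised in the third response, a minimal sketch of aggregating per-domain document and page counts from a corpus manifest. The domain names and manifest fields are assumptions for illustration, since the paper does not list them:

```python
from collections import defaultdict

# Hypothetical manifest entries: (domain, n_pages, has_tables, has_figures).
manifest = [
    ("finance",    42, True,  True),
    ("finance",    18, True,  False),
    ("legal",     120, False, False),
    ("healthcare",  9, True,  True),
    ("manuals",    55, False, True),
]

stats = defaultdict(lambda: {"docs": 0, "pages": 0, "with_tables": 0, "with_figures": 0})
for domain, pages, tables, figures in manifest:
    s = stats[domain]
    s["docs"] += 1
    s["pages"] += pages
    s["with_tables"] += tables
    s["with_figures"] += figures

# Print a per-domain summary table of the kind the revision promises.
print(f"{'domain':<12}{'docs':>6}{'pages':>8}{'tables':>8}{'figures':>9}")
for domain, s in sorted(stats.items()):
    print(f"{domain:<12}{s['docs']:>6}{s['pages']:>8}{s['with_tables']:>8}{s['with_figures']:>9}")
```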
Circularity Check
No circularity: direct empirical measurements on fixed pipelines
Full rationale
The paper reports results from running three retrieval pipelines (BM25, dense, hybrid) plus a shared generator on a new public corpus, computing standard metrics (nDCG@5, factual accuracy, completeness, Pearson correlations) directly from those runs. No equations, fitted parameters, or predictions are defined in terms of the outputs they claim to derive; the abstract and text explicitly flag design choices (fixed parsing, shared generator, automated proxies) as caveats rather than deriving claims from them. No self-citations or uniqueness theorems are invoked to support central results. The work is self-contained as an empirical benchmark release.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: standard IR metrics such as nDCG@5 are appropriate proxies for retrieval quality in document pipelines (sketched below).
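For reference, a self-contained sketch of the nDCG@5 computation this assumption refers to. It uses a simplified ideal ranking built from the retrieved items' own relevance grades; the full metric normalizes against all relevant items for the query:

```python
import math

def dcg_at_k(relevances, k=5):
    """Discounted cumulative gain over the top-k retrieved items (linear gain)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(retrieved_rels, k=5):
    """nDCG@k: DCG of the system ranking divided by DCG of the ideal ranking."""
    ideal_dcg = dcg_at_k(sorted(retrieved_rels, reverse=True), k)
    return dcg_at_k(retrieved_rels, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Toy graded relevance of the top results returned by a retriever.
print(ndcg_at_k([3, 2, 3, 0, 1]))   # near-ideal ordering -> close to 1.0
print(ndcg_at_k([0, 1, 0, 2, 3]))   # relevant items ranked low -> much smaller
```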
invented entities (1)
- EnterpriseDocBench (no independent evidence)
Reference graph
Works this paper leans on
- [1] Yushi Bai, Xin Lv, Jiajie Zhang, et al. LongBench: A bilingual, multitask benchmark for long context understanding. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), 2024.
- [2] Payal Bajaj et al. MS MARCO: A human generated machine reading comprehension dataset. In NeurIPS Workshop on Cognitive Computation, 2016.
- [3] Yejin Bang et al. HalluLens: LLM hallucination benchmark. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), 2025.
- [4] Jaemin Cho et al. M3DocRAG: Multi-modal multi-page document reasoning and retrieval. In Advances in Neural Information Processing Systems (NeurIPS), 2025.
- [5] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2019.
- [6] Shahul Es, Jithin James, Luis Espinosa-Anke, and Steven Schockaert. RAGAS: Automated evaluation of retrieval-augmented generation. arXiv preprint arXiv:2309.15217, 2023.
- [7] Manuel Faysse, Hugues Sibille, Tony Wu, Bilel Omrani, Gautier Viaud, Céline Hudelot, and Pierre Colombo. ColPali: Efficient document retrieval with vision language models. In International Conference on Learning Representations (ICLR), 2025.
- [8] Luyu Gao, Zhuyun Dai, Panupong Pasupat, et al. RARR: Researching and revising what language models say, using language models. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), 2023.
- [9] Anand Hughes et al. Vectara hallucination evaluation model leaderboard. Technical report, Vectara, 2025.
- [10] Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. Dense passage retrieval for open-domain question answering. In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020.
- [11] Satyapriya Krishna, Kalpesh Krishna, Anhad Mohananey, Steven Schwarcz, Adam Stambler, Shyam Upadhyay, and Manaal Faruqui. Fact, fetch, and reason: A unified evaluation of retrieval-augmented generation. In Proceedings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2025.
- [12] Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, et al. Natural Questions: A benchmark for question answering research. Transactions of the Association for Computational Linguistics (TACL), 2019.
- [13] Philippe Laban et al. Document Haystack: End-to-end evaluation of information retrieval. arXiv preprint arXiv:2503.11438, 2025.
- [14] Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, Ming Zhou, and Zhoujun Li. TableBank: A benchmark dataset for table detection and recognition. In Language Resources and Evaluation Conference (LREC), 2020.
- [15] Hao Liu et al. Large language models in document intelligence: A comprehensive survey, recent advances, challenges and future trends. ACM Transactions on Information Systems, 2026.
- [16] Yang Liu et al. SF-RAG: Structure-fidelity retrieval-augmented generation for academic question answering. arXiv preprint arXiv:2602.13647, 2026.
- [17] Minesh Mathew, Dimosthenis Karatzas, and C. V. Jawahar. DocVQA: A dataset for VQA on document images. In IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2021.
- [18] OpenDataLab. AgenticOCR: Parsing only what you need for efficient retrieval-augmented generation. arXiv preprint arXiv:2602.24134, 2026.
- [19] Linke Ouyang, Bin Wang, Junyuan Zhang, et al. OmniDocBench: Benchmarking diverse PDF document parsing with comprehensive annotations. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025.
- [20] Junyu Qi et al. READoc: A unified benchmark for realistic document structured extraction. In Findings of the Association for Computational Linguistics (ACL), 2025.
- [21] Colin Raffel, Noam Shazeer, Adam Roberts, et al. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research (JMLR), 2020.
- [22] Jon Saad-Falcon, Omar Khattab, Christopher Potts, and Matei Zaharia. ARES: An automated evaluation framework for retrieval-augmented generation systems. In Proceedings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2024.
- [23] Nandan Thakur, Nils Reimers, Andreas Rücklé, Abhishek Srivastava, and Iryna Gurevych. BEIR: A heterogeneous benchmark for zero-shot evaluation of information retrieval models. In Advances in Neural Information Processing Systems (NeurIPS) Datasets and Benchmarks Track, 2021.
- [24] James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. FEVER: A large-scale dataset for fact extraction and verification. In Proceedings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2018.
- [25] Junyuan Zhang, Qintong Zhang, Bin Wang, Linke Ouyang, Zichen Wen, Ying Li, Ka-Ho Chow, Conghui He, and Wentao Zhang. OCR hinders RAG: Evaluating the cascading impact of OCR on retrieval-augmented generation. In IEEE/CVF International Conference on Computer Vision (ICCV), 2025.
- [26] Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. BERTScore: Evaluating text generation with BERT. In International Conference on Learning Representations (ICLR), 2020.
- [27] Zhiyuan Zheng et al. Docling: Efficient document processing for AI. arXiv preprint arXiv:2408.09869, 2024.
- [28] Xu Zhong, Jianbin Tang, and Antonio Jimeno Yepes. PubLayNet: Largest dataset ever for document layout analysis. In International Conference on Document Analysis and Recognition (ICDAR), 2019.
Appendix A, Pipeline Hyperparameters (excerpt): the proposed agentic routing classifier is a logistic regression model trained on query-complexity labels, with planned features including query length in tok...
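The appendix excerpt above describes the proposed agentic router as a logistic regression over query-complexity features, of which only query length in tokens is named before the text cuts off. A minimal sketch under those assumptions; the additional features, example queries, and labels are invented for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def query_features(query: str) -> list[float]:
    """Hand-crafted complexity features; only query length comes from the appendix,
    the other two are illustrative guesses."""
    tokens = query.split()
    return [
        len(tokens),                                                       # query length in tokens
        sum(cue in query.lower() for cue in ("compare", "why", "how many")),  # multi-hop cue words (assumed)
        query.count("?"),                                                  # number of sub-questions (assumed)
    ]

# Hypothetical training data: queries labeled simple (0) or complex (1).
queries = [
    "What is the invoice total?",
    "List the contract parties.",
    "Compare Q3 and Q4 revenue and explain the main drivers of the change.",
    "How many safety incidents were reported, and why did the rate increase?",
]
labels = [0, 0, 1, 1]

X = np.array([query_features(q) for q in queries])
clf = LogisticRegression().fit(X, labels)

route = clf.predict([query_features("Why did operating costs rise across both divisions?")])[0]
print("route to:", "multi-step agentic pipeline" if route else "single-pass retrieval")
```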