Benchmarking Complex Multimodal Document Processing Pipelines: A Unified Evaluation Framework for Enterprise AI
Pith reviewed 2026-05-07 11:35 UTC · model grok-4.3
The pith
Enterprise document AI pipelines show weak stage correlations and low answer completeness despite high factual accuracy.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
EnterpriseDocBench measures parsing fidelity, indexing efficiency, retrieval relevance, and generation groundedness on the same corpus. Hybrid retrieval reaches nDCG@5 of 0.92, narrowly ahead of BM25 at 0.91 and well above dense embeddings at 0.83. Cross-stage correlations are weak, with r values of 0.14, 0.17, and 0.02; factual accuracy on stated claims is 85.5 percent, yet answer completeness averages only 0.40; and hallucination is higher for both short and very long documents than for medium-length ones.
What carries the argument
EnterpriseDocBench, the end-to-end benchmark that applies consistent metrics for parsing fidelity, indexing efficiency, retrieval relevance, and generation groundedness to the same set of documents.
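The weak cross-stage linkages reported here come from aligning per-document scores from each stage and measuring Pearson r between them. A minimal sketch of that analysis, with made-up score lists standing in for real measurements (variable names and values are illustrative, not from the paper):

```python
from scipy.stats import pearsonr

# Hypothetical per-document scores, aligned by document (same order in each list).
parsing_fidelity    = [0.91, 0.72, 0.88, 0.65, 0.94, 0.81]  # parse-stage quality
retrieval_ndcg5     = [0.80, 0.85, 0.78, 0.90, 0.70, 0.88]  # nDCG@5 over that document's queries
generation_grounded = [0.75, 0.60, 0.82, 0.70, 0.88, 0.66]  # generation groundedness

pairs = {
    "parsing -> retrieval":    (parsing_fidelity, retrieval_ndcg5),
    "parsing -> generation":   (parsing_fidelity, generation_grounded),
    "retrieval -> generation": (retrieval_ndcg5, generation_grounded),
}
for name, (x, y) in pairs.items():
    r, p = pearsonr(x, y)  # the paper reports r = 0.14, 0.17, 0.02 on its corpus
    print(f"{name}: r={r:+.2f} (p={p:.2f})")
```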
If this is right
- Optimizing any single stage in isolation is unlikely to produce large gains in overall pipeline performance.
- Systems must add mechanisms that improve completeness rather than relying on factual accuracy alone.
- Hallucination control strategies need to account for non-monotonic effects of document length.
- Reference architectures such as ColPali and agentic routing require full integration into the benchmark before their end-to-end value can be measured.
Where Pith is reading between the lines
- The low completeness numbers suggest that future pipelines may benefit from explicit planning steps that decide what information to surface.
- Releasing the framework and scripts openly could allow testing on proprietary enterprise collections to check whether the weak correlations persist outside public data.
- The surprise finding that short documents hallucinate more than medium ones could motivate length-specific preprocessing rules in production systems.
Load-bearing premise
The chosen automated proxy metrics for each stage and the public document corpus sufficiently represent real enterprise document processing challenges and user needs.
What would settle it
Finding strong positive correlations between parsing, retrieval, and generation metrics, or completeness scores above 0.7, on the same corpus with comparable pipelines would undermine the claims of weak stage linkages and of the accuracy-completeness gap.
Original abstract
Most enterprise document AI today is a pipeline. Parse, index, retrieve, generate. Each of those stages has been studied to death on its own -- what's still hard is evaluating the system as a whole. We built EnterpriseDocBench to take a swing at it: parsing fidelity, indexing efficiency, retrieval relevance, and generation groundedness, all on the same corpus. The corpus is built from public, permissively licensed documents across six enterprise domains (five represented in the current pilot). We ran three pipelines through it -- BM25, dense embedding, and a hybrid -- all with the same GPT-5 generator. The headline numbers: hybrid retrieval narrowly beats BM25 (nDCG@5 of 0.92 vs. 0.91), and both beat dense embedding (0.83). Hallucination doesn't grow monotonically with document length -- short documents and very long ones both hallucinate more than medium ones (28.1% and 23.8% vs. 9.2%). Cross-stage correlations are very weak: parsing->retrieval r=0.14, parsing->generation r=0.17, retrieval->generation 0.02. If quality were cascading the way most of us assume, those numbers would be much higher; they aren't. Design caveats are real (parsing fixed, generator shared, automated proxy metrics) and we don't oversell the result. One result that genuinely surprised us: factual accuracy on stated claims is 85.5%, but answer completeness averages 0.40. The system is right when it answers -- it just leaves things out. That gap matters more for real deployments than the headline accuracy number does. We also describe three reference architectures (ColPali, ColQwen2, agentic complexity-based routing) which are not yet integrated end-to-end. Framework, metrics, baselines, and collection scripts will be released open-source on acceptance.
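The abstract compares BM25, dense, and hybrid retrieval without stating the fusion rule. One common recipe is a weighted sum of min-max-normalized sparse and dense scores; the sketch below assumes that recipe and hypothetical score dictionaries, and is not necessarily how the paper's hybrid pipeline is built:

```python
def minmax(scores):
    """Scale a dict of doc_id -> score into [0, 1]; a constant list maps to 0."""
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0
    return {d: (s - lo) / span for d, s in scores.items()}

def hybrid_scores(bm25, dense, alpha=0.5):
    """Weighted fusion of normalized BM25 and dense scores.

    `alpha` weights the sparse side; documents missing from one ranker score 0 there.
    This is one common fusion recipe, offered only as an illustration.
    """
    b, d = minmax(bm25), minmax(dense)
    docs = set(b) | set(d)
    return {doc: alpha * b.get(doc, 0.0) + (1 - alpha) * d.get(doc, 0.0) for doc in docs}

# Toy example: the two rankers disagree on ordering; the fusion re-ranks.
bm25_run  = {"doc1": 12.3, "doc2": 9.8, "doc3": 4.1}
dense_run = {"doc2": 0.81, "doc3": 0.77, "doc4": 0.64}
ranked = sorted(hybrid_scores(bm25_run, dense_run).items(), key=lambda x: -x[1])
print(ranked[:5])
```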
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces EnterpriseDocBench, a unified evaluation framework for end-to-end multimodal document processing pipelines (parse, index, retrieve, generate). It uses a public, permissively licensed corpus across six enterprise domains (five in the pilot) to benchmark three retrieval pipelines (BM25, dense embedding, hybrid) with a fixed GPT-5 generator. Headline results include hybrid nDCG@5 of 0.92 narrowly exceeding BM25 (0.91) and dense (0.83), factual accuracy of 85.5% versus average completeness of 0.40, non-monotonic hallucination rates with document length, and weak cross-stage correlations (parsing-retrieval r=0.14, parsing-generation r=0.17, retrieval-generation r=0.02). The work flags design caveats around fixed parsing, shared generator, and automated proxy metrics, and commits to open-sourcing the framework, metrics, baselines, and collection scripts.
Significance. If the proxy metrics are shown to track human judgments of enterprise document tasks, the weak inter-stage correlations would meaningfully challenge the common assumption of cascading quality in pipelines, while the accuracy-completeness gap would highlight a practical deployment priority. The public benchmark and planned open-source release constitute a concrete contribution by enabling reproducible end-to-end evaluation, an area previously dominated by isolated stage studies. These elements provide a foundation for future work even if the current corpus and metrics require further validation.
major comments (3)
- [Abstract] The claim that hybrid retrieval 'narrowly beats' BM25 (nDCG@5 0.92 vs. 0.91) is presented without error bars, confidence intervals, or statistical significance tests; this is load-bearing for the headline retrieval comparison and prevents assessment of whether the 0.01 difference is reliable.
- [Evaluation Framework and Metrics sections] All central claims (weak cross-stage correlations, factual accuracy 85.5% vs. completeness 0.40, non-monotonic hallucination) rest on automated proxy metrics for parsing fidelity, retrieval relevance, and generation groundedness, yet no human correlation, inter-annotator agreement, or comparison to proprietary enterprise data is reported; without this, the proxies cannot be confirmed as valid stand-ins for real enterprise challenges.
- [Corpus Construction section] The public multi-domain corpus is central to generalizability claims, but details on document selection criteria, total size, domain distribution, multimodal handling, and how it represents enterprise document processing are insufficient to evaluate whether results transfer beyond the pilot.
minor comments (2)
- [Abstract] The parenthetical '(five represented in the current pilot)' for the six domains is unclear without naming the domains or specifying the pilot scope.
- [Future Work paragraph] The three reference architectures (ColPali, ColQwen2, agentic complexity-based routing) are listed as not yet integrated; a short note on the technical or scope reasons for exclusion would clarify the current evaluation boundaries.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment below and describe the revisions we will make to improve statistical reporting, metric discussion, and corpus documentation.
Point-by-point responses
-
Referee: [Abstract] The claim that hybrid retrieval 'narrowly beats' BM25 (nDCG@5 0.92 vs. 0.91) is presented without error bars, confidence intervals, or statistical significance tests; this is load-bearing for the headline retrieval comparison and prevents assessment of whether the 0.01 difference is reliable.
Authors: We agree that the small observed difference requires statistical support for reliable interpretation. In the revised manuscript we will add bootstrap confidence intervals for all nDCG@5 scores and report the results of paired significance tests (e.g., Wilcoxon signed-rank) between the three retrieval pipelines. These additions will allow readers to assess whether the 0.01 margin is statistically meaningful (a sketch of this analysis appears after these responses). revision: yes
-
Referee: [Evaluation Framework and Metrics sections] All central claims (weak cross-stage correlations, factual accuracy 85.5% vs. completeness 0.40, non-monotonic hallucination) rest on automated proxy metrics for parsing fidelity, retrieval relevance, and generation groundedness, yet no human correlation, inter-annotator agreement, or comparison to proprietary enterprise data is reported; without this, the proxies cannot be confirmed as valid stand-ins for real enterprise challenges.
Authors: The referee correctly highlights a limitation of the current evaluation. While the paper already cites literature on the correlation of similar automated metrics with human judgments, we did not perform a dedicated human validation study on EnterpriseDocBench. In revision we will expand the Limitations and Metrics sections to discuss the strength of existing proxy validations in the literature and to state explicitly that our headline numbers should be treated as indicative pending direct human correlation. A small-scale human annotation pilot is under consideration for a follow-up release but is not feasible within the current revision timeline. revision: partial
-
Referee: [Corpus Construction section] The public multi-domain corpus is central to generalizability claims, but details on document selection criteria, total size, domain distribution, multimodal handling, and how it represents enterprise document processing are insufficient to evaluate whether results transfer beyond the pilot.
Authors: We acknowledge that the current description of the corpus is high-level. The revised manuscript will include an expanded Corpus Construction section with a table reporting total documents and pages per domain, explicit selection criteria (length, presence of tables/figures, licensing), and statistics on multimodal elements. We will also clarify how the six domains were chosen to reflect common enterprise document types such as reports, contracts, and manuals. revision: yes
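For the statistical additions promised in the first response, here is a sketch of how per-query nDCG@5 scores could be bootstrapped and compared with a paired Wilcoxon signed-rank test. The simulated scores and the 200-query size are placeholders, not the paper's data:

```python
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(0)

# Simulated per-query nDCG@5 scores for two pipelines, paired by query.
ndcg_hybrid = rng.beta(9, 1, size=200)
ndcg_bm25 = np.clip(ndcg_hybrid - rng.normal(0.01, 0.05, size=200), 0, 1)

def bootstrap_ci(scores, n_boot=5000, alpha=0.05):
    """Percentile bootstrap CI for the mean of a per-query score array."""
    means = [rng.choice(scores, size=len(scores), replace=True).mean() for _ in range(n_boot)]
    return np.quantile(means, [alpha / 2, 1 - alpha / 2])

print("hybrid mean nDCG@5, 95% CI:", bootstrap_ci(ndcg_hybrid))
print("bm25   mean nDCG@5, 95% CI:", bootstrap_ci(ndcg_bm25))

# Paired Wilcoxon signed-rank test on per-query differences.
stat, p = wilcoxon(ndcg_hybrid, ndcg_bm25)
print(f"Wilcoxon: statistic={stat:.1f}, p={p:.3f}")
```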
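For the expanded corpus documentation promised in the third response, a minimal sketch of aggregating per-domain document and page counts from a corpus manifest. The domain names and manifest fields are assumptions for illustration, since the paper does not list them:

```python
from collections import defaultdict

# Hypothetical manifest entries: (domain, n_pages, has_tables, has_figures).
manifest = [
    ("finance",    42, True,  True),
    ("finance",    18, True,  False),
    ("legal",     120, False, False),
    ("healthcare",  9, True,  True),
    ("manuals",    55, False, True),
]

stats = defaultdict(lambda: {"docs": 0, "pages": 0, "with_tables": 0, "with_figures": 0})
for domain, pages, tables, figures in manifest:
    s = stats[domain]
    s["docs"] += 1
    s["pages"] += pages
    s["with_tables"] += tables
    s["with_figures"] += figures

# Print a per-domain summary table of the kind the revision promises.
print(f"{'domain':<12}{'docs':>6}{'pages':>8}{'tables':>8}{'figures':>9}")
for domain, s in sorted(stats.items()):
    print(f"{domain:<12}{s['docs']:>6}{s['pages']:>8}{s['with_tables']:>8}{s['with_figures']:>9}")
```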
Circularity Check
No circularity: direct empirical measurements on fixed pipelines
Full rationale
The paper reports results from running three retrieval pipelines (BM25, dense, hybrid) plus a shared generator on a new public corpus, computing standard metrics (nDCG@5, factual accuracy, completeness, Pearson correlations) directly from those runs. No equations, fitted parameters, or predictions are defined in terms of the outputs they claim to derive; the abstract and text explicitly flag design choices (fixed parsing, shared generator, automated proxies) as caveats rather than deriving claims from them. No self-citations or uniqueness theorems are invoked to support central results. The work is self-contained as an empirical benchmark release.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: standard IR metrics such as nDCG@5 are appropriate proxies for retrieval quality in document pipelines (sketched below).
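For reference, a self-contained sketch of the nDCG@5 computation this assumption refers to. It uses a simplified ideal ranking built from the retrieved items' own relevance grades; the full metric normalizes against all relevant items for the query:

```python
import math

def dcg_at_k(relevances, k=5):
    """Discounted cumulative gain over the top-k retrieved items (linear gain)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(retrieved_rels, k=5):
    """nDCG@k: DCG of the system ranking divided by DCG of the ideal ranking."""
    ideal_dcg = dcg_at_k(sorted(retrieved_rels, reverse=True), k)
    return dcg_at_k(retrieved_rels, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Toy graded relevance of the top results returned by a retriever.
print(ndcg_at_k([3, 2, 3, 0, 1]))   # near-ideal ordering -> close to 1.0
print(ndcg_at_k([0, 1, 0, 2, 3]))   # relevant items ranked low -> much smaller
```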
invented entities (1)
- EnterpriseDocBench (no independent evidence)
Reference graph
Works this paper leans on
- [1] Yushi Bai, Xin Lv, Jiajie Zhang, et al. LongBench: A bilingual, multitask benchmark for long context understanding. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), 2024.
- [2] Payal Bajaj et al. MS MARCO: A human generated machine reading comprehension dataset. In NeurIPS Workshop on Cognitive Computation, 2016.
- [3] Yejin Bang et al. HalluLens: LLM hallucination benchmark. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), 2025.
- [4] Jaemin Cho et al. M3DocRAG: Multi-modal multi-page document reasoning and retrieval. In Advances in Neural Information Processing Systems (NeurIPS), 2025.
- [5] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2019.
- [6] Shahul Es, Jithin James, Luis Espinosa-Anke, and Steven Schockaert. RAGAS: Automated evaluation of retrieval-augmented generation. arXiv preprint arXiv:2309.15217, 2023.
- [7] Manuel Faysse, Hugues Sibille, Tony Wu, Bilel Omrani, Gautier Viaud, Céline Hudelot, and Pierre Colombo. ColPali: Efficient document retrieval with vision language models. In International Conference on Learning Representations (ICLR), 2025.
- [8] Luyu Gao, Zhuyun Dai, Panupong Pasupat, et al. RARR: Researching and revising what language models say, using language models. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), 2023.
- [9] Anand Hughes et al. Vectara hallucination evaluation model leaderboard. Technical report, Vectara, 2025.
- [10] Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. Dense passage retrieval for open-domain question answering. In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020.
- [11] Satyapriya Krishna, Kalpesh Krishna, Anhad Mohananey, Steven Schwarcz, Adam Stambler, Shyam Upadhyay, and Manaal Faruqui. Fact, fetch, and reason: A unified evaluation of retrieval-augmented generation. In Proceedings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2025.
- [12] Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, et al. Natural Questions: A benchmark for question answering research. Transactions of the Association for Computational Linguistics (TACL), 2019.
- [13] Philippe Laban et al. Document Haystack: End-to-end evaluation of information retrieval. arXiv preprint arXiv:2503.11438, 2025.
- [14] Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, Ming Zhou, and Zhoujun Li. TableBank: A benchmark dataset for table detection and recognition. In Language Resources and Evaluation Conference (LREC), 2020.
- [15] Hao Liu et al. Large language models in document intelligence: A comprehensive survey, recent advances, challenges and future trends. ACM Transactions on Information Systems, 2026.
- [16] Yang Liu et al. SF-RAG: Structure-fidelity retrieval-augmented generation for academic question answering. arXiv preprint arXiv:2602.13647, 2026.
- [17] Minesh Mathew, Dimosthenis Karatzas, and C. V. Jawahar. DocVQA: A dataset for VQA on document images. In IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2021.
- [18] OpenDataLab. AgenticOCR: Parsing only what you need for efficient retrieval-augmented generation. arXiv preprint arXiv:2602.24134, 2026.
- [19] Linke Ouyang, Bin Wang, Junyuan Zhang, et al. OmniDocBench: Benchmarking diverse PDF document parsing with comprehensive annotations. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025.
- [20] Junyu Qi et al. READoc: A unified benchmark for realistic document structured extraction. In Findings of the Association for Computational Linguistics (ACL), 2025.
- [21] Colin Raffel, Noam Shazeer, Adam Roberts, et al. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research (JMLR), 2020.
- [22] Jon Saad-Falcon, Omar Khattab, Christopher Potts, and Matei Zaharia. ARES: An automated evaluation framework for retrieval-augmented generation systems. In Proceedings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2024.
- [23] Nandan Thakur, Nils Reimers, Andreas Rücklé, Abhishek Srivastava, and Iryna Gurevych. BEIR: A heterogeneous benchmark for zero-shot evaluation of information retrieval models. In Advances in Neural Information Processing Systems (NeurIPS) Datasets and Benchmarks Track, 2021.
- [24] James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. FEVER: A large-scale dataset for fact extraction and verification. In Proceedings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2018.
- [25] Junyuan Zhang, Qintong Zhang, Bin Wang, Linke Ouyang, Zichen Wen, Ying Li, Ka-Ho Chow, Conghui He, and Wentao Zhang. OCR hinders RAG: Evaluating the cascading impact of OCR on retrieval-augmented generation. In IEEE/CVF International Conference on Computer Vision (ICCV), 2025.
- [26] Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. BERTScore: Evaluating text generation with BERT. In International Conference on Learning Representations (ICLR), 2020.
- [27] Zhiyuan Zheng et al. Docling: Efficient document processing for AI. arXiv preprint arXiv:2408.09869, 2024.
- [28] Xu Zhong, Jianbin Tang, and Antonio Jimeno Yepes. PubLayNet: Largest dataset ever for document layout analysis. In International Conference on Document Analysis and Recognition (ICDAR), 2019.
Appendix A, Pipeline Hyperparameters (excerpt): the proposed agentic routing classifier is a logistic regression model trained on query-complexity labels, with planned features including query length in tok...
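The appendix excerpt above describes the proposed agentic router as a logistic regression over query-complexity features, of which only query length in tokens is named before the text cuts off. A minimal sketch under those assumptions; the additional features, example queries, and labels are invented for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def query_features(query: str) -> list[float]:
    """Hand-crafted complexity features; only query length comes from the appendix,
    the other two are illustrative guesses."""
    tokens = query.split()
    return [
        len(tokens),                                                       # query length in tokens
        sum(cue in query.lower() for cue in ("compare", "why", "how many")),  # multi-hop cue words (assumed)
        query.count("?"),                                                  # number of sub-questions (assumed)
    ]

# Hypothetical training data: queries labeled simple (0) or complex (1).
queries = [
    "What is the invoice total?",
    "List the contract parties.",
    "Compare Q3 and Q4 revenue and explain the main drivers of the change.",
    "How many safety incidents were reported, and why did the rate increase?",
]
labels = [0, 0, 1, 1]

X = np.array([query_features(q) for q in queries])
clf = LogisticRegression().fit(X, labels)

route = clf.predict([query_features("Why did operating costs rise across both divisions?")])[0]
print("route to:", "multi-step agentic pipeline" if route else "single-pass retrieval")
```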