EnterpriseRAG-Bench: A RAG Benchmark for Company Internal Knowledge
Pith reviewed 2026-05-21 08:02 UTC · model grok-4.3
The pith
A synthetic benchmark of 500,000 enterprise documents and 500 questions tests RAG systems on realistic company-internal data.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We present EnterpriseRAG-Bench, a dataset consisting of approximately 500,000 documents spanning nine enterprise source types (Slack, Gmail, Linear, Google Drive, HubSpot, Fireflies, GitHub, Jira, and Confluence) and 500 questions across ten categories that test distinct retrieval and reasoning capabilities. The corpus is generated with cross-document coherence grounded in shared projects, people, and initiatives and augmented with realistic noise such as misfiled documents, near-duplicates, and conflicting information. The question set ranges from simple single-document lookups to multi-document reasoning, constrained retrieval, conflict resolution, and recognizing when information is absent.
What carries the argument
The synthetic corpus generation framework that enforces cross-document coherence through shared projects, people, and initiatives while injecting realistic noise including misfiled items, near-duplicates, and contradictions.
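To make the noise-injection idea concrete, here is a minimal Python sketch of how misfiled documents and near-duplicates might be added to an otherwise coherent corpus. It is not the paper's framework: the `Doc` fields, the `inject_noise` function, the default rates, and the " (fwd)" perturbation are all assumptions chosen for illustration.

```python
import random
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Doc:
    # Hypothetical document record; field names are illustrative only.
    doc_id: str
    source: str   # e.g. "slack", "jira"
    project: str  # shared entity that ties documents together
    text: str


def inject_noise(docs, rng, misfile_rate=0.05, dup_rate=0.05):
    """Return the corpus with misfiled documents and near-duplicates added.

    The rates are placeholders, not values reported by the paper.
    """
    projects = sorted({d.project for d in docs})
    out = []
    for d in docs:
        if rng.random() < misfile_rate:
            # Misfiled: content unchanged, attached to the wrong project.
            others = [p for p in projects if p != d.project]
            if others:
                d = replace(d, project=rng.choice(others))
        out.append(d)
        if rng.random() < dup_rate:
            # Near-duplicate: new id plus a small textual perturbation.
            out.append(replace(d, doc_id=d.doc_id + "-dup", text=d.text + " (fwd)"))
    return out
```

Passing a seeded `random.Random` keeps a generated variant reproducible, which matters if results on it are to be compared across systems.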
If this is right
- RAG developers gain a standardized testbed for measuring performance on multi-hop reasoning and conflict handling inside proprietary data environments.
- The released evaluation harness and public leaderboard enable direct comparison of retrieval methods on enterprise-style tasks.
- Organizations can reuse the generation framework to produce custom variants matched to their own source mix, scale, and industry.
- Questions that explicitly probe for absent information highlight when current systems should abstain rather than hallucinate.
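The abstain-versus-hallucinate distinction in the last point can be sketched as a small scoring routine. This is an illustrative assumption, not the released evaluation harness: the `ABSTAIN` sentinel, the `score` function, the outcome names, and exact-match comparison are all hypothetical simplifications.

```python
ABSTAIN = "NO_ANSWER"  # hypothetical sentinel for "the corpus does not contain this"


def score(predictions, gold):
    """Tally outcomes per question.

    gold maps question id to a reference answer, or to ABSTAIN when the
    answer is absent from the corpus. A missing prediction counts as abstaining.
    """
    counts = {"correct": 0, "wrong": 0, "correct_abstain": 0,
              "hallucinated": 0, "missed": 0}
    for qid, ref in gold.items():
        pred = predictions.get(qid, ABSTAIN)
        if ref == ABSTAIN:
            # Any concrete answer to an unanswerable question is a hallucination.
            counts["correct_abstain" if pred == ABSTAIN else "hallucinated"] += 1
        elif pred == ABSTAIN:
            counts["missed"] += 1
        elif pred.strip().lower() == ref.strip().lower():
            counts["correct"] += 1
        else:
            counts["wrong"] += 1
    return counts
```

Keeping "hallucinated" and "missed" as separate counts makes it visible when a system buys fewer hallucinations by abstaining on questions it could have answered.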
Where Pith is reading between the lines
- Widespread use of the benchmark could shift research emphasis toward retrieval methods that tolerate the partial and contradictory records typical of internal systems.
- Future extensions might incorporate access-control rules or time-stamped evolution of documents to mirror enterprise security and version history.
- The resource may serve as a seed for domain-specific variants, for example in regulated industries where document provenance and audit trails are required.
Load-bearing premise
The generated documents and noise patterns sufficiently resemble the structure and inconsistencies found in real company-internal knowledge bases.
What would settle it
A side-by-side statistical comparison of the synthetic corpus against anonymized logs from an actual enterprise that reveals markedly different frequencies of cross-document links, conflict types, or document misplacement.
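One simple way to quantify "markedly different frequencies" is the total variation distance between two categorical distributions, for example the mix of conflict types in the synthetic corpus versus a real one. The sketch below is an assumption about how such a comparison could be run; the function name and the category labels are hypothetical, and the paper does not prescribe this statistic.

```python
from collections import Counter


def total_variation(sample_a, sample_b):
    """Total variation distance between the empirical distributions of two
    label samples: 0.0 means identical frequencies, 1.0 means disjoint support."""
    if not sample_a or not sample_b:
        raise ValueError("both samples must be non-empty")
    counts_a, counts_b = Counter(sample_a), Counter(sample_b)
    n_a, n_b = len(sample_a), len(sample_b)
    labels = set(counts_a) | set(counts_b)
    # Counter returns 0 for labels missing from one side.
    return 0.5 * sum(abs(counts_a[k] / n_a - counts_b[k] / n_b) for k in labels)
```

For instance, comparing `["date", "date", "owner", "owner"]` against `["date", "owner", "owner", "owner"]` gives 0.25. What threshold would count as "markedly different" is a judgment the comparison itself does not settle.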
Original abstract
Retrieval-Augmented Generation (RAG) has become the standard approach for grounding large language models in information that was not available during training. While existing datasets and benchmarks focus on web or other public sources, there is still no widely adopted dataset that realistically reflects the nature of company-internal knowledge. Meanwhile, startups, enterprises, and researchers are increasingly developing AI Agents designed to operate over exactly this kind of proprietary data. To close this gap, we release a synthetic enterprise corpus, its generation framework, and a leaderboard. We present EnterpriseRAG-Bench, a dataset consisting of approximately 500,000 documents spanning nine enterprise source types (Slack, Gmail, Linear, Google Drive, HubSpot, Fireflies, GitHub, Jira, and Confluence) and 500 questions across ten categories that test distinct retrieval and reasoning capabilities. The corpus is generated with cross-document coherence (grounded in shared projects, people, and initiatives) and augmented with realistic noise such as misfiled documents, near-duplicates, and conflicting information. The question set ranges from simple single-document lookups to multi-document reasoning, constrained retrieval, conflict resolution, and recognizing when information is absent. The generation framework lets teams generate variants tailored to their own industry, scale, and source mix. The dataset, code, evaluation harness, and leaderboard are available at https://github.com/onyx-dot-app/EnterpriseRAG-Bench.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents EnterpriseRAG-Bench, a synthetic dataset and benchmark for evaluating Retrieval-Augmented Generation (RAG) systems on company-internal knowledge. It consists of approximately 500,000 documents spanning nine enterprise source types (Slack, Gmail, Linear, Google Drive, HubSpot, Fireflies, GitHub, Jira, and Confluence) generated with cross-document coherence grounded in shared projects, people, and initiatives, plus realistic noise including misfiled documents, near-duplicates, and conflicting information. The benchmark also includes 500 questions across ten categories testing capabilities from single-document lookup to multi-document reasoning, constrained retrieval, conflict resolution, and detecting absent information, along with a customizable generation framework, evaluation harness, and leaderboard released on GitHub.
Significance. If the synthetic corpus construction holds as a faithful model of real enterprise data distributions, the benchmark would address a clear gap in existing RAG evaluation resources that focus primarily on public web data. The open release of the generation framework for industry-specific customization and the leaderboard promote reproducibility and community use, which are notable strengths for a data and benchmark contribution in information retrieval.
major comments (2)
- [Abstract] Abstract: The claim that the corpus 'accurately reflects the nature of company-internal knowledge' through cross-document coherence and injected noise (misfiled documents, near-duplicates, conflicting information) is load-bearing for the benchmark's utility, yet the manuscript provides no quantitative fidelity metrics such as distributional comparisons of entity graphs, conflict rates, or retrieval difficulty against anonymized real enterprise logs.
- [Dataset Construction] Dataset description: No blinded expert ratings from practitioners or external validation experiments are reported to confirm that the synthetic noise and coherence properties match observed distributions in actual company internal knowledge bases across the nine source types.
minor comments (1)
- [Abstract] The GitHub repository link is provided for the dataset, code, and leaderboard, which supports reproducibility; however, the main text could include a brief summary of the repository contents and usage instructions for clarity.
Simulated Author's Rebuttal
We thank the referee for their positive assessment of the benchmark's significance and for highlighting areas where additional validation would be beneficial. We address the major comments point by point below.
-
Referee: [Abstract] Abstract: The claim that the corpus 'accurately reflects the nature of company-internal knowledge' through cross-document coherence and injected noise (misfiled documents, near-duplicates, conflicting information) is load-bearing for the benchmark's utility, yet the manuscript provides no quantitative fidelity metrics such as distributional comparisons of entity graphs, conflict rates, or retrieval difficulty against anonymized real enterprise logs.
Authors: We concur that quantitative fidelity metrics comparing the synthetic corpus to real enterprise data would provide valuable support for the benchmark's claims. However, such comparisons are challenging because anonymized real enterprise logs are typically not available for research purposes due to privacy regulations and competitive sensitivities. This limitation is in fact a key reason for developing synthetic alternatives. The corpus was constructed using heuristics derived from publicly documented characteristics of enterprise data and input from industry experts. In the revised manuscript, we will expand the dataset construction section to include more details on these design choices and report internal statistics on the generated noise levels, such as the proportion of conflicting information and near-duplicates. revision: yes
-
Referee: [Dataset Construction] Dataset description: No blinded expert ratings from practitioners or external validation experiments are reported to confirm that the synthetic noise and coherence properties match observed distributions in actual company internal knowledge bases across the nine source types.
Authors: We recognize the value of blinded expert ratings and external validation for confirming the realism of the synthetic data. The current manuscript focuses on the release of the benchmark, generation framework, and initial evaluation harness. We did not include such ratings in this version to prioritize timely release and community access. We will revise the paper to include a limitations section acknowledging this and outlining plans for future validation studies. Additionally, the open-source framework allows practitioners to perform their own validations tailored to specific company contexts. revision: partial
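The internal noise statistics mentioned in the first response, such as the proportion of near-duplicates, could be estimated along the following lines. This is a sketch under stated assumptions, not the authors' procedure: word-shingle Jaccard similarity, the shingle size of 3, and the 0.8 threshold are all illustrative choices.

```python
def shingles(text, k=3):
    """Set of k-word shingles; texts shorter than k words yield one shingle."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def near_duplicate_rate(texts, threshold=0.8):
    """Fraction of documents that have at least one near-duplicate partner."""
    sets = [shingles(t) for t in texts]
    flagged = set()
    for i in range(len(sets)):
        for j in range(i + 1, len(sets)):
            if jaccard(sets[i], sets[j]) >= threshold:
                flagged.update((i, j))
    return len(flagged) / len(texts) if texts else 0.0
```

The all-pairs loop is quadratic in the number of documents, so it only suits samples. At the scale of roughly 500,000 documents, an approximate method such as MinHash with locality-sensitive hashing would be the more realistic choice.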
Circularity Check
Benchmark dataset release contains no derivation chain or self-referential predictions
The paper releases a synthetic corpus, generation framework, and question set for enterprise RAG evaluation. No equations, fitted parameters, predictions, or first-principles derivations appear in the abstract or described structure. The central contribution is the artifact and its generation process itself rather than any computed result that could reduce to inputs by construction. Claims about cross-document coherence and injected noise are modeling choices whose fidelity is external to the paper, not a circular reduction of a result to its own assumptions.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Synthetic generation with cross-document coherence and added noise can produce data that realistically reflects company-internal knowledge.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "We present EnterpriseRAG-Bench, a dataset consisting of approximately 500,000 documents spanning nine enterprise source types ... and 500 questions across ten categories"
- IndisputableMonolith/Foundation/ArithmeticFromLogic.lean: absolute_floor_iff_bare_distinguishability (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "The corpus is generated with cross-document coherence ... augmented with realistic noise such as misfiled documents, near-duplicates, and conflicting information"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.