EnterpriseRAG-Bench: A RAG Benchmark for Company Internal Knowledge

Chris Weaver; Joachim Rahmfeld; Mark H. Butler; Roshan Desai; Weijia Chen; Wenxi Huang; Yuhong Sun

arxiv: 2605.05253 · v2 · pith:XHRQXUERnew · submitted 2026-05-05 · 💻 cs.IR

Yuhong Sun, Joachim Rahmfeld, Chris Weaver, Weijia Chen, Roshan Desai, Wenxi Huang, Mark H. Butler

Pith reviewed 2026-05-21 08:02 UTC · model grok-4.3

keywords: RAG benchmark · enterprise knowledge · synthetic dataset · retrieval-augmented generation · internal documents · multi-document reasoning · AI agents

The pith

A synthetic benchmark of 500,000 enterprise documents and 500 questions tests RAG systems on realistic company-internal data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to create the first widely usable benchmark that models the messy, interconnected nature of proprietary corporate information rather than public web pages. Existing RAG datasets leave a gap because companies now run AI agents over their own Slack threads, emails, project tickets, and shared drives. The new resource supplies roughly half a million documents drawn from nine common internal sources, together with questions that range from single-fact lookup to multi-document reasoning, conflict resolution, and recognition of missing information. A generation framework is included so that organizations can produce scaled or industry-specific variants while preserving cross-document links and realistic noise. If the benchmark proves representative, it supplies a concrete yardstick for measuring and improving retrieval-augmented systems that must operate inside actual enterprises.

Core claim

We present EnterpriseRAG-Bench, a dataset consisting of approximately 500,000 documents spanning nine enterprise source types (Slack, Gmail, Linear, Google Drive, HubSpot, Fireflies, GitHub, Jira, and Confluence) and 500 questions across ten categories that test distinct retrieval and reasoning capabilities. The corpus is generated with cross-document coherence grounded in shared projects, people, and initiatives and augmented with realistic noise such as misfiled documents, near-duplicates, and conflicting information. The question set ranges from simple single-document lookups to multi-document reasoning, constrained retrieval, conflict resolution, and recognizing when information is absent.

What carries the argument

The synthetic corpus generation framework that enforces cross-document coherence through shared projects, people, and initiatives while injecting realistic noise including misfiled items, near-duplicates, and contradictions.
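
As a rough illustration of the idea, the sketch below builds a few documents that share one project and then perturbs them with the three noise types named above. It is not the released framework: the `Doc` fields, the `inject_noise` function, the `rate` parameter, the project name, and the dates are all hypothetical names and values chosen for the example.

```python
import random
from dataclasses import dataclass, replace

# Hypothetical subset of the nine source types, kept short for the example.
SOURCES = ["Slack", "Gmail", "Jira", "Confluence"]

@dataclass(frozen=True)
class Doc:
    doc_id: str
    source: str   # where the document is filed
    project: str  # shared entity that ties documents together
    due: str      # a fact that a conflicting copy can contradict

def inject_noise(docs, rate, rng):
    """Perturb each document with probability `rate`; return the new corpus."""
    out = []
    for d in docs:
        if rng.random() >= rate:
            out.append(d)  # left untouched
            continue
        kind = rng.choice(["misfiled", "near_duplicate", "conflict"])
        if kind == "misfiled":
            # Same content, filed under the wrong source type.
            wrong = rng.choice([s for s in SOURCES if s != d.source])
            out.append(replace(d, source=wrong))
        elif kind == "near_duplicate":
            # Keep the original and add a copy under a new id.
            out.extend([d, replace(d, doc_id=d.doc_id + "-dup")])
        else:
            # Keep the original and add a copy that states a different date.
            out.extend([d, replace(d, doc_id=d.doc_id + "-conflict", due="2026-07-15")])
    return out

# All documents reference the same (made-up) project, which is the coherence link.
docs = [Doc(f"d{i}", SOURCES[i % 4], "project-atlas", "2026-06-01") for i in range(8)]
clean = inject_noise(docs, 0.0, random.Random(0))
noisy = inject_noise(docs, 1.0, random.Random(0))
```

Passing a seeded `random.Random` keeps the perturbation reproducible, which matters if the noisy corpus is meant to be regenerated alongside a fixed question set.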

If this is right

RAG developers gain a standardized testbed for measuring performance on multi-hop reasoning and conflict handling inside proprietary data environments.
The released evaluation harness and public leaderboard enable direct comparison of retrieval methods on enterprise-style tasks.
Organizations can reuse the generation framework to produce custom variants matched to their own source mix, scale, and industry.
Questions that explicitly probe for absent information highlight when current systems should abstain rather than hallucinate.
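
To make the abstention point concrete, here is a minimal per-category scoring sketch. It assumes a record format of `(category, gold_answer, predicted_answer)`, a sentinel string `NOT_FOUND` for questions whose answer is absent from the corpus, and case-insensitive exact match as the correctness test. All of these are illustrative assumptions; the released harness may grade answers differently.

```python
from collections import defaultdict

ABSTAIN = "NOT_FOUND"  # hypothetical sentinel for "the corpus does not contain this"

def score(records):
    """Return accuracy per category for (category, gold, prediction) records."""
    hits, totals = defaultdict(int), defaultdict(int)
    for category, gold, pred in records:
        totals[category] += 1
        # Simplified correctness check: case-insensitive exact match.
        hits[category] += int(pred.strip().lower() == gold.strip().lower())
    return {c: hits[c] / totals[c] for c in totals}

records = [
    ("single_doc_lookup", "Q3", "q3"),
    ("absent_information", ABSTAIN, "The budget is $2M"),  # hallucinated answer
    ("absent_information", ABSTAIN, ABSTAIN),              # correct abstention
]
result = score(records)
```

Reporting accuracy per category rather than as one pooled number keeps a system that never abstains from hiding behind strong lookup scores.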

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Widespread use of the benchmark could shift research emphasis toward retrieval methods that tolerate the partial and contradictory records typical of internal systems.
Future extensions might incorporate access-control rules or time-stamped evolution of documents to mirror enterprise security and version history.
The resource may serve as a seed for domain-specific variants, for example in regulated industries where document provenance and audit trails are required.

Load-bearing premise

The generated documents and noise patterns sufficiently resemble the structure and inconsistencies found in real company-internal knowledge bases.

What would settle it

A side-by-side statistical comparison of the synthetic corpus against anonymized logs from an actual enterprise that reveals markedly different frequencies of cross-document links, conflict types, or document misplacement.
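
One simple way to run such a comparison is to measure the total variation distance between the frequency distributions of a categorical property, for example noise type, in the two corpora. The sketch below assumes each corpus has already been reduced to a list of labels. The label names and counts are made up for illustration and are not measurements from the paper or from any real enterprise.

```python
from collections import Counter

def total_variation(a, b):
    """Total variation distance between the empirical distributions of two label lists."""
    if not a or not b:
        raise ValueError("both label lists must be non-empty")
    ca, cb = Counter(a), Counter(b)
    na, nb = len(a), len(b)
    keys = set(ca) | set(cb)
    # Half the L1 distance between the two relative-frequency vectors.
    return 0.5 * sum(abs(ca[k] / na - cb[k] / nb) for k in keys)

# Made-up label lists standing in for annotated samples of each corpus.
synthetic = ["misfiled"] * 2 + ["near_duplicate"] * 5 + ["conflict"] * 3
real = ["misfiled"] * 4 + ["near_duplicate"] * 5 + ["conflict"] * 1
tvd = total_variation(synthetic, real)
```

A distance of 0 means the two label distributions are identical and 1 means they share no labels; what counts as "markedly different" would still need a threshold chosen by whoever runs the study.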

Figures

Figures reproduced from arXiv: 2605.05253 by Chris Weaver, Joachim Rahmfeld, Mark H. Butler, Roshan Desai, Weijia Chen, Wenxi Huang, Yuhong Sun.

**Figure 1.** t-SNE projections for BrowseComp-Plus, Onyx data, EnterpriseRAG-Bench.

**Figure 2.** Overview of the generation framework scaffolding. Each downstream generation step is …

**Figure 3.** Recall@10 and average cosine similarity of the 10 nearest neighbors vs. corpus size. Note: …
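
For readers unfamiliar with the two quantities named in the Figure 3 caption, the sketch below shows one common way to compute them with brute-force cosine search: recall@k as the fraction of gold documents found in the top k, and the mean cosine similarity of those k neighbors to the query. The exact definitions and retrieval setup used in the paper are not given on this page, so the function names, the toy vectors, and the choice of k = 2 are assumptions for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def top_k(query, corpus, k):
    """corpus maps doc_id -> vector. Returns [(doc_id, similarity)], best first."""
    scored = sorted(((cosine(query, v), d) for d, v in corpus.items()), reverse=True)
    return [(d, s) for s, d in scored[:k]]

def recall_at_k(retrieved_ids, gold_ids):
    """Fraction of gold documents that appear among the retrieved ids."""
    return len(set(retrieved_ids) & set(gold_ids)) / len(gold_ids)

# Toy 2-D embeddings; real embeddings would have hundreds of dimensions.
corpus = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
hits = top_k([1.0, 0.0], corpus, k=2)
ids = [d for d, _ in hits]
recall = recall_at_k(ids, gold_ids={"a", "b"})
avg_sim = sum(s for _, s in hits) / len(hits)
```

As a corpus grows, more near neighbors crowd around each query, so the average neighbor similarity tends to rise while recall at a fixed k gets harder to maintain; that is the kind of trend a plot against corpus size would expose.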

Original abstract

Retrieval-Augmented Generation (RAG) has become the standard approach for grounding large language models in information that was not available during training. While existing datasets and benchmarks focus on web or other public sources, there is still no widely adopted dataset that realistically reflects the nature of company-internal knowledge. Meanwhile, startups, enterprises, and researchers are increasingly developing AI Agents designed to operate over exactly this kind of proprietary data. To close this gap, we release a synthetic enterprise corpus, its generation framework, and a leaderboard. We present EnterpriseRAG-Bench, a dataset consisting of approximately 500,000 documents spanning nine enterprise source types (Slack, Gmail, Linear, Google Drive, HubSpot, Fireflies, GitHub, Jira, and Confluence) and 500 questions across ten categories that test distinct retrieval and reasoning capabilities. The corpus is generated with cross-document coherence (grounded in shared projects, people, and initiatives) and augmented with realistic noise such as misfiled documents, near-duplicates, and conflicting information. The question set ranges from simple single-document lookups to multi-document reasoning, constrained retrieval, conflict resolution, and recognizing when information is absent. The generation framework lets teams generate variants tailored to their own industry, scale, and source mix. The dataset, code, evaluation harness, and leaderboard are available at https://github.com/onyx-dot-app/EnterpriseRAG-Bench.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This releases a new synthetic benchmark for RAG on enterprise internal sources with cross-document links and noise, but the realism of the generated corpus lacks quantitative grounding against real data.

The main thing to know is that this paper puts out a benchmark dataset aimed at RAG systems that have to work with company-internal knowledge from tools like Slack, email, Jira, and Confluence. They generated roughly 500k documents across nine source types plus 500 questions in ten categories that range from single-document lookups to multi-hop reasoning, conflict handling, and detecting missing info. The generation includes shared projects and people for coherence and adds noise like duplicates and misfiles. They also ship the code to create custom versions and a public leaderboard.

That combination of sources and question types is not covered by existing public RAG sets, so the artifact itself is the real contribution here. Releasing everything on GitHub makes it straightforward for teams to download and run evaluations right away. The framework for adapting the mix to different industries or scales is a practical detail that could see some use.

The soft spot is the lack of external checks on whether the synthetic corpus actually matches real enterprise distributions. The abstract and description explain the construction steps, but there are no reported metrics comparing entity overlap, conflict frequency, or retrieval difficulty to anonymized logs from actual companies, nor any blinded ratings from practitioners. Without that, it is hard to judge how well results on this set will carry over to production settings.

This is aimed at researchers and engineers working on internal AI agents or RAG for proprietary data. A reader who needs a testbed beyond web sources could find it worth trying. It deserves peer review because the release is concrete and the gap it targets is real, even if the validation side needs more work. I would send it out for referee comments focused on strengthening the fidelity evidence.

Referee Report

2 major / 1 minor

Summary. The paper presents EnterpriseRAG-Bench, a synthetic dataset and benchmark for evaluating Retrieval-Augmented Generation (RAG) systems on company-internal knowledge. It consists of approximately 500,000 documents spanning nine enterprise source types (Slack, Gmail, Linear, Google Drive, HubSpot, Fireflies, GitHub, Jira, and Confluence) generated with cross-document coherence grounded in shared projects, people, and initiatives, plus realistic noise including misfiled documents, near-duplicates, and conflicting information. The benchmark also includes 500 questions across ten categories testing capabilities from single-document lookup to multi-document reasoning, constrained retrieval, conflict resolution, and detecting absent information, along with a customizable generation framework, evaluation harness, and leaderboard released on GitHub.

Significance. If the synthetic corpus construction holds as a faithful model of real enterprise data distributions, the benchmark would address a clear gap in existing RAG evaluation resources that focus primarily on public web data. The open release of the generation framework for industry-specific customization and the leaderboard promote reproducibility and community use, which are notable strengths for a data and benchmark contribution in information retrieval.

major comments (2)

[Abstract] The implied claim that the corpus 'realistically reflects the nature of company-internal knowledge' through cross-document coherence and injected noise (misfiled documents, near-duplicates, conflicting information) is load-bearing for the benchmark's utility, yet the manuscript provides no quantitative fidelity metrics such as distributional comparisons of entity graphs, conflict rates, or retrieval difficulty against anonymized real enterprise logs.
[Dataset Construction] Dataset description: No blinded expert ratings from practitioners or external validation experiments are reported to confirm that the synthetic noise and coherence properties match observed distributions in actual company internal knowledge bases across the nine source types.

minor comments (1)

[Abstract] The GitHub repository link is provided for the dataset, code, and leaderboard, which supports reproducibility; however, the main text could include a brief summary of the repository contents and usage instructions for clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their positive assessment of the benchmark's significance and for highlighting areas where additional validation would be beneficial. We address the major comments point by point below.

Referee: [Abstract] The implied claim that the corpus 'realistically reflects the nature of company-internal knowledge' through cross-document coherence and injected noise (misfiled documents, near-duplicates, conflicting information) is load-bearing for the benchmark's utility, yet the manuscript provides no quantitative fidelity metrics such as distributional comparisons of entity graphs, conflict rates, or retrieval difficulty against anonymized real enterprise logs.

Authors: We concur that quantitative fidelity metrics comparing the synthetic corpus to real enterprise data would provide valuable support for the benchmark's claims. However, such comparisons are challenging because anonymized real enterprise logs are typically not available for research purposes due to privacy regulations and competitive sensitivities. This limitation is in fact a key reason for developing synthetic alternatives. The corpus was constructed using heuristics derived from publicly documented characteristics of enterprise data and input from industry experts. In the revised manuscript, we will expand the dataset construction section to include more details on these design choices and report internal statistics on the generated noise levels, such as the proportion of conflicting information and near-duplicates. (Revision: yes)
Referee: [Dataset Construction] Dataset description: No blinded expert ratings from practitioners or external validation experiments are reported to confirm that the synthetic noise and coherence properties match observed distributions in actual company internal knowledge bases across the nine source types.

Authors: We recognize the value of blinded expert ratings and external validation for confirming the realism of the synthetic data. The current manuscript focuses on the release of the benchmark, generation framework, and initial evaluation harness. We did not include such ratings in this version to prioritize timely release and community access. We will revise the paper to include a limitations section acknowledging this and outlining plans for future validation studies. Additionally, the open-source framework allows practitioners to perform their own validations tailored to specific company contexts. (Revision: partial)

Circularity Check

0 steps flagged

Benchmark dataset release contains no derivation chain or self-referential predictions

The paper releases a synthetic corpus, generation framework, and question set for enterprise RAG evaluation. No equations, fitted parameters, predictions, or first-principles derivations appear in the abstract or described structure. The central contribution is the artifact and its generation process itself rather than any computed result that could reduce to inputs by construction. Claims about cross-document coherence and injected noise are modeling choices whose fidelity is external to the paper, not a circular reduction of a result to its own assumptions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The benchmark rests on the premise that synthetic documents linked by shared entities plus injected noise produce a realistic proxy for proprietary company data; no free parameters or invented physical entities are described.

axioms (1)

Domain assumption: Synthetic generation with cross-document coherence and added noise can produce data that realistically reflects company-internal knowledge.
Stated in the abstract as the motivation for the corpus construction.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: "We present EnterpriseRAG-Bench, a dataset consisting of approximately 500,000 documents spanning nine enterprise source types ... and 500 questions across ten categories"

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · absolute_floor_iff_bare_distinguishability · unclear

Paper passage: "The corpus is generated with cross-document coherence ... augmented with realistic noise such as misfiled documents, near-duplicates, and conflicting information"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
