arxiv: 2605.05253 · v2 · pith:XHRQXUERnew · submitted 2026-05-05 · 💻 cs.IR

EnterpriseRAG-Bench: A RAG Benchmark for Company Internal Knowledge

Pith reviewed 2026-05-21 08:02 UTC · model grok-4.3

classification 💻 cs.IR
keywords: RAG benchmark, enterprise knowledge, synthetic dataset, retrieval-augmented generation, internal documents, multi-document reasoning, AI agents

The pith

A synthetic benchmark of 500,000 enterprise documents and 500 questions tests RAG systems on realistic company-internal data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to create the first widely usable benchmark that models the messy, interconnected nature of proprietary corporate information rather than public web pages. Existing RAG datasets leave a gap because companies now run AI agents over their own Slack threads, emails, project tickets, and shared drives. The new resource supplies roughly half a million documents drawn from nine common internal sources, together with questions that range from single-fact lookup to multi-document reasoning, conflict resolution, and recognition of missing information. A generation framework is included so that organizations can produce scaled or industry-specific variants while preserving cross-document links and realistic noise. If the benchmark proves representative, it supplies a concrete yardstick for measuring and improving retrieval-augmented systems that must operate inside actual enterprises.

Core claim

We present EnterpriseRAG-Bench, a dataset consisting of approximately 500,000 documents spanning nine enterprise source types (Slack, Gmail, Linear, Google Drive, HubSpot, Fireflies, GitHub, Jira, and Confluence) and 500 questions across ten categories that test distinct retrieval and reasoning capabilities. The corpus is generated with cross-document coherence grounded in shared projects, people, and initiatives and augmented with realistic noise such as misfiled documents, near-duplicates, and conflicting information. The question set ranges from simple single-document lookups to multi-document reasoning, constrained retrieval, conflict resolution, and recognizing when information is absent.

What carries the argument

The synthetic corpus generation framework that enforces cross-document coherence through shared projects, people, and initiatives while injecting realistic noise including misfiled items, near-duplicates, and contradictions.
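To make the mechanism concrete, here is a minimal sketch of coherence-plus-noise generation. It is not the paper's framework: the `Doc` fields, the `generate_corpus` function, the per-project "home" source, and the single deadline fact per project are all hypothetical simplifications chosen for illustration.

```python
import random
from dataclasses import dataclass


@dataclass
class Doc:
    doc_id: int
    source: str      # system the document is filed in (e.g. "slack")
    project: str     # shared entity that links documents together
    author: str
    text: str
    noise: str = "none"  # "none", "conflict", "misfiled", or "near_duplicate"


def generate_corpus(n_docs, projects, people, sources, noise_rate=0.1, seed=0):
    """Toy generator: coherent by default, with labeled noise injected at noise_rate."""
    rng = random.Random(seed)
    # Assumption: each project has one home source and one true deadline,
    # so clean documents about the same project agree with each other.
    home = {p: rng.choice(sources) for p in projects}
    deadline = {p: f"2026-{rng.randint(1, 12):02d}-15" for p in projects}
    docs = []
    for i in range(n_docs):
        project = rng.choice(projects)
        author = rng.choice(people)
        source, date, noise = home[project], deadline[project], "none"
        if rng.random() < noise_rate:
            kinds = ["conflict", "misfiled"] + (["near_duplicate"] if docs else [])
            noise = rng.choice(kinds)
        if noise == "near_duplicate":
            # Copy an earlier document with a trivial textual change.
            base = rng.choice(docs)
            docs.append(Doc(i, base.source, base.project, base.author,
                            base.text + " (fwd)", noise))
            continue
        if noise == "conflict":
            date = "2027-01-31"  # contradicts every clean deadline
        elif noise == "misfiled":
            others = [s for s in sources if s != source]
            source = rng.choice(others) if others else source
        text = f"{author} on {project}: launch deadline is {date}."
        docs.append(Doc(i, source, project, author, text, noise))
    return docs
```

Keeping a `noise` label on every document is a design choice of this sketch; it makes it possible to report noise proportions later and to build questions that target a specific kind of noise.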

If this is right

  • RAG developers gain a standardized testbed for measuring performance on multi-hop reasoning and conflict handling inside proprietary data environments.
  • The released evaluation harness and public leaderboard enable direct comparison of retrieval methods on enterprise-style tasks.
  • Organizations can reuse the generation framework to produce custom variants matched to their own source mix, scale, and industry.
  • Questions that explicitly probe for absent information highlight when current systems should abstain rather than hallucinate.
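As an illustration of what such a testbed might measure, the sketch below computes Recall@k for retrieval and a simple abstention check for absent-information questions. The function names and the `NOT_FOUND` sentinel are assumptions made for this example; the released evaluation harness may define its metrics differently.

```python
from typing import Optional, Sequence

ABSTAIN = "NOT_FOUND"  # hypothetical sentinel a system returns when it declines to answer


def recall_at_k(ranked_ids: Sequence[str], relevant_ids: Sequence[str],
                k: int = 10) -> Optional[float]:
    """Fraction of relevant documents that appear in the top-k results.

    Returns None when the question has no relevant documents,
    because recall is undefined in that case.
    """
    relevant = set(relevant_ids)
    if not relevant:
        return None
    hits = relevant & set(ranked_ids[:k])
    return len(hits) / len(relevant)


def abstention_correct(answer: str, relevant_ids: Sequence[str]) -> bool:
    """True when the system abstains exactly on questions with no supporting documents."""
    should_abstain = len(relevant_ids) == 0
    did_abstain = answer.strip() == ABSTAIN
    return should_abstain == did_abstain
```

Scoring abstention separately from recall keeps the two failure modes apart: missing a document that exists, and answering when nothing supports an answer.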

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Widespread use of the benchmark could shift research emphasis toward retrieval methods that tolerate the partial and contradictory records typical of internal systems.
  • Future extensions might incorporate access-control rules or time-stamped evolution of documents to mirror enterprise security and version history.
  • The resource may serve as a seed for domain-specific variants, for example in regulated industries where document provenance and audit trails are required.

Load-bearing premise

The generated documents and noise patterns sufficiently resemble the structure and inconsistencies found in real company-internal knowledge bases.

What would settle it

A side-by-side statistical comparison of the synthetic corpus against anonymized logs from an actual enterprise that reveals markedly different frequencies of cross-document links, conflict types, or document misplacement.
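One simple way to quantify "markedly different frequencies" is the total variation distance between two categorical distributions, for example the mix of noise or conflict types in the synthetic corpus versus a real one. This is a sketch of one possible measure, not a test the paper proposes; the label names in the usage note are hypothetical.

```python
from collections import Counter


def frequencies(labels):
    """Turn a list of category labels into a relative-frequency dictionary."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("labels must not be empty")
    return {label: n / total for label, n in counts.items()}


def total_variation(p, q):
    """Total variation distance between two distributions given as dicts.

    0.0 means the distributions are identical; 1.0 means they share no categories.
    """
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)
```

For instance, `total_variation(frequencies(synthetic_labels), frequencies(real_labels))` yields a single number between 0 and 1 that could be reported per source type. How large a distance counts as a mismatch would still need to be decided by the people running the comparison.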

Figures

Figures reproduced from arXiv: 2605.05253 by Chris Weaver, Joachim Rahmfeld, Mark H. Butler, Roshan Desai, Weijia Chen, Wenxi Huang, Yuhong Sun.

Figure 1: t-SNE projections for BrowseComp-Plus, Onyx data, EnterpriseRAG-Bench.
Figure 2: Overview of the generation framework scaffolding. Each downstream generation step is …
Figure 3: Recall@10 and average cosine similarity of the 10 nearest neighbors vs. corpus size. Note: …
Original abstract

Retrieval-Augmented Generation (RAG) has become the standard approach for grounding large language models in information that was not available during training. While existing datasets and benchmarks focus on web or other public sources, there is still no widely adopted dataset that realistically reflects the nature of company-internal knowledge. Meanwhile, startups, enterprises, and researchers are increasingly developing AI Agents designed to operate over exactly this kind of proprietary data. To close this gap, we release a synthetic enterprise corpus, its generation framework, and a leaderboard. We present EnterpriseRAG-Bench, a dataset consisting of approximately 500,000 documents spanning nine enterprise source types (Slack, Gmail, Linear, Google Drive, HubSpot, Fireflies, GitHub, Jira, and Confluence) and 500 questions across ten categories that test distinct retrieval and reasoning capabilities. The corpus is generated with cross-document coherence (grounded in shared projects, people, and initiatives) and augmented with realistic noise such as misfiled documents, near-duplicates, and conflicting information. The question set ranges from simple single-document lookups to multi-document reasoning, constrained retrieval, conflict resolution, and recognizing when information is absent. The generation framework lets teams generate variants tailored to their own industry, scale, and source mix. The dataset, code, evaluation harness, and leaderboard are available at https://github.com/onyx-dot-app/EnterpriseRAG-Bench.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper presents EnterpriseRAG-Bench, a synthetic dataset and benchmark for evaluating Retrieval-Augmented Generation (RAG) systems on company-internal knowledge. It consists of approximately 500,000 documents spanning nine enterprise source types (Slack, Gmail, Linear, Google Drive, HubSpot, Fireflies, GitHub, Jira, and Confluence) generated with cross-document coherence grounded in shared projects, people, and initiatives, plus realistic noise including misfiled documents, near-duplicates, and conflicting information. The benchmark also includes 500 questions across ten categories testing capabilities from single-document lookup to multi-document reasoning, constrained retrieval, conflict resolution, and detecting absent information, along with a customizable generation framework, evaluation harness, and leaderboard released on GitHub.

Significance. If the synthetic corpus construction holds as a faithful model of real enterprise data distributions, the benchmark would address a clear gap in existing RAG evaluation resources that focus primarily on public web data. The open release of the generation framework for industry-specific customization and the leaderboard promote reproducibility and community use, which are notable strengths for a data and benchmark contribution in information retrieval.

major comments (2)
  1. [Abstract] The claim that the corpus 'realistically reflects the nature of company-internal knowledge' through cross-document coherence and injected noise (misfiled documents, near-duplicates, conflicting information) is load-bearing for the benchmark's utility, yet the manuscript provides no quantitative fidelity metrics such as distributional comparisons of entity graphs, conflict rates, or retrieval difficulty against anonymized real enterprise logs.
  2. [Dataset Construction] Dataset description: No blinded expert ratings from practitioners or external validation experiments are reported to confirm that the synthetic noise and coherence properties match observed distributions in actual company internal knowledge bases across the nine source types.
minor comments (1)
  1. [Abstract] The GitHub repository link is provided for the dataset, code, and leaderboard, which supports reproducibility; however, the main text could include a brief summary of the repository contents and usage instructions for clarity.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their positive assessment of the benchmark's significance and for highlighting areas where additional validation would be beneficial. We address the major comments point by point below.

  1. Referee: [Abstract] The claim that the corpus 'realistically reflects the nature of company-internal knowledge' through cross-document coherence and injected noise (misfiled documents, near-duplicates, conflicting information) is load-bearing for the benchmark's utility, yet the manuscript provides no quantitative fidelity metrics such as distributional comparisons of entity graphs, conflict rates, or retrieval difficulty against anonymized real enterprise logs.

    Authors: We concur that quantitative fidelity metrics comparing the synthetic corpus to real enterprise data would provide valuable support for the benchmark's claims. However, such comparisons are challenging because anonymized real enterprise logs are typically not available for research purposes due to privacy regulations and competitive sensitivities. This limitation is in fact a key reason for developing synthetic alternatives. The corpus was constructed using heuristics derived from publicly documented characteristics of enterprise data and input from industry experts. In the revised manuscript, we will expand the dataset construction section to include more details on these design choices and report internal statistics on the generated noise levels, such as the proportion of conflicting information and near-duplicates. revision: yes

  2. Referee: [Dataset Construction] Dataset description: No blinded expert ratings from practitioners or external validation experiments are reported to confirm that the synthetic noise and coherence properties match observed distributions in actual company internal knowledge bases across the nine source types.

    Authors: We recognize the value of blinded expert ratings and external validation for confirming the realism of the synthetic data. The current manuscript focuses on the release of the benchmark, generation framework, and initial evaluation harness. We did not include such ratings in this version to prioritize timely release and community access. We will revise the paper to include a limitations section acknowledging this and outlining plans for future validation studies. Additionally, the open-source framework allows practitioners to perform their own validations tailored to specific company contexts. revision: partial
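The internal noise statistics promised in the first response could be computed along these lines. This is a sketch only: it estimates the share of near-duplicate documents using Jaccard similarity over word shingles, and the 0.8 threshold and shingle size of 3 are illustrative assumptions rather than values from the paper.

```python
def shingles(text, n=3):
    """Set of overlapping n-word windows from lowercased text."""
    words = text.lower().split()
    if len(words) < n:
        return {tuple(words)}
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def near_duplicate_rate(texts, threshold=0.8, n=3):
    """Share of documents that have at least one near-duplicate partner."""
    if not texts:
        return 0.0
    sets = [shingles(t, n) for t in texts]
    flagged = set()
    # Pairwise comparison is quadratic, so this only suits small samples.
    for i in range(len(sets)):
        for j in range(i + 1, len(sets)):
            if jaccard(sets[i], sets[j]) >= threshold:
                flagged.update((i, j))
    return len(flagged) / len(texts)
```

At the scale of roughly 500,000 documents, a full pairwise scan would be impractical; sampling or an approximate technique such as MinHash with locality-sensitive hashing would be the usual substitute.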

Circularity Check

0 steps flagged

Benchmark dataset release contains no derivation chain or self-referential predictions


The paper releases a synthetic corpus, generation framework, and question set for enterprise RAG evaluation. No equations, fitted parameters, predictions, or first-principles derivations appear in the abstract or described structure. The central contribution is the artifact and its generation process itself rather than any computed result that could reduce to inputs by construction. Claims about cross-document coherence and injected noise are modeling choices whose fidelity is external to the paper, not a circular reduction of a result to its own assumptions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The benchmark rests on the premise that synthetic documents linked by shared entities plus injected noise produce a realistic proxy for proprietary company data; no free parameters or invented physical entities are described.

axioms (1)
  • Domain assumption: Synthetic generation with cross-document coherence and added noise can produce data that realistically reflects company-internal knowledge.
    Stated in the abstract as the motivation for the corpus construction.



