pith. machine review for the scientific record.

arxiv: 2605.05253 · v1 · submitted 2026-05-05 · 💻 cs.IR

Recognition: unknown

EnterpriseRAG-Bench: A RAG Benchmark for Company Internal Knowledge

Chris Weaver, Joachim Rahmfeld, Mark H. Butler, Roshan Desai, Wenxi Huang, Yuhong Sun

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 17:11 UTC · model grok-4.3

classification 💻 cs.IR
keywords RAG benchmark · enterprise data · retrieval-augmented generation · synthetic dataset · multi-document reasoning · internal knowledge management · LLM evaluation

The pith

A new benchmark supplies 500,000 synthetic enterprise documents and 500 questions to evaluate retrieval-augmented generation on company-internal knowledge.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates EnterpriseRAG-Bench because current RAG benchmarks use public web data while companies need systems that handle their own private records. The authors build a large synthetic collection drawn from nine common enterprise tools and add realistic links across documents plus noise such as misfiled items and conflicting facts. They supply 500 questions divided into ten categories that range from single-document lookups to multi-hop reasoning, conflict handling, and detection of missing information. A generation framework accompanies the dataset so teams can produce tailored versions for different industries or data mixes. The release includes code, an evaluation harness, and a public leaderboard for standardized testing.
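
To make the shape of the release concrete, here is a minimal sketch of how a harness might represent and load such a corpus. The JSONL layout, field names, and category labels are illustrative assumptions, not the released schema; consult the linked repository for the actual format.

```python
import json
from dataclasses import dataclass

@dataclass
class EnterpriseDoc:
    doc_id: str
    source_type: str  # one of the nine tools, e.g. "slack", "jira", "confluence"
    text: str

@dataclass
class BenchQuestion:
    qid: str
    category: str     # e.g. "single_doc_lookup", "multi_hop", "conflict", "absent"
    question: str
    answer: str

def load_jsonl(path, cls):
    """Read one record per line into the given dataclass (hypothetical layout)."""
    with open(path, encoding="utf-8") as f:
        return [cls(**json.loads(line)) for line in f]

# docs = load_jsonl("corpus.jsonl", EnterpriseDoc)          # ~500k documents
# questions = load_jsonl("questions.jsonl", BenchQuestion)  # 500 questions
```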

Core claim

We present EnterpriseRAG-Bench, a dataset consisting of approximately 500,000 documents spanning nine enterprise source types (Slack, Gmail, Linear, Google Drive, HubSpot, Fireflies, GitHub, Jira, and Confluence) and 500 questions across ten categories that test distinct retrieval and reasoning capabilities. The corpus is generated with cross-document coherence (grounded in shared projects, people, and initiatives) and augmented with realistic noise such as misfiled documents, near-duplicates, and conflicting information.

What carries the argument

The EnterpriseRAG-Bench dataset together with its generation framework, which produces a coherent synthetic enterprise corpus containing controlled noise to measure how well RAG systems retrieve and reason over internal company records.
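
As a way to picture "controlled noise", here is a hedged sketch of how misfiled documents, near-duplicates, and conflicting facts could be injected into a synthetic corpus. The rates, field names, and perturbations are illustrative guesses, not the paper's actual generation procedure.

```python
import random

def inject_noise(docs, misfile_rate=0.02, dup_rate=0.01, conflict_rate=0.01, seed=0):
    """Corrupt a synthetic corpus with three enterprise-style noise types.

    Each doc is a dict with at least "doc_id", "source_type", and "text";
    an optional "facts" dict holds key claims. All parameters are assumed.
    """
    rng = random.Random(seed)
    noisy = [dict(d) for d in docs]
    sources = sorted({d["source_type"] for d in noisy})
    duplicates = []
    for d in noisy:
        # Misfiled document: relabel it with the wrong source system.
        if len(sources) > 1 and rng.random() < misfile_rate:
            d["source_type"] = rng.choice([s for s in sources if s != d["source_type"]])
        # Conflicting information: overwrite one stated fact.
        if d.get("facts") and rng.random() < conflict_rate:
            key = rng.choice(sorted(d["facts"]))
            d["facts"][key] = "CONTRADICTS:" + str(d["facts"][key])
        # Near-duplicate: copy the doc with a crude textual perturbation.
        if rng.random() < dup_rate:
            dup = dict(d)
            if "facts" in dup:
                dup["facts"] = dict(dup["facts"])  # avoid shared mutable state
            dup["doc_id"] = d["doc_id"] + "-dup"
            dup["text"] = d["text"].replace("will", "should")
            duplicates.append(dup)
    return noisy + duplicates
```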

If this is right

  • RAG developers can measure performance on tasks such as constrained retrieval and conflict resolution that arise in company settings.
  • The accompanying framework lets organizations generate custom versions matched to their own document mix and scale.
  • Standardized evaluation on the leaderboard enables direct comparison of retrieval methods for internal knowledge tasks.
  • The question categories reveal where current systems fail when information is absent or spread across multiple sources (a per-category scoring sketch follows this list).
  • Teams gain a reusable testbed for improving AI agents that operate over proprietary data.
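
The per-category breakdown mentioned above takes only a few lines once predictions are collected. The schema and the abstention convention below are assumptions for illustration, not the benchmark's actual format.

```python
from collections import defaultdict

def score_by_category(questions, predictions):
    """Exact-match accuracy per question category.

    questions: list of {"qid", "category", "answer"} dicts; for the
    missing-information category we assume the gold answer is an
    abstention token such as "NOT_FOUND" (a hypothetical convention).
    predictions: dict mapping qid -> predicted answer string.
    """
    hits, totals = defaultdict(int), defaultdict(int)
    for q in questions:
        totals[q["category"]] += 1
        pred = predictions.get(q["qid"], "")
        if pred.strip().lower() == q["answer"].strip().lower():
            hits[q["category"]] += 1
    return {cat: hits[cat] / totals[cat] for cat in totals}
```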

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Wider use of this benchmark could encourage RAG research to prioritize handling of noisy, interconnected internal records over clean public sources.
  • The approach of injecting realistic enterprise noise might be extended to other domains such as legal or medical document collections.
  • If the synthetic data proves predictive, it could reduce the need for costly access to real proprietary data during early model development.
  • Future versions might add temporal dynamics or access-control constraints to simulate live company environments more closely.

Load-bearing premise

The synthetic corpus, with its cross-document coherence and injected noise (misfiled documents, near-duplicates, conflicting information), realistically reflects company-internal knowledge.

What would settle it

A study demonstrating that RAG systems achieve different accuracy rankings on actual proprietary enterprise collections than on this synthetic set would establish that the benchmark does not capture real conditions.
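
Concretely, such a study could compare how systems rank on the synthetic benchmark versus a real corpus using a rank correlation. A minimal sketch with Kendall's tau; the score dictionaries and system names are hypothetical.

```python
from scipy.stats import kendalltau

def ranking_agreement(synthetic_scores, real_scores):
    """Kendall's tau between system rankings on two test beds.

    Inputs map system name -> accuracy. A tau near 1 would suggest the
    synthetic set predicts real-world orderings; a low or negative tau
    would support the objection that it does not capture real conditions.
    """
    systems = sorted(set(synthetic_scores) & set(real_scores))
    tau, p_value = kendalltau(
        [synthetic_scores[s] for s in systems],
        [real_scores[s] for s in systems],
    )
    return tau, p_value

# Example (hypothetical numbers):
# ranking_agreement({"bm25": 0.41, "hybrid": 0.58}, {"bm25": 0.37, "hybrid": 0.52})
```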

Figures

Figures reproduced from arXiv: 2605.05253 by Chris Weaver, Joachim Rahmfeld, Mark H. Butler, Roshan Desai, Wenxi Huang, Yuhong Sun.

Figure 1: t-SNE projections for BrowseComp-Plus, Onyx data, and EnterpriseRAG-Bench.

Figure 2: Overview of the generation framework scaffolding. Each downstream generation step is …

Figure 3: Recall@10 and average cosine similarity of the 10 nearest neighbors vs. corpus size. Note: …
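
Figure 3's two quantities, Recall@10 and the mean cosine similarity of the 10 nearest neighbors, can be computed directly from unit-normalized embeddings. A NumPy-only sketch; the variable shapes are assumptions about how one might store the data.

```python
import numpy as np

def recall_and_nn_similarity(doc_emb, query_emb, gold_ids, k=10):
    """Recall@k and mean cosine similarity of the k nearest neighbors.

    doc_emb:   (N, d) unit-normalized document embeddings
    query_emb: (Q, d) unit-normalized query embeddings
    gold_ids:  list of Q sets of relevant document indices
    """
    sims = query_emb @ doc_emb.T             # cosine similarity, rows are unit norm
    topk = np.argsort(-sims, axis=1)[:, :k]  # indices of the k nearest docs
    recall = float(np.mean([
        len(set(row.tolist()) & gold) / max(len(gold), 1)
        for row, gold in zip(topk, gold_ids)
    ]))
    mean_nn_sim = float(np.take_along_axis(sims, topk, axis=1).mean())
    return recall, mean_nn_sim
```
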
Original abstract

Retrieval-Augmented Generation (RAG) has become the standard approach for grounding large language models in information that was not available during training. While existing datasets and benchmarks focus on web or other public sources, there is still no widely adopted dataset that realistically reflects the nature of company-internal knowledge. Meanwhile, startups, enterprises, and researchers are increasingly developing AI Agents designed to operate over exactly this kind of proprietary data. To close this gap, we release a synthetic enterprise corpus, its generation framework, and a leaderboard. We present EnterpriseRAG-Bench, a dataset consisting of approximately 500,000 documents spanning nine enterprise source types (Slack, Gmail, Linear, Google Drive, HubSpot, Fireflies, GitHub, Jira, and Confluence) and 500 questions across ten categories that test distinct retrieval and reasoning capabilities. The corpus is generated with cross-document coherence (grounded in shared projects, people, and initiatives) and augmented with realistic noise such as misfiled documents, near-duplicates, and conflicting information. The question set ranges from simple single-document lookups to multi-document reasoning, constrained retrieval, conflict resolution, and recognizing when information is absent. The generation framework lets teams generate variants tailored to their own industry, scale, and source mix. The dataset, code, evaluation harness, and leaderboard are available at https://github.com/onyx-dot-app/EnterpriseRAG-Bench.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The manuscript introduces EnterpriseRAG-Bench, a synthetic dataset consisting of approximately 500,000 documents spanning nine enterprise source types (Slack, Gmail, Linear, Google Drive, HubSpot, Fireflies, GitHub, Jira, and Confluence) together with 500 questions across ten categories that test distinct retrieval and reasoning capabilities. The corpus is generated with cross-document coherence grounded in shared projects, people, and initiatives and is augmented with realistic noise such as misfiled documents, near-duplicates, and conflicting information; the authors also release the generation framework, evaluation harness, and a public leaderboard.

Significance. If the realism claim holds, the benchmark would fill a clear gap in RAG evaluation by supplying a proxy for proprietary enterprise data, which is increasingly relevant for internal AI agents. The open release of the generation framework, code, and leaderboard is a concrete strength that supports reproducibility and customization to different industries or source mixes.

major comments (1)
  1. [Abstract] The central claim that the synthetic corpus 'realistically reflects the nature of company-internal knowledge' because it incorporates cross-document coherence and noise (misfiled documents, conflicts) is presented without any quantitative validation. No comparisons are reported of document-type frequencies, cross-reference density, conflict rates, or retrieval-failure modes against real (even anonymized) enterprise corpora from the listed sources. This validation gap is load-bearing for the benchmark's stated purpose.
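
One shape the missing validation could take: compare document-type frequency distributions between the synthetic corpus and a real (anonymized) one. A hedged sketch using Jensen-Shannon distance; the "source_type" field and the availability of a real corpus are assumptions.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

def type_frequency_distance(synthetic_docs, real_docs, source_types):
    """Jensen-Shannon distance between document-type frequencies.

    Each doc is a dict with a "source_type" field (illustrative schema).
    A distance near 0 would indicate matching source mixes; analogous
    checks could cover cross-reference density or conflict rates.
    """
    def freq(docs):
        counts = np.array(
            [sum(d["source_type"] == t for d in docs) for t in source_types],
            dtype=float,
        )
        return counts / counts.sum()
    return jensenshannon(freq(synthetic_docs), freq(real_docs))
```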

Simulated Author's Rebuttal

1 response · 1 unresolved

We thank the referee for the constructive and detailed review. We address the major comment below and have revised the manuscript to acknowledge the validation limitations while preserving the benchmark's value as a synthetic proxy.

point-by-point responses
  1. Referee: [Abstract] The central claim that the synthetic corpus 'realistically reflects the nature of company-internal knowledge' because it incorporates cross-document coherence and noise (misfiled documents, conflicts) is presented without any quantitative validation. No comparisons are reported of document-type frequencies, cross-reference density, conflict rates, or retrieval-failure modes against real (even anonymized) enterprise corpora from the listed sources. This validation gap is load-bearing for the benchmark's stated purpose.

    Authors: We agree that a direct quantitative comparison to real (even anonymized) enterprise corpora would strengthen the realism claim. However, such data is unavailable to us or the community due to privacy, legal, and competitive sensitivities—the exact reason a synthetic benchmark is needed. The generation framework was designed with input from enterprise practitioners to reflect observed patterns in cross-document coherence, noise, and source distributions. In the revised manuscript we have added a 'Limitations' section that explicitly discusses the absence of empirical validation against real corpora, details the rationale and sources for our coherence/noise parameters, and softens the abstract language from 'realistically reflects' to 'aims to approximate'. We also emphasize that the open framework and code allow users to calibrate and validate against their own proprietary data. These changes address the load-bearing concern without overstating the benchmark's fidelity. revision: yes

standing simulated objections not resolved
  • Quantitative comparisons of document-type frequencies, cross-reference density, conflict rates, or retrieval-failure modes against real (anonymized) enterprise corpora from the listed sources, which cannot be performed due to data access restrictions.

Circularity Check

0 steps flagged

No circularity: new benchmark artifact release with no derivations or self-referential reductions

full rationale

The paper presents EnterpriseRAG-Bench as a released synthetic corpus (~500k documents across nine enterprise sources), generation framework, 500 questions in ten categories, and leaderboard. No equations, fitted parameters, predictions, or first-principles derivations appear in the provided text. The central claims consist of describing the dataset construction (cross-document coherence, added noise) and its intended use for RAG evaluation. These are direct artifact contributions rather than any chain that reduces a result to its own inputs by definition, renaming, or self-citation. The realism of the synthetic data is asserted but not derived from prior results within the paper, leaving no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Rests on domain assumption that synthetic generation can match real enterprise distributions and failure modes; no free parameters or new entities introduced.

axioms (1)
  • domain assumption: Synthetic documents grounded in shared projects and people can exhibit the cross-document coherence and noise patterns of actual company data.
    Used to support the claim that the benchmark reflects the nature of company-internal knowledge.

pith-pipeline@v0.9.0 · 8832 in / 1000 out tokens · 35834 ms · 2026-05-08T17:11:14.488888+00:00 · methodology


Reference graph

Works this paper leans on

18 extracted references · 3 canonical work pages · 1 internal anchor

  1. [1]

    MS MARCO: A Human Generated MAchine Reading COmprehension Dataset

    Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, et al. MS MARCO: A human generated machine reading comprehension dataset. arXiv preprint arXiv:1611.09268, 2018.

  2. [2]

    FinQA: A dataset of numerical reasoning over financial data

    Zhiyu Chen, Wenhu Chen, Charese Smiley, Sameena Shah, Iana Borova, and Michael Bendersky. FinQA: A dataset of numerical reasoning over financial data. In Proceedings of EMNLP, 2021.

  3. [3]

    BrowseComp-Plus: A controlled evaluation framework for browsing agents

    Mingxuan Du, Benfeng Xu, Chiwei Zhu, Xiaorui Wang, and Zhendong Mao. BrowseComp-Plus: A controlled evaluation framework for browsing agents. arXiv preprint arXiv:2508.06600, 2025.

  4. [4]

    Meet KARL: A faster agent for enterprise knowledge, powered by custom RL

    Databricks. Meet KARL: A faster agent for enterprise knowledge, powered by custom RL. Technical report, Databricks, 2025.

  5. [5]

    PubMedQA: A dataset for biomedical research question answering

    Qiao Jin, Bhuwan Dhingra, Zhengping Liu, William Cohen, and Xinghua Lu. PubMedQA: A dataset for biomedical research question answering. In Proceedings of EMNLP, 2019.

  6. [6]

    Natural questions: A benchmark for question answering research

    Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, et al. Natural questions: A benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7: 453–466, 2019.

  7. [7]

    Least squares quantization in PCM

    Stuart P. Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2): 129–137, 1982.

  8. [8]

    Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs

    Yu A. Malkov and D. A. Yashunin. Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(4): 824–836, 2020.

  9. [9]

    MTEB: Massive text embedding benchmark

    Niklas Muennighoff, Nouamane Tazi, Loïc Magne, and Nils Reimers. MTEB: Massive text embedding benchmark. In Proceedings of EACL, 2023.

  10. [10]

    New embedding models and API updates

    OpenAI. New embedding models and API updates. OpenAI Blog, 2024.

  11. [11]

    GPT-5.4

    OpenAI. GPT-5.4. OpenAI, 2026.

  12. [12]

    KILT: A benchmark for knowledge intensive language tasks

    Fabio Petroni, Aleksandra Piktus, Angela Fan, Patrick Lewis, Yacine Jernite, et al. KILT: A benchmark for knowledge intensive language tasks. In Proceedings of NAACL, 2021.

  13. [13]

    The probabilistic relevance framework: BM25 and beyond

    Stephen Robertson and Hugo Zaragoza. The probabilistic relevance framework: BM25 and beyond. Foundations and Trends in Information Retrieval, 3(4): 333–389, 2009.

  14. [14]

    Keyword search is all you need

    Shreyas Subramanian, Adewale Akinfaderin, Yanyan Zhang, Ishan Singh, Mani Khanuja, Sandeep Singh, and Maira Ladeira Tanke. Keyword search is all you need. arXiv preprint arXiv:2602.23368, 2025.

  15. [15]

    BEIR: A heterogeneous benchmark for zero-shot evaluation of information retrieval models

    Nandan Thakur, Nils Reimers, Andreas Rücklé, Abhishek Srivastava, and Iryna Gurevych. BEIR: A heterogeneous benchmark for zero-shot evaluation of information retrieval models. In Proceedings of NeurIPS, 2021.

  16. [16]

    MuSiQue: Multihop questions via single hop question composition

    Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. MuSiQue: Multihop questions via single hop question composition. Transactions of the Association for Computational Linguistics, 10: 539–554, 2022.

  17. [17]

    Visualizing data using t-SNE

    Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9: 2579–2605, 2008.

  18. [18]

    HotpotQA: A dataset for diverse, explainable multi-hop question answering

    Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proceedings of EMNLP, 2018.