pith. machine review for the scientific record.

arxiv: 2605.08838 · v1 · submitted 2026-05-09 · 💻 cs.CL · cs.AI

Recognition: 2 theorem links · Lean Theorem

Generating Leakage-Free Benchmarks for Robust RAG Evaluation

Bowen Jin, Jennifer Neville, Jiaxing Zhang, Jiayi Liu

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 02:59 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords RAG, benchmark generation, knowledge leakage, retrieval-augmented generation, semi-synthetic benchmarks, reasoning graphs, entity replacement, evaluation robustness

The pith

SeedRG creates fresh RAG benchmark examples by replacing entities in extracted reasoning graphs to block knowledge leakage.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing RAG benchmarks often contain questions that large language models can answer from their training data alone, making it hard to tell whether retrieval is actually needed. The paper introduces SeedRG, a pipeline that starts from a seed dataset, extracts the reasoning structure of each question-context pair, and generates new examples by swapping entities of the same type. This keeps the reasoning pattern intact while producing instances unlikely to be in the model's memory. Two checks ensure the new examples stay difficult and still require retrieval. A sympathetic reader cares because reliable benchmarks are essential for knowing when and how much RAG improves answers.

Core claim

The central claim is that, starting from a seed benchmark, extracting a reasoning graph from question-context pairs and applying type-constrained entity replacement, followed by consistency and leakage verification, generates structurally similar yet novel instances that preserve task difficulty and are absent from the model's parametric knowledge.

What carries the argument

Reasoning-graph extraction combined with type-constrained entity replacement: the extracted graph captures the underlying reasoning structure, and same-type replacement creates new but structurally equivalent instances.
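That mechanism can be sketched end to end. This is an illustrative reconstruction from the abstract, not the authors' code: the extractor is a stub, and the entity pools, type labels, and function names (`extract_reasoning_graph`, `type_constrained_replace`) are assumptions.

```python
# Sketch of the SeedRG-style loop: extract a typed reasoning graph,
# swap entities within their types, then rebuild the example.
# The extractor is stubbed; a real pipeline would use an LLM or parser.

def extract_reasoning_graph(question, context):
    """Stub extractor: typed entities plus labeled relation edges."""
    entities = {"Hugo": "PERSON", "Paris": "CITY"}
    edges = [("Hugo", "born_in", "Paris")]
    return entities, edges

def type_constrained_replace(entities, pools):
    """Map each entity to a fresh entity of the same type."""
    return {ent: pools[etype].pop(0) for ent, etype in entities.items()}

def substitute(text, mapping):
    """Rewrite question/context text under the entity mapping."""
    for old, new in mapping.items():
        text = text.replace(old, new)
    return text

# Toy run on one seed question-context pair.
q, ctx = "Where was Hugo born?", "Hugo was born in Paris."
entities, edges = extract_reasoning_graph(q, ctx)
pools = {"PERSON": ["Arlen"], "CITY": ["Velmora"]}  # invented novel entities
mapping = type_constrained_replace(entities, pools)
new_q, new_ctx = substitute(q, mapping), substitute(ctx, mapping)
print(new_q)    # Where was Arlen born?
print(new_ctx)  # Arlen was born in Velmora.
```

The relation skeleton (`born_in`) is untouched; only the surface entities change, which is exactly what the consistency and leakage checks then have to verify.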

If this is right

  • RAG models can be tested on whether they truly depend on retrieved information rather than internal knowledge.
  • Benchmarks can be refreshed periodically to counter the effects of models training on public data.
  • Evaluation results become more trustworthy for comparing different RAG approaches.
  • The method reduces the need to collect entirely new human-annotated datasets for each evaluation cycle.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This method could be extended to create dynamic benchmarks that evolve with model capabilities.
  • It suggests that graph-based representations of reasoning may be useful for other data augmentation tasks in NLP.
  • Adopting such pipelines might shift the field toward generated rather than static benchmarks for robustness testing.

Load-bearing premise

Type-constrained entity replacement on reasoning graphs from seed data will yield new instances that match the original difficulty and lie outside the model's existing knowledge.

What would settle it

Comparing closed-book accuracy (no retrieval) on the generated examples against the original seed questions would show whether leakage persists: if models score similarly on both sets without retrieval, the generated items are still answerable from memory.
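One way to run that test, sketched with a stub standing in for the model; `answer_fn`, the toy data, and the gap statistic are assumptions for illustration, not the paper's protocol.

```python
def closed_book_accuracy(examples, answer_fn):
    """Fraction of (question, gold) pairs answered with no retrieval."""
    return sum(answer_fn(q) == gold for q, gold in examples) / len(examples)

def leakage_gap(seed, generated, answer_fn):
    """Seed-set minus generated-set closed-book accuracy.
    A gap near zero means the generated items leak just as badly;
    a large positive gap means they escaped parametric memory."""
    return closed_book_accuracy(seed, answer_fn) - closed_book_accuracy(generated, answer_fn)

# Stub model that has memorized the seed benchmark verbatim.
memorized = {"Where was Hugo born?": "Paris"}
answer_fn = lambda q: memorized.get(q, "unknown")

seed = [("Where was Hugo born?", "Paris")]
generated = [("Where was Arlen born?", "Velmora")]
print(leakage_gap(seed, generated, answer_fn))  # 1.0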

Figures

Figures reproduced from arXiv: 2605.08838 by Bowen Jin, Jennifer Neville, Jiaxing Zhang, Jiayi Liu.

Figure 1. An example of the RAG evaluation gap. (a) Existing benchmarks overlap with LLM pretraining data, making retrieval redundant. (b) SeedRG preserves the reasoning structure of seed questions while replacing all entities with novel counterparts, ensuring no overlap with parametric knowledge.
Figure 2. Overview of the SeedRG benchmark generation pipeline.
Figure 3. Generation quality comparison: SeedRG vs. Direct Generation.
Figure 4. RAG algorithm evaluation and stability on SeedRG.
original abstract

Retrieval-augmented generation (RAG) is widely used to augment large language models (LLMs) with external knowledge. However, many benchmark datasets, designed to test RAG performance, comprise many questions that can already be answered from an LLM's parametric memory. This leads to unreliable evaluation. We refer to this phenomenon as knowledge leakage: cases where RAG tasks are solvable without retrieval. This issue worsens over time due to benchmark aging. As benchmarks are reused for training, their contents are increasingly absorbed into model parameters, making them less effective for evaluating retrieval. We introduce SeedRG, a semi-synthetic benchmark generation pipeline that mitigates knowledge leakage and addresses the issue of benchmark aging. Starting from a seed benchmark dataset, SeedRG extracts a reasoning graph from question-context pairs to capture their underlying reasoning structure, and then generates new examples via type-constrained entity replacement. This process produces structurally similar but novel instances that are unlikely to exist in the model's parametric knowledge, while preserving the original reasoning patterns. To ensure quality, we incorporate two verification steps: (1) a reasoning-graph consistency check to maintain task difficulty, and (2) a knowledge-leakage filter to exclude instances answerable without retrieval.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces SeedRG, a semi-synthetic benchmark generation pipeline for RAG evaluation. Starting from a seed dataset, it extracts reasoning graphs from question-context pairs, applies type-constrained entity replacement to generate novel instances that are structurally similar yet unlikely to appear in model parametric knowledge, and applies two verification steps (reasoning-graph consistency check and knowledge-leakage filter) to preserve task difficulty and exclude cases solvable without retrieval, thereby addressing knowledge leakage and benchmark aging.

Significance. If the pipeline can be shown to reliably produce leakage-free instances of equivalent difficulty, the work would provide a constructive method for generating fresh, robust RAG benchmarks that better isolate retrieval performance from parametric memorization. This directly tackles a growing practical problem in LLM evaluation as training corpora expand and benchmarks age.

major comments (2)
  1. [Abstract / Method outline] Abstract and pipeline description: the central claim that type-constrained entity replacement yields instances that are (a) structurally equivalent, (b) absent from parametric knowledge, and (c) of preserved difficulty rests on the leakage filter and consistency check, yet the manuscript supplies no empirical results, error analysis, or implementation details (e.g., prompt design for the filter or how graph consistency is quantified) to demonstrate these properties hold.
  2. [Verification steps] Verification steps: the knowledge-leakage filter is defined as excluding instances answerable without retrieval and the consistency check as preserving difficulty, but these are post-hoc and the paper does not address whether entity replacement can create easier/harder variants via rarity or domain shift, or whether a single-prompt filter misses partial parametric knowledge of new combinations.
minor comments (2)
  1. [Abstract] The abstract outlines the two verification steps but does not specify the exact criteria or models used; expanding this in the main text would aid reproducibility.
  2. [Method] Notation for the reasoning graph extraction and entity replacement could be formalized (e.g., with pseudocode or a diagram) to clarify the type-constraint mechanism.
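One shape the requested formalization could take: a consistency criterion defined as structural isomorphism with node-type matching. This is an assumption inferred from the pipeline description, not the paper's definition, and the brute-force search over bijections is only viable for the small graphs a single question yields.

```python
from itertools import permutations

def type_isomorphic(nodes_a, edges_a, nodes_b, edges_b):
    """True iff some bijection between node sets preserves both node
    types and labeled edges (structural isomorphism + type matching).
    `nodes_*` map entity -> type; `edges_*` are (head, relation, tail)."""
    if len(nodes_a) != len(nodes_b) or len(edges_a) != len(edges_b):
        return False
    ids_a = list(nodes_a)
    for perm in permutations(nodes_b):
        m = dict(zip(ids_a, perm))
        types_ok = all(nodes_a[x] == nodes_b[m[x]] for x in ids_a)
        edges_ok = {(m[u], r, m[v]) for u, r, v in edges_a} == set(edges_b)
        if types_ok and edges_ok:
            return True
    return False

# A seed graph and its entity-replaced counterpart should pass;
# a graph with a reversed relation should fail.
seed = ({"Hugo": "PERSON", "Paris": "CITY"}, [("Hugo", "born_in", "Paris")])
swapped = ({"Arlen": "PERSON", "Velmora": "CITY"}, [("Arlen", "born_in", "Velmora")])
broken = ({"Arlen": "PERSON", "Velmora": "CITY"}, [("Velmora", "born_in", "Arlen")])
print(type_isomorphic(*seed, *swapped))  # True
print(type_isomorphic(*seed, *broken))   # False
```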

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the practical importance of addressing knowledge leakage and benchmark aging in RAG evaluation. We address each major comment below and outline the revisions we will make to strengthen the empirical grounding and discussion of the SeedRG pipeline.

point-by-point responses
  1. Referee: [Abstract / Method outline] Abstract and pipeline description: the central claim that type-constrained entity replacement yields instances that are (a) structurally equivalent, (b) absent from parametric knowledge, and (c) of preserved difficulty rests on the leakage filter and consistency check, yet the manuscript supplies no empirical results, error analysis, or implementation details (e.g., prompt design for the filter or how graph consistency is quantified) to demonstrate these properties hold.

    Authors: We agree that the current description would be strengthened by explicit empirical validation and implementation details. In the revised manuscript we will add a dedicated evaluation section reporting results on the leakage filter and consistency check (including precision/recall against human judgments where feasible), an error analysis of failure cases, the exact prompt templates used for the knowledge-leakage filter, and the concrete metric employed to quantify reasoning-graph consistency (structural isomorphism plus node-type matching). These additions will directly support the claims of structural equivalence, novelty relative to parametric knowledge, and difficulty preservation. revision: yes

  2. Referee: [Verification steps] Verification steps: the knowledge-leakage filter is defined as excluding instances answerable without retrieval and the consistency check as preserving difficulty, but these are post-hoc and the paper does not address whether entity replacement can create easier/harder variants via rarity or domain shift, or whether a single-prompt filter misses partial parametric knowledge of new combinations.

    Authors: We acknowledge that post-hoc filters are not infallible and that type-constrained replacement may inadvertently alter perceived difficulty through entity rarity or subtle domain shifts, while a single-prompt filter could miss partial parametric knowledge of novel combinations. In the revision we will expand the discussion of verification limitations, include a targeted analysis (on a held-out subset) measuring difficulty shifts before and after replacement, and describe an improved multi-prompt ensemble variant of the leakage filter that we will adopt. We will also flag these issues as important directions for future robustness work. revision: partial
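The multi-prompt ensemble variant mentioned above could be as simple as voting over paraphrased closed-book prompts. The prompt templates, `answer_fn`, and the threshold below are illustrative assumptions, not the authors' design.

```python
def ensemble_leaks(question, gold, answer_fn, prompt_templates, threshold=0.5):
    """Flag an instance as leaked if the closed-book model recovers the
    gold answer under at least `threshold` of the prompt paraphrases.
    Voting over several phrasings catches parametric knowledge that a
    single prompt fails to elicit."""
    hits = sum(answer_fn(t.format(q=question)) == gold for t in prompt_templates)
    return hits / len(prompt_templates) >= threshold

prompts = ["{q}", "Answer briefly: {q}", "Q: {q}\nA:"]
# Stub closed-book model: "knows" Hugo, has never seen Arlen.
answer_fn = lambda p: "Paris" if "Hugo" in p else "unknown"

print(ensemble_leaks("Where was Hugo born?", "Paris", answer_fn, prompts))     # True
print(ensemble_leaks("Where was Arlen born?", "Velmora", answer_fn, prompts))  # False
```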

Circularity Check

0 steps flagged

No circularity in SeedRG constructive pipeline

full rationale

The paper presents a descriptive generation pipeline: extract reasoning graph from seed question-context pairs, apply type-constrained entity replacement to create novel instances, then run independent verification via graph-consistency check and knowledge-leakage filter. No equations, fitted parameters, or predictions appear. No self-citations are invoked as load-bearing premises. The steps are self-contained constructive operations that do not reduce any claimed output (leakage-free, difficulty-preserving instances) to a tautological redefinition of the inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 3 axioms · 1 invented entity

The approach rests on several domain assumptions about reasoning graphs and filters that receive no independent evidence in the abstract; no explicit free parameters are introduced, and the only new entity is the SeedRG pipeline itself.

axioms (3)
  • domain assumption Reasoning graph extracted from question-context pairs accurately captures the underlying reasoning structure
    Invoked to enable generation of new instances via entity replacement
  • domain assumption Type-constrained entity replacement preserves reasoning patterns and task difficulty
    Required for the generated examples to remain valid benchmarks
  • domain assumption Knowledge-leakage filter successfully excludes instances answerable without retrieval
    Central to the claim that the output is leakage-free
invented entity (1)
  • SeedRG pipeline no independent evidence
    purpose: Generate semi-synthetic leakage-free RAG benchmarks
    Newly proposed method whose effectiveness is not demonstrated in the abstract

pith-pipeline@v0.9.0 · 5513 in / 1298 out tokens · 47857 ms · 2026-05-12T02:59:39.632266+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

15 extracted references · 15 canonical work pages · 2 internal anchors

  1. [1]

    Prompt leakage effect and mitigation strategies for multi-turn LLM applications

    Divyansh Agarwal, Alexander Richard Fabbri, Ben Risher, Philippe Laban, Shafiq Joty, and Chien-Sheng Wu. Prompt leakage effect and mitigation strategies for multi-turn LLM applications. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track, pp. 1255–1275.

  2. [2]

    Thinkeval: Practical evaluation of knowledge leakage in LLM editing using thought-based knowledge graphs

    Manit Baser, Dinil Mon Divakaran, and Mohan Gurusamy. Thinkeval: Practical evaluation of knowledge leakage in LLM editing using thought-based knowledge graphs. arXiv preprint arXiv:2506.01386.

  3. [3]

    Safegpt: Preventing data leakage and unethical outputs in enterprise LLM use

    Pratyush Desai, Luoxi Tang, Yuqiao Meng, and Zhaohan Xi. Safegpt: Preventing data leakage and unethical outputs in enterprise LLM use. arXiv preprint arXiv:2601.06366.

  4. [4]

    Self-boosting large language models with synthetic preference data

    Qingxiu Dong, Li Dong, Xingxing Zhang, Zhifang Sui, and Furu Wei. Self-boosting large language models with synthetic preference data. arXiv preprint arXiv:2410.06961.

  5. [5]

    From Local to Global: A Graph RAG Approach to Query-Focused Summarization

    Darren Edge, Ha Trinh, Newman Cheng, Joshua Bradley, Alex Chao, Apurva Mody, Steven Truitt, Dasha Metropolitansky, Robert Osazuwa Ness, and Jonathan Larson. From local to global: A graph RAG approach to query-focused summarization. arXiv preprint arXiv:2404.16130.

  6. [6]

    Distilling step-by-step! Outperforming larger language models with less training data and smaller model sizes

    Cheng-Yu Hsieh, Chun-Liang Li, Chih-Kuan Yeh, Hootan Nakhost, Yasuhisa Fujii, Alexander Ratner, Ranjay Krishna, Chen-Yu Lee, and Tomas Pfister. Distilling step-by-step! Outperforming larger language models with less training data and smaller model sizes. arXiv preprint arXiv:2305.02301.

  7. [7]

    Orca: Progressive learning from complex explanation traces of GPT-4

    Subhabrata Mukherjee, Arindam Mitra, Ganesh Jawahar, Sahaj Agarwal, Hamid Palangi, and Ahmed Awadallah. Orca: Progressive learning from complex explanation traces of GPT-4. arXiv preprint arXiv:2306.02707.

  8. [8]

    KILT: a benchmark for knowledge intensive language tasks

    Fabio Petroni, Aleksandra Piktus, Angela Fan, Patrick Lewis, Majid Yazdani, Nicola De Cao, James Thorne, Yacine Jernite, Vladimir Karpukhin, Jean Maillard, et al. KILT: a benchmark for knowledge intensive language tasks. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.

  9. [9]

    OG-RAG: ontology-grounded retrieval-augmented generation for large language models

    Kartik Sharma, Peeyush Kumar, and Yunqing Li. OG-RAG: ontology-grounded retrieval-augmented generation for large language models. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 32950–32969.

  10. [10]

    HotpotQA: A dataset for diverse, explainable multi-hop question answering

    Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics.

  11. [11]

    Hypothetical documents or knowledge leakage? Rethinking LLM-based query expansion

    Yejun Yoon, Jaeyoon Jung, Seunghyun Yoon, and Kunwoo Park. Hypothetical documents or knowledge leakage? Rethinking LLM-based query expansion. In Findings of the Association for Computational Linguistics: ACL 2025, pp. 19170–19187.

  12. [12]

    MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models

    Longhui Yu, Weisen Jiang, Han Shi, Jincheng Yu, Zhengying Liu, Yu Zhang, James T. Kwok, Zhenguo Li, Adrian Weller, and Weiyang Liu. MetaMath: Bootstrap your own mathematical questions for large language models. arXiv preprint arXiv:2309.12284.

  13. [13]

    Benchmark leakage trap: Can we trust LLM-based recommendation?

    Mingqiao Zhang, Qiyao Peng, Yumeng Wang, Chunyuan Liu, and Hongtao Liu. Benchmark leakage trap: Can we trust LLM-based recommendation? arXiv preprint arXiv:2602.13626.

  14. [14]

    Don't make your LLM an evaluation benchmark cheater

    Kun Zhou, Yutao Zhu, Zhipeng Chen, Wentong Chen, Wayne Xin Zhao, Xu Chen, Yankai Lin, Ji-Rong Wen, and Jiawei Han. Don't make your LLM an evaluation benchmark cheater. arXiv preprint arXiv:2311.01964.

  15. [15]

    RAGEval: Scenario specific RAG evaluation dataset generation framework

    Kunlun Zhu, Yifan Luo, Dingling Xu, Yukun Yan, Zhenghao Liu, Shi Yu, Ruobing Wang, Shuo Wang, Yishan Li, Nan Zhang, et al. RAGEval: Scenario specific RAG evaluation dataset generation framework. arXiv preprint arXiv:2408.01262.