
arxiv: 2606.12789 · v1 · pith:ORCFBBVFnew · submitted 2026-06-11 · 💻 cs.CL · cs.IR

How Fine-Grained Should a RAG Benchmark Be? A Hierarchical Framework for Synthetic Question Generation

Pith reviewed 2026-06-27 07:10 UTC · model grok-4.3

classification 💻 cs.CL cs.IR
keywords RAG evaluation · benchmark granularity · synthetic QA generation · hierarchical framework · discriminative power · Coherence Ratio · question complexity

The pith

A hierarchical framework finds that optimal RAG benchmark granularity varies by dimension rather than staying fixed.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents HieraRAG to guide how finely to split question categories when building RAG evaluation sets. It generates thousands of synthetic QA pairs across three dimensions at three different split levels and measures which level best separates performance scores from a fixed retrieval-plus-generation pipeline. Results show complexity reaches its highest spread of scores at the finest split while answer type and linguistic variation reach theirs at the medium split. A coherence check quantifies how cleanly the finer categories sit inside their parent groups and finds clear differences between dimensions. The method is offered as a reusable procedure that any practitioner can run on their own RAG system.

Core claim

HieraRAG defines optimal granularity as the category count that maximizes the standard deviation of generation quality across categories inside one fixed RAG configuration. On 5,872 synthetic pairs drawn from FineWeb-10BT, question complexity shows peak discriminative power of 0.053 at the eight-category level, whereas answer type and linguistic variation both peak at the four-category level. The introduced Coherence Ratio of 0.40 for complexity versus 1.44 for answer type indicates that fine splits subdivide parent categories cleanly in some dimensions but not others. Human review of 110 stratified pairs supports the quality of the synthetic data.
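
The selection rule described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes discriminative power is the population standard deviation of per-category mean quality scores (the source does not say whether the population or sample form is used), and the function names and toy scores are hypothetical.

```python
from statistics import mean, pstdev

def discriminative_power(scores_by_category):
    """Std dev of per-category mean quality (population form is an assumption)."""
    category_means = [mean(scores) for scores in scores_by_category.values()]
    return pstdev(category_means)

def optimal_granularity(levels):
    """Return (best_level, power_by_level); best_level maximizes discriminative power."""
    power = {k: discriminative_power(cats) for k, cats in levels.items()}
    best = max(power, key=power.get)
    return best, power

# Hypothetical toy scores for one dimension; NOT values from the paper.
toy_levels = {
    2: {"simple": [0.60, 0.62], "complex": [0.58, 0.56]},
    4: {"a": [0.70, 0.66], "b": [0.60, 0.58], "c": [0.55, 0.53], "d": [0.50, 0.48]},
}
best_level, power_by_level = optimal_granularity(toy_levels)
print(best_level, power_by_level)
```

In practice each dimension would get its own `levels` mapping, with scores produced by the fixed RAG pipeline under study.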

What carries the argument

HieraRAG, the procedure that generates questions at multiple granularity levels and selects the level maximizing standard deviation of quality scores within a chosen RAG pipeline.

If this is right

  • Benchmark designers should test multiple granularity levels per dimension instead of adopting a single fixed split.
  • The Coherence Ratio supplies an independent check on whether fine categories form meaningful subdivisions of coarser ones.
  • The same generation-plus-scoring loop can be repeated on any new RAG configuration to obtain its own optimal levels.
  • Synthetic data at controlled granularity levels makes it possible to isolate the effect of question properties on measured RAG performance.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If optimal levels turn out to be strongly configuration-dependent, then shared public benchmarks may need to publish their granularity choices alongside the questions.
  • The framework could be applied to additional dimensions such as multi-hop reasoning or domain specificity to test whether the pattern of varying optima generalizes.
  • A natural next measurement would be whether the granularity chosen by this method also improves correlation with human preference rankings of RAG outputs.

Load-bearing premise

That the standard deviation of generation quality across categories within one RAG pipeline is the right quantity for deciding which granularity level is optimal.

What would settle it

Re-running the full generation and scoring pipeline on the same questions but with a different retriever or generator and checking whether the category count that maximizes the standard deviation stays the same or shifts.
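
One way to frame that check, as a rough sketch: compute the best level per configuration and see whether they agree. The configuration names, toy scores, and the use of population standard deviation are all assumptions made for illustration.

```python
from statistics import mean, pstdev

def best_level(levels):
    """Level whose per-category mean scores have the largest std dev."""
    power = {k: pstdev([mean(s) for s in cats.values()]) for k, cats in levels.items()}
    return max(power, key=power.get)

def optimum_is_stable(results_by_config):
    """Return (stable, optima): stable is True if every configuration picks the same level."""
    optima = {name: best_level(levels) for name, levels in results_by_config.items()}
    return len(set(optima.values())) == 1, optima

# Hypothetical scores for two made-up pipeline configurations.
results = {
    "config_A": {
        2: {"x": [0.5], "y": [0.7]},
        4: {"a": [0.60], "b": [0.60], "c": [0.60], "d": [0.62]},
    },
    "config_B": {
        2: {"x": [0.6], "y": [0.6]},
        4: {"a": [0.4], "b": [0.5], "c": [0.7], "d": [0.8]},
    },
}
stable, optima = optimum_is_stable(results)
print(stable, optima)
```

A shift in the optimum between configurations would indicate that the chosen granularity is a property of the pipeline rather than of the question dimension alone.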

Figures

Figures reproduced from arXiv: 2606.12789 by Chase M. Fensore, Eugene Agichtein, Jason Fan, Joyce C. Ho, Kaustubh Dhole.

Figure 1: Hierarchical structure of three question dimensions (QC, AT, LV) across three granularity levels. Each dimension…
Figure 2: Demonstrated Coherence Ratio (𝜌) calculation for two medium to fine-grained splits within AT and LV dimensions. 𝜌 > 2.0 indicates discriminative-yet-aligned children (preferred); 𝜌 < 1.0 suggests poor hierarchical structure. The interaction pattern reveals that when vocabulary is mismatched, complexity becomes irrelevant because retrieval has already failed (MAP=0.164–0.237 for distant vocabulary vs. 0.…
Original abstract

Evaluating retrieval-augmented generation (RAG) systems requires benchmarks that capture diverse question characteristics, yet practitioners lack empirical guidance on which dimensions to vary and at what granularity. We present HieraRAG, a hierarchical framework for studying granularity in RAG benchmark construction, defining optimal granularity as the level that maximizes discriminative power (the standard deviation of generation quality across categories) within a given RAG configuration. As a case study, we generate 5,872 synthetic question-answer (QA) pairs from FineWeb-10BT across 3 dimensions (Question Complexity, Answer Type, Linguistic Variation) at 3 granularity levels (2, 4, and 8 categories). With a BM25+Falcon-3-10B pipeline, optimal granularity varies by dimension: complexity benefits from fine-grained distinctions (discriminative power: 0.053) while answer type and linguistic variation peak at medium granularity. We introduce a Coherence Ratio metric to quantify whether fine-grained splits cleanly subdivide parent categories, revealing structural differences across dimensions (Question Complexity: 0.40 vs. Answer Type: 1.44). Human evaluation of 110 stratified QA pairs confirms synthetic quality. While these specific findings reflect a single configuration, HieraRAG provides a portable procedure and validation metric for practitioners to determine evaluation granularity within their own RAG settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces HieraRAG, a hierarchical framework for studying granularity in RAG benchmark construction. It defines optimal granularity as the level maximizing discriminative power, measured as the standard deviation of RAG generation quality across categories within a configuration. As a case study, it generates 5,872 synthetic QA pairs from FineWeb-10BT across three dimensions (Question Complexity, Answer Type, Linguistic Variation) at three granularity levels (2/4/8 categories), evaluates them with a BM25+Falcon-3-10B pipeline, reports dimension-specific optima (fine-grained for complexity with discriminative power 0.053; medium for the other two), introduces a Coherence Ratio metric (values 0.40 vs. 1.44), and validates synthetic quality via human evaluation of 110 stratified pairs.

Significance. If the central empirical claim holds after correction, the work supplies a portable, replicable procedure and an auxiliary Coherence Ratio for practitioners to select evaluation granularity in their own RAG settings. The dimension-specific pattern, if robust, would be a useful empirical observation for benchmark design; the human validation and scale of the synthetic corpus are positive features.

major comments (1)
  1. [Definition of discriminative power (abstract and case-study metric)] Definition of discriminative power (abstract and the case-study section describing the metric): optimal granularity is selected by maximizing the raw standard deviation of generation quality across categories, yet the number of categories differs across levels (2 vs. 4 vs. 8). No normalization (e.g., division by sqrt(k), by the observed range, or by a null-model expectation) is described. Consequently the reported 0.053 value favoring the 8-category level for Question Complexity, and the medium-level peaks for the other dimensions, may be artifacts of unequal sample sizes rather than structural differences; this directly undermines the central claim that optimal granularity varies by dimension.
minor comments (2)
  1. The abstract states 5,872 pairs but the methods description should explicitly report any data exclusion criteria, per-category counts after filtering, and whether error bars or confidence intervals accompany the reported discriminative-power and Coherence Ratio values.
  2. Table or figure presenting the per-dimension, per-level quality scores and standard deviations would improve readability and allow readers to verify the unnormalized comparison.
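
The null-model expectation mentioned in the major comment can be estimated by simulation. The sketch below is an illustration under stated assumptions, not a procedure from the paper: it assigns items to k categories at random and averages the resulting standard deviation of category means. The score distribution, sample size, and function name are hypothetical. Under random assignment this quantity tends to grow with k simply because each category averages fewer items.

```python
import random
from statistics import mean, pstdev

def null_discriminative_power(scores, k, n_permutations=200, seed=0):
    """Average std dev of category means when items are split into k random groups."""
    rng = random.Random(seed)
    shuffled = list(scores)
    draws = []
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        # Deal the shuffled items round-robin into k near-equal groups.
        groups = [shuffled[i::k] for i in range(k)]
        draws.append(pstdev([mean(g) for g in groups]))
    return mean(draws)

# Hypothetical quality scores with no category structure at all.
data_rng = random.Random(42)
scores = [data_rng.gauss(0.6, 0.1) for _ in range(400)]

null_by_k = {k: null_discriminative_power(scores, k) for k in (2, 4, 8)}
print(null_by_k)
```

Comparing an observed discriminative power against `null_by_k[k]` for the matching k gives a rough sense of how much of the spread exceeds what chance alone would produce.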

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful reading and for identifying a substantive methodological concern with the discriminative power metric. We agree that the lack of normalization for differing numbers of categories is a limitation that requires correction to support the central claim.

Point-by-point responses
  1. Referee: [Definition of discriminative power (abstract and case-study metric)] Definition of discriminative power (abstract and the case-study section describing the metric): optimal granularity is selected by maximizing the raw standard deviation of generation quality across categories, yet the number of categories differs across levels (2 vs. 4 vs. 8). No normalization (e.g., division by sqrt(k), by the observed range, or by a null-model expectation) is described. Consequently the reported 0.053 value favoring the 8-category level for Question Complexity, and the medium-level peaks for the other dimensions, may be artifacts of unequal sample sizes rather than structural differences; this directly undermines the central claim that optimal granularity varies by dimension.

    Authors: We acknowledge the validity of this observation. The manuscript defines discriminative power strictly as the raw standard deviation without any adjustment for the number of categories (k=2,4,8), which can indeed inflate values for finer partitions even under a null model. In the revision we will (1) introduce a normalized variant of the metric (standard deviation divided by sqrt(k), with an alternative using the observed range also reported), (2) recompute all optimal-granularity conclusions under the normalized metric, and (3) present both the original and normalized results side-by-side with a brief discussion of when each may be appropriate. This change will be reflected in the abstract, the metric definition section, the case-study results, and the conclusions.

    Revision: yes
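
The side-by-side reporting described in the response might look like the following sketch. It is hypothetical: the rebuttal names division by sqrt(k) and an observed-range alternative but gives no formulas, so the exact definitions, the function name, and the toy scores here are assumptions.

```python
from math import sqrt
from statistics import mean, pstdev

def power_report(scores_by_category):
    """Raw and normalized discriminative power for one granularity level."""
    means = [mean(s) for s in scores_by_category.values()]
    k = len(means)
    raw = pstdev(means)  # population std dev is an assumption
    observed_range = max(means) - min(means)
    return {
        "k": k,
        "raw_std": raw,
        "std_over_sqrt_k": raw / sqrt(k),
        # Guard against identical category means (zero range).
        "std_over_range": raw / observed_range if observed_range > 0 else 0.0,
    }

# Hypothetical two-category level; NOT data from the paper.
report = power_report({"a": [0.5, 0.5], "b": [0.7, 0.7]})
print(report)
```

Running `power_report` once per level and tabulating the three columns would show whether the chosen optimum changes under each normalization.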

Circularity Check

0 steps flagged

No circularity: empirical definition of optimality via direct std-dev computation on generated data

full rationale

The paper explicitly defines 'optimal granularity' as the level maximizing 'discriminative power' (standard deviation of generation quality across categories) and reports the argmax per dimension after generating 5,872 QA pairs. This is a straightforward empirical procedure with no equations that reduce the reported optimum to a fitted parameter defined in terms of itself, no self-citation load-bearing premises, and no ansatz or uniqueness claims imported from prior work. The Coherence Ratio is introduced as a new metric without circular reduction. The approach is self-contained against external benchmarks and does not match any enumerated circularity pattern.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

Review performed on abstract only; full paper may contain additional parameters or assumptions. The central claim rests on the chosen definition of optimality and the representativeness of synthetic data from FineWeb-10BT.

free parameters (2)
  • granularity levels
    Chose exactly three levels (2, 4, 8 categories) for the case study.
  • synthetic QA count = 5872
    Generated exactly 5,872 pairs for the reported experiments.
axioms (1)
  • domain assumption: Discriminative power, defined as the standard deviation of generation quality across categories, is the correct measure for selecting optimal granularity.
    Explicitly stated in the abstract as the definition used to identify optimal granularity.

pith-pipeline@v0.9.1-grok · 5790 in / 1272 out tokens · 31389 ms · 2026-06-27T07:10:43.988832+00:00 · methodology
