pith. machine review for the scientific record.

arxiv: 2604.18234 · v1 · submitted 2026-04-20 · 💻 cs.IR · cs.AI

Recognition: unknown

Evaluating Multi-Hop Reasoning in RAG Systems: A Comparison of LLM-Based Retriever Evaluation Strategies

Lorenz Brehme, Ruth Breu, Thomas Ströhle

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 03:58 UTC · model grok-4.3

classification 💻 cs.IR cs.AI
keywords: RAG · multi-hop reasoning · LLM-as-judge · retriever evaluation · CARE · HotPotQA · context-aware evaluation

The pith

Providing the full set of retrieved contexts improves LLM judges' accuracy in evaluating multi-hop RAG retrieval.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper evaluates different strategies for using large language models as judges to assess the quality of document retrieval in retrieval-augmented generation systems, focusing on multi-hop questions that require combining information from multiple sources. It introduces Context-Aware Retriever Evaluation (CARE), which supplies the entire collection of retrieved contexts to the judge at once rather than assessing documents in isolation. Experiments on HotPotQA, MuSiQue, and SQuAD show that CARE aligns better with expected relevance labels than previous methods, with stronger benefits for larger models that can handle longer contexts; single-hop queries, by contrast, show little sensitivity to the evaluation approach. A sympathetic reader would care because poor retriever evaluation can make RAG systems unreliable precisely when queries are complex.
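To make the isolation failure concrete, here is a hypothetical multi-hop instance expressed in Python; the question, passages, and field names are illustrative assumptions, not records from the paper's datasets.

```python
# Hypothetical multi-hop instance for illustration only; the field names are
# assumptions, not the schema of HotPotQA, MuSiQue, or SQuAD release files.
instance = {
    "question": "In which country was the director of the film Example Story born?",
    "answer": "Norway",
    "retrieved_contexts": [
        "Example Story is a 2011 drama film directed by Anna Berg.",        # hop 1: film -> director
        "Anna Berg is a screenwriter and director born in Oslo, Norway.",   # hop 2: director -> country
    ],
}

# Judged one passage at a time, neither context mentions both the film and the
# country, so a per-document relevance check tends to under-rate both.
# Judged together (the CARE setting), the two contexts jointly support the answer.
```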

Core claim

CARE consistently outperforms existing LLM-based evaluation methods for multi-hop reasoning in RAG systems by evaluating the collective support provided by all retrieved contexts, with performance gains most pronounced in models with larger parameter counts and longer context windows.

What carries the argument

Context-Aware Retriever Evaluation (CARE): an LLM-as-judge strategy that presents the complete retrieved context set at once and asks whether the passages together support the answer to a multi-hop query.
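A minimal sketch of how a per-document baseline and a CARE-style collective check differ in practice, assuming a generic chat-completion client; the prompt wording and the `call_llm` placeholder are assumptions, and the authors' actual prompts live in their repository.

```python
from typing import Callable, Sequence

def judge_per_document(call_llm: Callable[[str], str],
                       question: str, contexts: Sequence[str]) -> list[bool]:
    """Baseline-style check: score each retrieved passage in isolation."""
    verdicts = []
    for ctx in contexts:
        prompt = (f"Question: {question}\nContext: {ctx}\n"
                  "Does this context alone contain the information needed to "
                  "answer the question? Answer yes or no.")
        verdicts.append(call_llm(prompt).strip().lower().startswith("yes"))
    return verdicts

def judge_collectively(call_llm: Callable[[str], str],
                       question: str, contexts: Sequence[str]) -> bool:
    """CARE-style check: present the full retrieved set in a single prompt."""
    joined = "\n\n".join(f"[{i + 1}] {ctx}" for i, ctx in enumerate(contexts))
    prompt = (f"Question: {question}\nRetrieved contexts:\n{joined}\n"
              "Taken together, do these contexts contain all the information "
              "needed to answer the question? Answer yes or no.")
    return call_llm(prompt).strip().lower().startswith("yes")
```

The design difference is only where the context boundary sits: the baseline asks one yes/no question per passage, while the collective check asks a single yes/no question over the whole set, which is what lets distributed multi-hop evidence count.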

If this is right

  • CARE yields larger accuracy improvements when applied to bigger LLMs with extended context lengths.
  • Minimal differences appear between methods when evaluating single-hop queries.
  • The results underscore the importance of context awareness for reliable assessment of RAG retrievers in complex scenarios.
  • The method works across LLMs from different providers such as OpenAI, Meta, and Google.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • RAG systems could adopt similar collective context assessment during the retrieval phase itself to improve document selection for multi-hop queries.
  • The evaluation approach may apply to other domains involving multi-step information synthesis beyond the tested datasets.
  • Additional human studies could validate the LLM judgments to increase confidence in CARE's assessments.

Load-bearing premise

LLM judges can be trusted to determine whether a group of contexts collectively supports a multi-hop answer without further human validation of the judgments.

What would settle it

A study in which human annotators independently judge the same retrieval sets for multi-hop support, with human-judge agreement rates then compared between CARE and the baseline methods, would falsify the claim if CARE showed no improvement in alignment.
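One way such a study could be scored, as a sketch under stated assumptions: the label lists, names, and toy data below are hypothetical, and Cohen's kappa is computed with scikit-learn.

```python
from sklearn.metrics import cohen_kappa_score

def agreement_gap(human_labels, care_labels, baseline_labels):
    """Compare human agreement with CARE vs. a baseline judge.

    All three arguments are aligned 0/1 lists, one label per retrieval set
    (1 = contexts collectively support the answer).
    """
    kappa_care = cohen_kappa_score(human_labels, care_labels)
    kappa_base = cohen_kappa_score(human_labels, baseline_labels)
    # A gap near zero or negative would undercut the central claim.
    return kappa_care - kappa_base

# Toy illustration with made-up labels:
print(agreement_gap([1, 0, 1, 1, 0], [1, 0, 1, 0, 0], [1, 1, 0, 0, 0]))
```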

Figures

Figures reproduced from arXiv: 2604.18234 by Lorenz Brehme, Ruth Breu, Thomas Ströhle.

Figure 1. Illustration of considered evaluation strategies.
Figure 2. Prompt length by approach (* statistically significant difference; Base denotes baseline).
Original abstract

Retrieval-augmented generation (RAG) enhances large language models (LLMs) with external knowledge to answer questions more accurately. However, research on evaluating RAG systems-particularly the retriever component-remains limited, as most existing work focuses on single-context retrieval rather than multi-hop queries, where individual contexts may appear irrelevant in isolation but are essential when combined. In this research, we use the HotPotQA, MuSiQue, and SQuAD datasets to simulate a RAG system and compare three LLM-as-judge evaluation strategies, including our proposed Context-Aware Retriever Evaluation (CARE). Our goal is to better understand how multi-hop reasoning can be most effectively evaluated in RAG systems. Experiments with LLMs from OpenAI, Meta, and Google demonstrate that CARE consistently outperforms existing methods for evaluating multi-hop reasoning in RAG systems. The performance gains are most pronounced in models with larger parameter counts and longer context windows, while single-hop queries show minimal sensitivity to context-aware evaluation. Overall, the results highlight the critical role of context-aware evaluation in improving the reliability and accuracy of retrieval-augmented generation systems, particularly in complex query scenarios. To ensure reproducibility, we provide the complete data of our experiments at https://github.com/lorenzbrehme/CARE.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes Context-Aware Retriever Evaluation (CARE), an LLM-as-judge strategy for assessing multi-hop reasoning in RAG systems. It compares CARE against existing methods on HotPotQA, MuSiQue, and SQuAD by simulating RAG retrieval and using LLMs from OpenAI, Meta, and Google as judges. The central claim is that CARE consistently outperforms baselines, with larger gains for models having more parameters and longer context windows; single-hop queries show little difference. The manuscript provides a GitHub link for full experimental data.

Significance. If validated, the work would usefully highlight limitations of single-context evaluation for multi-hop RAG and offer a practical alternative. The multi-model experiments and public data release are strengths that support reproducibility. However, the significance is currently limited by untested assumptions about judge reliability and simulation fidelity.

major comments (2)
  1. [Experiments and Results] The central claim that CARE 'consistently outperforms' existing methods for multi-hop evaluation rests on LLM judges scoring whether retrieved contexts collectively entail the answer. No human agreement rates (e.g., Cohen's kappa), error analysis on multi-hop cases, or validation of judge reliability are reported anywhere in the experimental results or methodology. This is load-bearing because the performance comparison cannot be trusted without evidence that the judges themselves are accurate on collective support.
  2. [Methodology / Dataset Simulation] The RAG simulation injects gold supporting facts from the datasets rather than outputs from an actual retriever (BM25, dense, etc.). Consequently, the evaluation never encounters the partial, irrelevant, or noisy contexts that real RAG systems produce. This directly undermines generalization of the 'outperforms' result to practical retriever evaluation, as stated in the abstract and introduction.
minor comments (1)
  1. [Abstract and Introduction] The abstract and introduction refer to 'three LLM-as-judge evaluation strategies' but do not explicitly name the two baselines; a clear enumeration in §3 or §4 would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important aspects of validation and experimental design that we address below with planned revisions.

Point-by-point responses
  1. Referee: [Experiments and Results] The central claim that CARE 'consistently outperforms' existing methods for multi-hop evaluation rests on LLM judges scoring whether retrieved contexts collectively entail the answer. No human agreement rates (e.g., Cohen's kappa), error analysis on multi-hop cases, or validation of judge reliability are reported anywhere in the experimental results or methodology. This is load-bearing because the performance comparison cannot be trusted without evidence that the judges themselves are accurate on collective support.

    Authors: We agree that the absence of human validation for the LLM judges limits the strength of our claims regarding absolute reliability. While our experiments compare methods under identical judge conditions, allowing relative differences to be observed, we recognize the need for direct evidence of judge accuracy on collective entailment. In the revised manuscript, we will add a human evaluation study: a random sample of multi-hop instances from each dataset will be annotated by multiple human raters to determine whether the provided contexts collectively support the answer. We will report Cohen's kappa for inter-human agreement and human-LLM agreement, along with a qualitative error analysis of disagreements, particularly on multi-hop cases. This will be included in a new subsection of the experiments. revision: yes

  2. Referee: [Methodology / Dataset Simulation] The RAG simulation injects gold supporting facts from the datasets rather than outputs from an actual retriever (BM25, dense, etc.). Consequently, the evaluation never encounters the partial, irrelevant, or noisy contexts that real RAG systems produce. This directly undermines generalization of the 'outperforms' result to practical retriever evaluation, as stated in the abstract and introduction.

    Authors: The simulation intentionally uses gold supporting facts to isolate the impact of context-aware judgment on multi-hop reasoning without introducing retrieval noise as a confounding factor. This design choice enables a controlled comparison of how judges handle distributed information across contexts. We acknowledge that this does not replicate the partial or irrelevant contexts typical of real retrievers, which restricts direct claims about performance in deployed RAG systems. In the revision, we will update the abstract, introduction, and methodology to explicitly describe the simulation as an idealized setting for evaluating multi-hop judgment strategies. We will also add a dedicated limitations paragraph discussing the gap to real retrievers and outlining future work that applies CARE to outputs from BM25 and dense retrievers on the same datasets. revision: partial
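A minimal sketch of what replacing gold-fact injection with a real sparse retriever could look like, assuming the rank_bm25 package and simple whitespace tokenization; the corpus, query, and cutoff are illustrative, not the authors' setup.

```python
from rank_bm25 import BM25Okapi

# Hypothetical passage pool; in a real run this would be the full
# HotPotQA/MuSiQue/SQuAD passage collection, not four strings.
corpus = [
    "Example Story is a 2011 drama film directed by Anna Berg.",
    "Anna Berg is a screenwriter and director born in Oslo, Norway.",
    "Oslo hosted the 1952 Winter Olympics.",
    "Bergen is a city on Norway's southwestern coast.",
]
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])

query = "In which country was the director of the film Example Story born?"
top_k = bm25.get_top_n(query.lower().split(), corpus, n=2)

# `top_k` now contains noisy, possibly partial evidence; feeding sets like this
# to CARE and the baselines would test how the judges behave off the gold path.
print(top_k)
```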

Circularity Check

0 steps flagged

No circularity: empirical comparison of LLM judges on public datasets

full rationale

The paper conducts an empirical study comparing three LLM-as-judge strategies (including the proposed CARE) for multi-hop RAG evaluation on HotPotQA, MuSiQue, and SQuAD. It simulates retrieval by injecting dataset-provided supporting facts and measures performance via LLM scoring of collective entailment. No equations, fitted parameters, self-definitional constructs, or load-bearing self-citations appear in the derivation chain; the central claim (CARE outperforms baselines) is a direct experimental outcome on fixed public data rather than a reduction to its own inputs by construction. The work is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work is an empirical comparison study with no free parameters, axioms, or invented entities required for the central claim.

pith-pipeline@v0.9.0 · 5538 in / 1066 out tokens · 51065 ms · 2026-05-10T03:58:15.686106+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

31 extracted references · 26 canonical work pages · 5 internal anchors

  1. Afzal, A., Kowsik, A., Fani, R., Matthes, F.: Towards optimizing and evaluating a retrieval augmented QA chatbot using LLMs with human in the loop (2024). https://doi.org/10.48550/arXiv.2407.05925
  2. AI@Meta: Llama 3.1 model card (2024). https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/MODEL_CARD.md
  3. Alinejad, A., Kumar, K., Vahdat, A.: Evaluating the retrieval component in LLM-based question answering systems (2024). https://doi.org/10.48550/arXiv.2406.06458
  4. Bai, Y., Tu, S., Zhang, J., Peng, H., Wang, X., Lv, X., Cao, S., Xu, J., Hou, L., Dong, Y., Tang, J., Li, J.: LongBench v2: Towards deeper understanding and reasoning on realistic long-context multitasks (2025). https://doi.org/10.48550/arXiv.2412.15204
  5. Brehme, L., Dornauer, B., Ströhle, T., Ehrhart, M., Breu, R.: Retrieval-augmented generation in industry: An interview study on use cases, requirements, challenges, and evaluation. https://doi.org/10.48550/arXiv.2508.14066
  6. Brehme, L., Ströhle, T., Breu, R.: Can LLMs be trusted for evaluating RAG systems? A survey of methods and datasets (2025). https://doi.org/10.1109/SDS66131.2025.00010
  7. Brown, T.B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D.M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., et al.: Language models are few-shot learners (2020)
  8. DeepMind, G.: Gemini models | Gemini API (2025). https://ai.google.dev/gemini-api/docs/models
  9. Ding, T., Banerjee, A., Mombaerts, L., Li, Y., Borogovac, T., Weinstein, J.P.D.l.C.: VERA: Validation and evaluation of retrieval-augmented systems (2024). https://doi.org/10.48550/arXiv.2409.03759
  10. Efron, B., Tibshirani, R.J.: An Introduction to the Bootstrap. Chapman and Hall/CRC, New York (1994). https://doi.org/10.1201/9780429246593
  11. Es, S., James, J., Espinosa-Anke, L., Schockaert, S.: RAGAS: Automated evaluation of retrieval augmented generation (2023). https://doi.org/10.48550/arXiv.2309.15217
  12. Friel, R., Belyi, M., Sanyal, A.: RAGBench: Explainable benchmark for retrieval-augmented generation systems (2024). https://doi.org/10.48550/arXiv.2407.11005
  13. Gao, Y., Xiong, Y., Gao, X., Jia, K., Pan, J., Bi, Y., Dai, Y., Sun, J., Wang, M., Wang, H.: Retrieval-augmented generation for large language models: A survey (2024). https://doi.org/10.48550/arXiv.2312.10997
  14. Kukreja, S., Kumar, T., Bharate, V., Purohit, A., Dasgupta, A., Guha, D.: Performance evaluation of vector embeddings with retrieval-augmented generation (2024). https://doi.org/10.1109/ICCCS61882.2024.10603291
  15. Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W.t., Rocktäschel, T., Riedel, S., Kiela, D.: Retrieval-augmented generation for knowledge-intensive NLP tasks (2021). http://arxiv.org/abs/2005.11401
  16. Li, T., Zhang, G., Do, Q.D., Yue, X., Chen, W.: Long-context LLMs struggle with long in-context learning (2024). https://doi.org/10.48550/arXiv.2404.02060
  17. Liu, J., Ding, R., Zhang, L., Xie, P., Huang, F.: CoFE-RAG: A comprehensive full-chain evaluation framework for retrieval-augmented generation with enhanced data diversity (2024). https://doi.org/10.48550/arXiv.2410.12248
  18. Brehme, L., Ströhle, T., Breu, R.: lorenzbrehme/CARE (GitHub repository). https://github.com/lorenzbrehme/CARE
  19. Moreira, G.d.S.P., Ak, R., Schifferer, B., Xu, M., Osmulski, R., Oldridge, E.: Enhancing Q&A text retrieval with ranking models: Benchmarking, fine-tuning and deploying rerankers for RAG. https://doi.org/10.48550/arXiv.2409.07691
  20. OpenAI: Model - OpenAI API (2025). https://platform.openai.com
  21. Rackauckas, Z., Câmara, A., Zavrel, J.: Evaluating RAG-Fusion with RAGElo: An automated Elo-based framework (2024). https://doi.org/10.48550/arXiv.2406.14783
  22. Rajpurkar, P., Jia, R., Liang, P.: Know what you don't know: Unanswerable questions for SQuAD (2018). https://doi.org/10.48550/arXiv.1806.03822
  23. Saad-Falcon, J., Khattab, O., Potts, C., Zaharia, M.: ARES: An automated evaluation framework for retrieval-augmented generation systems (2024). https://doi.org/10.48550/arXiv.2311.09476
  24. Salemi, A., Zamani, H.: Evaluating retrieval quality in retrieval-augmented generation (2024). https://doi.org/10.48550/arXiv.2404.13781
  25. Tang, Y., Yang, Y.: MultiHop-RAG: Benchmarking retrieval-augmented generation for multi-hop queries (2024). https://doi.org/10.48550/arXiv.2401.15391
  26. Trivedi, H., Balasubramanian, N., Khot, T., Sabharwal, A.: MuSiQue: Multihop questions via single-hop question composition. Transactions of the Association for Computational Linguistics (2022)
  27. Trotman, A., Puurula, A., Burgess, B.: Improvements to BM25 and language models examined. In: Proceedings of the 19th Australasian Document Computing Symposium, pp. 58-65. ADCS '14, Association for Computing Machinery (2014). https://doi.org/10.1145/2682862.2682863
  28. Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E., Le, Q., Zhou, D.: Chain-of-thought prompting elicits reasoning in large language models (2023). https://doi.org/10.48550/arXiv.2201.11903
  29. Wu, N., Gong, M., Shou, L., Liang, S., Jiang, D.: Large language models are diverse role-players for summarization evaluation (2023). https://doi.org/10.48550/arXiv.2303.15078
  30. Xu, L., Lian, J., Zhao, W.X., Gong, M., Shou, L., Jiang, D., Xie, X., Wen, J.R.: Negative sampling for contrastive representation learning: A review. https://doi.org/10.48550/arXiv.2206.00212
  31. Yang, Z., Qi, P., Zhang, S., Bengio, Y., Cohen, W.W., Salakhutdinov, R., Manning, C.D.: HotpotQA: A dataset for diverse, explainable multi-hop question answering (2018). https://doi.org/10.48550/arXiv.1809.09600