pith. machine review for the scientific record. sign in

arxiv: 2604.27852 · v1 · submitted 2026-04-30 · 💻 cs.IR · cs.AI

Recognition: unknown

NeocorRAG: Less Irrelevant Information, More Explicit Evidence, and More Effective Recall via Evidence Chains

Authors on Pith no claims yet

Pith reviewed 2026-05-07 07:12 UTC · model grok-4.3

classification 💻 cs.IR cs.AI
keywords Retrieval-Augmented GenerationEvidence ChainsRecall Conversion RateMulti-hop Question AnsweringRetrieval QualityTraining-free OptimizationConstrained Decoding
0
0 comments X

The pith

NeocorRAG breaks the recall-quality tradeoff in retrieval-augmented generation by mining and applying Evidence Chains, delivering higher reasoning accuracy on far fewer tokens.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Mainstream RAG methods improve raw recall but suffer decaying conversion of that recall into correct reasoning answers, while quality-focused alternatives lose coverage. The paper quantifies this with a new Recall Conversion Rate metric showing near-linear decay as recall rises. It then introduces NeocorRAG, which refines candidate sets via activated search, produces precise Evidence Chains with constrained decoding, and feeds those chains back to optimize retrieval. On HotpotQA, 2WikiMultiHopQA, MuSiQue, and NQ, the result is state-of-the-art accuracy for both 3B and 70B models at under 20 percent of the token cost of comparable systems. A sympathetic reader cares because this offers a training-free way to make retrieved context actually usable for downstream reasoning rather than just more voluminous.

Core claim

The paper claims that retrieval quality optimization requires explicit Evidence Chains—linked sequences of supporting documents that supply traceable reasoning paths. By first running activated search to shrink the candidate space, then applying constrained decoding to guarantee chain validity, and finally using the resulting chains to re-rank and prune the retrieved set, NeocorRAG simultaneously raises recall and quality. This dual improvement raises the Recall Conversion Rate and produces higher end-task accuracy while consuming dramatically fewer tokens.

What carries the argument

Evidence Chains: sequences of connected evidence documents generated by activated search followed by constrained decoding, which are then used to guide final retrieval optimization.

If this is right

  • State-of-the-art accuracy is reached on HotpotQA, 2WikiMultiHopQA, MuSiQue, and NQ for both 3B and 70B models.
  • Token usage drops below 20 percent of comparable retrieval-augmented methods while recall stays high.
  • The approach is entirely training-free and can be applied on top of existing retrievers and generators.
  • The Recall Conversion Rate metric exposes a previously hidden linear decay between recall and reasoning utility in prior RAG systems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Evidence Chains may reduce hallucinations by making the reasoning path between retrieved facts and the final answer more explicit and checkable.
  • The activated-search plus constrained-decoding pattern could be reused in other structured-retrieval settings such as legal or medical document chains where traceability matters.
  • Because the method keeps the final context short, it may extend more easily to very long-context models that currently hit token limits.

Load-bearing premise

The quality gains from Evidence Chains will translate into higher downstream reasoning accuracy on unseen tasks and model scales without creating new failure modes.

What would settle it

Run NeocorRAG on a fresh multi-hop QA dataset never seen during its design and measure whether accuracy gains disappear even though Recall Conversion Rate still rises.

Figures

Figures reproduced from arXiv: 2604.27852 by Haihong E, Jiacheng Liu, Jiayu Huang, Qianhe Zheng, Qing Huang, Rongjin Li, Shiyao Peng, Yifan Zhu, Zhuodi Hao, Zichen Tang.

Figure 1
Figure 1. Figure 1: The inadequacy of recall metrics in capturing re view at source ↗
Figure 2
Figure 2. Figure 2: Comparison of existing methods based on proposed retrieval quality optimization criteria, evaluating two method view at source ↗
Figure 3
Figure 3. Figure 3: Overall framework of NeocorRAG. Our approach enhances retrieval quality through three sequential stages. (a) view at source ↗
Figure 4
Figure 4. Figure 4: Token efficiency comparison (tokens per question view at source ↗
Figure 5
Figure 5. Figure 5: Contribution of evidence chains to the F1 score view at source ↗
Figure 6
Figure 6. Figure 6: F1 Score vs. evidence chain count across model families and sizes. This figure uses Llama-3.2-3B as the QA model view at source ↗
read the original abstract

Although precise recall is a core objective in Retrieval-Augmented Generation (RAG), a critical oversight persists in the field: improvements in retrieval performance do not consistently translate to commensurate gains in downstream reasoning. To diagnose this gap, we propose the Recall Conversion Rate (RCR), a novel evaluation metric to quantify the contribution of retrieval to reasoning accuracy. Our quantitative analysis of mainstream RAG methods reveals that as Recall@5 improves, the RCR exhibits a near-linear decay. We identify the neglect of retrieval quality in these methods as the underlying cause. In contrast, approaches that focus solely on quality optimization often suffer from inferior recall performance. Both categories lack a comprehensive understanding of retrieval quality optimization, resulting in a trade-off dilemma. To address these challenges, we propose comprehensive retrieval quality optimization criteria and introduce the NeocorRAG framework. This framework achieves holistic retrieval quality optimization by systematically mining and utilizing Evidence Chains. Specifically, NeocorRAG first employs an innovative activated search algorithm to obtain a refined candidate space. Then it ensures precise evidence chain generation through constrained decoding. Finally, the retrieved set of evidence chains guides the retrieval optimization process. Evaluated on benchmarks including HotpotQA, 2WikiMultiHopQA, MuSiQue, and NQ, NeocorRAG achieves SOTA performance on both 3B and 70B parameter models, while consuming less than 20% of tokens used by comparable methods. This study presents an efficient, training-free paradigm for RAG enhancement that effectively optimizes retrieval quality while maintaining high recall. Our code is released at https://github.com/BUPT-Reasoning-Lab/NeocorRAG.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces the Recall Conversion Rate (RCR) metric to quantify how much retrieval contributes to downstream reasoning accuracy in RAG. It reports that RCR exhibits near-linear decay as Recall@5 rises across mainstream methods, attributing this to insufficient attention to retrieval quality, and contrasts this with quality-focused methods that sacrifice recall. To resolve the trade-off, the authors propose NeocorRAG, a training-free framework that (1) uses activated search to refine the candidate pool, (2) applies constrained decoding to mine Evidence Chains, and (3) leverages the resulting chains to guide retrieval optimization. The framework is evaluated on HotpotQA, 2WikiMultiHopQA, MuSiQue, and NQ, claiming SOTA results for both 3B and 70B models while consuming less than 20% of the tokens used by comparable baselines.

Significance. If the empirical results and the reliability of Evidence Chain generation are confirmed, NeocorRAG would offer a practical, training-free route to improving RAG reasoning by explicitly structuring high-quality evidence rather than relying on raw recall. The RCR metric provides a useful diagnostic tool for future work. The public code release supports reproducibility. The significance is currently limited by the absence of detailed verification that the core mechanisms (especially constrained decoding) function reliably across model scales.

major comments (3)
  1. [Abstract and §4] Abstract and §4 (Experimental Evaluation): The central claim that NeocorRAG achieves SOTA performance on 3B-parameter models rests on the assumption that constrained decoding reliably produces usable Evidence Chains. No per-model decoding success rates, fallback procedures, or ablations that remove or isolate failed chains are reported. Because smaller models are known to have weaker instruction-following under constraints, the 3B gains could be artifacts of selective evaluation or implicit fallbacks to weaker retrieval, directly undermining the 'no new failure modes' and 'works on both 3B and 70B' assertions.
  2. [§2] §2 (Motivation and RCR Analysis): The observation of near-linear RCR decay is presented as the key motivation for the framework. However, the text supplies no numerical tables, data points, or statistical tests supporting the decay claim or the causal attribution to 'neglect of retrieval quality.' Without these, it is impossible to assess whether the decay is robust or whether the proposed quality criteria actually address the diagnosed problem.
  3. [§3] §3 (NeocorRAG Framework): The description of how Evidence Chains are integrated into the final retrieval optimization step is high-level. It is unclear how the 'comprehensive retrieval quality optimization criteria' are formally defined, how conflicts between chain quality and recall are resolved, and whether the process introduces new biases. This mechanism is load-bearing for the claim that the method simultaneously improves quality and maintains high recall.
minor comments (2)
  1. [Abstract] Abstract: Adding one or two concrete performance deltas or RCR values would make the SOTA claim more informative without lengthening the abstract excessively.
  2. [§4] The paper would benefit from an explicit comparison table that includes recent multi-hop RAG baselines (e.g., those using iterative retrieval or chain-of-thought prompting) rather than only 'comparable methods.'

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. These highlight important areas for improving transparency and rigor. We address each major comment point-by-point below, committing to revisions that add the requested data, formalizations, and analyses without altering the core claims or results.

read point-by-point responses
  1. Referee: [Abstract and §4] Abstract and §4 (Experimental Evaluation): The central claim that NeocorRAG achieves SOTA performance on 3B-parameter models rests on the assumption that constrained decoding reliably produces usable Evidence Chains. No per-model decoding success rates, fallback procedures, or ablations that remove or isolate failed chains are reported. Because smaller models are known to have weaker instruction-following under constraints, the 3B gains could be artifacts of selective evaluation or implicit fallbacks to weaker retrieval, directly undermining the 'no new failure modes' and 'works on both 3B and 70B' assertions.

    Authors: We agree that explicit verification of constrained decoding reliability is necessary, particularly for the 3B model. In the revised manuscript, we will add a new subsection in §4.2 reporting per-model success rates (percentage of queries yielding valid Evidence Chains) for both 3B and 70B models across all four datasets, along with the exact fallback procedure: on decoding failure, the system uses the top-5 passages from activated search without chain optimization. We will also include an ablation table isolating performance on successful-chain queries versus fallback queries, showing that overall gains remain consistent and no new failure modes are introduced beyond those of standard retrieval. These additions directly address the concern and confirm the claims hold across scales. revision: yes

  2. Referee: [§2] §2 (Motivation and RCR Analysis): The observation of near-linear RCR decay is presented as the key motivation for the framework. However, the text supplies no numerical tables, data points, or statistical tests supporting the decay claim or the causal attribution to 'neglect of retrieval quality.' Without these, it is impossible to assess whether the decay is robust or whether the proposed quality criteria actually address the diagnosed problem.

    Authors: The referee correctly identifies that §2 would be strengthened by explicit supporting data. Although the near-linear decay was observed across our experiments, we will revise §2 to include a new table (Table 1) with Recall@5 and RCR values for multiple mainstream methods on HotpotQA and 2WikiMultiHopQA. We will also add a figure with a scatter plot of RCR versus Recall@5, a fitted linear regression, and the associated R² value plus correlation coefficient to quantify the trend. A short paragraph will discuss how the quality-focused criteria in NeocorRAG target the diagnosed cause. These changes make the motivation fully verifiable. revision: yes

  3. Referee: [§3] §3 (NeocorRAG Framework): The description of how Evidence Chains are integrated into the final retrieval optimization step is high-level. It is unclear how the 'comprehensive retrieval quality optimization criteria' are formally defined, how conflicts between chain quality and recall are resolved, and whether the process introduces new biases. This mechanism is load-bearing for the claim that the method simultaneously improves quality and maintains high recall.

    Authors: We acknowledge the description in §3 is high-level and will expand it substantially. In the revision, §3.3 will include formal definitions: the optimization criteria as a constrained multi-objective function maximizing chain coherence + coverage subject to recall@5 ≥ τ (with explicit formulas for each term). We will add pseudocode for the full optimization step and explain conflict resolution via a recall-first lexicographic ordering (quality improvements are accepted only if recall does not drop below threshold). A new paragraph will analyze potential biases, supported by results on the single-hop NQ dataset showing no degradation relative to baselines. These clarifications will make the mechanism transparent while preserving the quality-recall balance claim. revision: yes

Circularity Check

0 steps flagged

No circularity; new metric and algorithmic framework are independently defined and empirically tested

full rationale

The paper defines Recall Conversion Rate (RCR) as a fresh metric quantifying retrieval's contribution to downstream accuracy, reports an empirical observation of near-linear RCR decay versus Recall@5 on mainstream methods, and introduces NeocorRAG as a distinct training-free procedure (activated search + constrained decoding + evidence-chain guidance). No equations, parameters, or predictions are shown to reduce by construction to prior fitted values or self-citations; the SOTA claims on 3B/70B models rest on benchmark evaluations of the new procedure rather than tautological re-labeling of inputs. The derivation chain is therefore self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 2 invented entities

Review performed on abstract only; the central claims rest on the unverified assumption that Evidence Chains improve reasoning conversion and on the domain premise that retrieval quality can be optimized without recall loss. No free parameters are explicitly named in the abstract.

axioms (2)
  • domain assumption Retrieval quality improvements will produce commensurate gains in downstream reasoning accuracy
    This premise underlies both the RCR analysis and the design of NeocorRAG.
  • ad hoc to paper Evidence Chains mined via activated search and constrained decoding constitute higher-quality retrieval sets
    Core mechanism introduced by the framework.
invented entities (2)
  • Evidence Chains no independent evidence
    purpose: Explicit, linked evidence structures that reduce irrelevant information while preserving recall
    New construct central to the optimization loop; no independent falsifiable prediction supplied in abstract.
  • Recall Conversion Rate (RCR) no independent evidence
    purpose: Metric quantifying the fraction of retrieved content that contributes to correct reasoning
    Novel evaluation measure proposed to diagnose the recall-reasoning gap.

pith-pipeline@v0.9.0 · 5642 in / 1623 out tokens · 112089 ms · 2026-05-07T07:12:45.141606+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

57 extracted references · 33 canonical work pages · 3 internal anchors

  1. [1]

    Muhammad Arslan, Hussam Ghanem, Saba Munawar, and Christophe Cruz. 2024. A Survey on RAG with LLMs.Procedia Computer Science246 (2024), 3781–3790. doi:10.1016/j.procs.2024.09.178 28th International Conference on Knowledge Based and Intelligent information and Engineering Systems (KES 2024)

  2. [2]

    Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi

  3. [3]

    InThe Twelfth International Conference on Learning Representations

    Self-RAG: Learning to Retrieve, Generate, and Critique through Self- Reflection. InThe Twelfth International Conference on Learning Representations. https://openreview.net/forum?id=hSyW5go0v8

  4. [4]

    Michael Buckland and Fredric Gey. 1994. The relationship between Recall and Precision.Journal of the American Society for Information Science45, 1 (1994), 12–19. doi:10.1002/(SICI)1097-4571(199401)45:1<12::AID-ASI2>3.0.CO;2-L

  5. [5]

    Tong Chen, Hongwei Wang, Sihao Chen, Wenhao Yu, Kaixin Ma, Xinran Zhao, Hongming Zhang, and Dong Yu. 2024. Dense X Retrieval: What Retrieval Granu- larity Should We Use?. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen (Eds.). Association for Computational Linguist...

  6. [6]

    Nadezhda Chirkova, Thibault Formal, Vassilina Nikoulina, and Stéphane CLIN- CHANT. 2025. Provence: efficient and robust context pruning for retrieval- augmented generation. InThe Thirteenth International Conference on Learning Representations. https://openreview.net/forum?id=TDy5Ih78b4

  7. [7]

    Yang, and Anton Tsitsulin

    Jialin Dong, Bahare Fatemi, Bryan Perozzi, Lin F. Yang, and Anton Tsitsulin

  8. [8]

    Yang, and Anton Tsitsulin

    Don’t Forget to Connect! Improving RAG with Graph-based Reranking. arXiv:2405.18414 [cs.CL] https://arxiv.org/abs/2405.18414

  9. [9]

    Wenqi Fan, Yujuan Ding, Liangbo Ning, Shijie Wang, Hengyun Li, Dawei Yin, Tat-Seng Chua, and Qing Li. 2024. A Survey on RAG Meeting LLMs: Towards Retrieval-Augmented Large Language Models. InProceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining(Barcelona, Spain) (KDD ’24). Association for Computing Machinery, New York, NY, U...

  10. [10]

    Jinyuan Fang, Zaiqiao Meng, and Craig MacDonald. 2024. TRACE the Evidence: Constructing Knowledge-Grounded Reasoning Chains for Retrieval-Augmented Generation. InFindings of the Association for Computational Linguistics: EMNLP 2024, Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen (Eds.). Association for Computational Linguistics, Miami, Florida, USA, 84...

  11. [11]

    Edward Fredkin. 1960. Trie memory.Commun. ACM3, 9 (Sept. 1960), 490–499. doi:10.1145/367390.367400

  12. [12]

    Markus Freitag and Yaser Al-Onaizan. 2017. Beam Search Strategies for Neural Machine Translation. InProceedings of the First Workshop on Neural Machine Translation, Thang Luong, Alexandra Birch, Graham Neubig, and Andrew Finch (Eds.). Association for Computational Linguistics, Vancouver, 56–60. doi:10. 18653/v1/W17-3207

  13. [13]

    Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, Meng Wang, and Haofen Wang. 2024. Retrieval-Augmented Generation for Large Language Models: A Survey. arXiv:2312.10997 [cs.CL] https://arxiv.org/abs/2312.10997

  14. [14]

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, Amy Yang, Angela Fan, et al . 2024. The Llama 3 Herd of Models. arXiv:2407.21783 [cs.AI] https://arxiv.org/abs/2407.21783

  15. [15]

    Bernal Jiménez Gutiérrez, Yiheng Shu, Yu Gu, Michihiro Yasunaga, and Yu Su

  16. [16]

    2024 , pages =

    HippoRAG: Neurobiologically Inspired Long-Term Memory for Large Lan- guage Models. InAdvances in Neural Information Processing Systems, A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang (Eds.), Vol. 37. Curran Associates, Inc., 59532–59569. doi:10.52202/079017-1902

  17. [17]

    Bernal Jiménez Gutiérrez, Yiheng Shu, Weijian Qi, Sizhe Zhou, and Yu Su

  18. [18]

    InProceedings of the 42nd International Conference on Ma- chine Learning (Proceedings of Machine Learning Research, Vol

    From RAG to Memory: Non-Parametric Continual Learning for Large Language Models. InProceedings of the 42nd International Conference on Ma- chine Learning (Proceedings of Machine Learning Research, Vol. 267), Aarti Singh, Maryam Fazel, Daniel Hsu, Simon Lacoste-Julien, Felix Berkenkamp, Tegan Maharaj, Kiri Wagstaff, and Jerry Zhu (Eds.). PMLR, 21497–21515....

  19. [19]

    Rossi, Subhabrata Mukherjee, Xianfeng Tang, Qi He, Zhigang Hua, Bo Long, Tong Zhao, Neil Shah, Amin Javari, Yinglong Xia, and Jiliang Tang

    Haoyu Han, Yu Wang, Harry Shomer, Kai Guo, Jiayuan Ding, Yongjia Lei, Ma- hantesh Halappanavar, Ryan A. Rossi, Subhabrata Mukherjee, Xianfeng Tang, Qi He, Zhigang Hua, Bo Long, Tong Zhao, Neil Shah, Amin Javari, Yinglong Xia, and Jiliang Tang. 2025. Retrieval-Augmented Generation with Graphs (GraphRAG). arXiv:2501.00309 [cs.IR] https://arxiv.org/abs/2501.00309

  20. [20]

    Chawla, Thomas Laurent, Yann LeCun, Xavier Bresson, and Bryan Hooi

    Xiaoxin He, Yijun Tian, Yifei Sun, Nitesh V. Chawla, Thomas Laurent, Yann LeCun, Xavier Bresson, and Bryan Hooi. 2024. G-Retriever: Retrieval-Augmented Gener- ation for Textual Graph Understanding and Question Answering. InAdvances in Neural Information Processing Systems, A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang (...

  21. [21]

    Xanh Ho, Anh-Khoa Duong Nguyen, Saku Sugawara, and Akiko Aizawa. 2020. Constructing A Multi-hop QA Dataset for Comprehensive Evaluation of Reason- ing Steps. InProceedings of the 28th International Conference on Computational Linguistics, Donia Scott, Nuria Bel, and Chengqing Zong (Eds.). International Committee on Computational Linguistics, Barcelona, Sp...

  22. [22]

    Gautier Izacard, Mathilde Caron, Lucas Hosseini, Sebastian Riedel, Piotr Bo- janowski, Armand Joulin, and Edouard Grave. 2022. Unsupervised Dense Infor- mation Retrieval with Contrastive Learning.Transactions on Machine Learning Research(2022). https://openreview.net/forum?id=jKN1pXi7b0

  23. [23]

    Soyeong Jeong, Jinheon Baek, Sukmin Cho, Sung Ju Hwang, and Jong Park

  24. [24]

    Adaptive-RAG: Learning to Adapt Retrieval-Augmented Large Language Models through Question Complexity. InProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), Kevin Duh, Helena Gomez, and Steven Bethard (Eds.). Association for Computational Li...

  25. [25]

    Gary Klein, Brian Moon, and Robert R. Hoffman. 2006. Making Sense of Sense- making 1: Alternative Perspectives .IEEE Intelligent Systems21, 04 (July 2006), 70–73. doi:10.1109/MIS.2006.75

  26. [26]

    and Uszkoreit, Jakob and Le, Quoc and Petrov, Slav

    Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, Kristina Toutanova, Llion Jones, Matthew Kelcey, Ming-Wei Chang, Andrew M. Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. 2019. Natural Questions: A Benchmark for Question Answering Research.Tr...

  27. [27]

    Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. 2020. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. InAdvances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M.F...

  28. [28]

    Xiaoxi Li, Jiajie Jin, Yujia Zhou, Yongkang Wu, Zhonghua Li, Ye Qi, and Zhicheng Dou. 2025. RetroLLM: Empowering Large Language Models to Retrieve Fine- grained Evidence within Generation. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and M...

  29. [29]

    Linhao Luo, Zicheng Zhao, Gholamreza Haffari, Yuan-Fang Li, Chen Gong, and Shirui Pan. 2025. Graph-constrained Reasoning: Faithful Reasoning on Knowl- edge Graphs with Large Language Models. InProceedings of the 42nd Interna- tional Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 267), Aarti Singh, Maryam Fazel, Daniel Hsu, ...

  30. [30]

    Alex Mallen, Akari Asai, Victor Zhong, Rajarshi Das, Daniel Khashabi, and Hannaneh Hajishirzi. 2023. When Not to Trust Language Models: Investigating Effectiveness of Parametric and Non-Parametric Memories. InProceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Anna Rogers, Jordan Boyd-Graber, an...

  31. [31]

    Jianmo Ni, Chen Qu, Jing Lu, Zhuyun Dai, Gustavo Hernandez Abrego, Ji Ma, Vincent Zhao, Yi Luan, Keith Hall, Ming-Wei Chang, and Yinfei Yang. 2022. Large Dual Encoders Are Generalizable Retrievers. InProceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang (Eds.). Association f...

  32. [32]

    Tong Niu, Shafiq Joty, Ye Liu, Caiming Xiong, Yingbo Zhou, and Semih Yavuz

  33. [33]

    Judgerank: Leveraging large language mod- els for reasoning-intensive reranking.arXiv preprint arXiv:2411.00142,

    JudgeRank: Leveraging Large Language Models for Reasoning-Intensive Reranking. arXiv:2411.00142 [cs.CL] https://arxiv.org/abs/2411.00142

  34. [34]

    Qwen, :, An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li,...

  35. [35]

    1995.Okapi at TREC-3

    Stephen E Robertson, Steve Walker, Susan Jones, Micheline M Hancock-Beaulieu, Mike Gatford, et al. 1995.Okapi at TREC-3. British Library Research and Devel- opment Department

  36. [36]

    Diego Sanmartin. 2024. KG-RAG: Bridging the Gap Between Knowledge and Creativity. arXiv:2405.12035 [cs.AI] https://arxiv.org/abs/2405.12035

  37. [37]

    Keshav Santhanam, Omar Khattab, Jon Saad-Falcon, Christopher Potts, and Matei Zaharia. 2022. ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction. InProceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Marine Carpuat, Marie-Catherine de Marn...

  38. [38]

    doi:10.18653/v1/2022.naacl-main.272

  39. [39]

    Parth Sarthi, Salman Abdullah, Aditi Tuli, Shubh Khanna, Anna Goldie, and Christopher D Manning. 2024. RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval. InThe Twelfth International Conference on Learning Representations. https://openreview.net/forum?id=GN921JHCRw

  40. [40]

    Sara Sherif Daoud Saad and Stephanie Silva. 2025. Graph-Enhanced RAG: A Survey of Methods, Architectures, and Performance

  41. [41]

    Wendy A. Suzuki. 2005. Associative Learning and the Hippocampus. doi:10.1037/ e400222005-005

  42. [43]

    M u S i Q ue: Multihop questions via single-hop question composition

    MuSiQue: Multihop Questions via Single-hop Question Composition. Transactions of the Association for Computational Linguistics10 (2022), 539–554. doi:10.1162/tacl_a_00475

  43. [44]

    Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal

  44. [45]

    Interleaving Retrieval with Chain-of-Thought Reasoning for Knowledge-Intensive Multi-Step Questions

    Interleaving Retrieval with Chain-of-Thought Reasoning for Knowledge- Intensive Multi-Step Questions. InProceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki (Eds.). Association for Computational Linguistics, Toronto, Canada, 10014–10037. doi:10....

  45. [46]

    Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou

    Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. 2023. Self-Consistency Im- proves Chain of Thought Reasoning in Language Models. InThe Eleventh Inter- national Conference on Learning Representations. https://openreview.net/forum? id=1PL1NIMMrw

  46. [47]

    Yining Wang, Liwei Wang, Yuanzhi Li, Di He, and Tie-Yan Liu. 2013. A Theoretical Analysis of NDCG Type Ranking Measures. InProceedings of the 26th Annual Conference on Learning Theory (Proceedings of Machine Learning Research, Vol. 30), Shai Shalev-Shwartz and Ingo Steinwart (Eds.). PMLR, Princeton, NJ, USA, 25–54. https://proceedings.mlr.press/v30/Wang13.html

  47. [48]

    Ziting Wang, Haitao Yuan, Wei Dong, Gao Cong, and Feifei Li. 2024. CORAG: A Cost-Constrained Retrieval Optimization System for Retrieval-Augmented Generation. arXiv:2411.00744 [cs.DB] https://arxiv.org/abs/2411.00744

  48. [49]

    Qikai Wei, Huansheng Ning, Chunlong Han, and Jianguo Ding. 2025. A Query- Aware Multi-Path Knowledge Graph Fusion Approach for Enhancing Retrieval- Augmented Generation in Large Language Models. arXiv:2507.16826 [cs.IR] https://arxiv.org/abs/2507.16826

  49. [50]

    Witten, Stefan J

    Ian H. Witten, Stefan J. Boddie, David Bainbridge, and Rodger J. McNab. 2000. Greenstone: a comprehensive open-source digital library software system. InPro- ceedings of the Fifth ACM Conference on Digital Libraries(San Antonio, Texas, USA) (DL ’00). Association for Computing Machinery, New York, NY, USA, 113–121. doi:10.1145/336597.336650

  50. [51]

    Witten, Gordon W

    Ian H. Witten, Gordon W. Paynter, Eibe Frank, Carl Gutwin, and Craig G. Nevill- Manning. 1999. KEA: practical automatic keyphrase extraction. InProceedings of the Fourth ACM Conference on Digital Libraries(Berkeley, California, USA) (DL ’99). Association for Computing Machinery, New York, NY, USA, 254–255. doi:10.1145/313238.313437

  51. [52]

    Shitao Xiao, Zheng Liu, Peitian Zhang, Niklas Muennighoff, Defu Lian, and Jian-Yun Nie. 2024. C-Pack: Packed Resources For General Chinese Embed- dings. InProceedings of the 47th International ACM SIGIR Conference on Re- search and Development in Information Retrieval(Washington DC, USA)(SI- GIR ’24). Association for Computing Machinery, New York, NY, USA...

  52. [53]

    Xin Xie, Ningyu Zhang, Zhoubo Li, Shumin Deng, Hui Chen, Feiyu Xiong, Mosha Chen, and Huajun Chen. 2022. From Discrimination to Generation: Knowledge Graph Completion with Generative Transformer. InCompanion Proceedings of the Web Conference 2022(Virtual Event, Lyon, France)(WWW ’22). Association for Computing Machinery, New York, NY, USA, 162–165. doi:10...

  53. [54]

    Derong Xu, Pengyue Jia, Xiaopeng Li, Yingyi Zhang, Maolin Wang, Qidong Liu, Xiangyu Zhao, Yichao Wang, Huifeng Guo, Ruiming Tang, Enhong Chen, and Tong Xu. 2026. Align-GRAG: Anchor and Rationale Guided Dual Alignment for Graph Retrieval-Augmented Generation. arXiv:2505.16237 [cs.CL] https: //arxiv.org/abs/2505.16237

  54. [55]

    Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. 2018. HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering. InProceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Ellen Riloff, David Chiang, Julia Hockenmaier, and Jun’ichi Tsujii (...

  55. [56]

    Peitian Zhang, Zheng Liu, Shitao Xiao, Zhicheng Dou, and Jian-Yun Nie. 2024. A Multi-Task Embedder For Retrieval Augmented LLMs. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Lun-Wei Ku, Andre Martins, and Vivek Srikumar (Eds.). Association for Computational Linguistics, Bangkok, Thailan...

  56. [57]

    Penghao Zhao, Hailin Zhang, Qinhan Yu, Zhengren Wang, Yunteng Geng, Fangcheng Fu, Ling Yang, Wentao Zhang, Jie Jiang, and Bin Cui. 2026. Retrieval- Augmented Generation for AI-Generated Content: A Survey.Data Science and Engineering(02 Jan 2026). doi:10.1007/s41019-025-00335-5

  57. [58]

    Fantastic Four

    Tolga Şakar and Hakan Emekci. 2025. Maximizing RAG efficiency: A comparative analysis of RAG methods.Natural Language Processing31, 1 (2025), 1–25. doi:10. 1017/nlp.2024.53 A Dataset Details In our experiments withNeocorRAG, we adopt the same Wikipedia snapshot and retrieval pipeline asHippoRAG2. Each sampled ques- tion is matched against the identical do...