
arxiv: 2606.05901 · v1 · pith:NWYULQQAnew · submitted 2026-06-04 · 💻 cs.CL · cs.AI

Reducing Hallucinations in Complex Question Answering using Simple Graph-based Retrieval-Augmented Generation (long version)

Pith reviewed 2026-06-28 01:13 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords RAG · hallucinations · graph-based retrieval · complex question answering · Wikipedia QA · MoNaCo benchmark · agentic systems

The pith

A lightweight graph with a simple schema, added to a RAG system, halves hallucinated answers and raises factual precision on complex Wikipedia questions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether a dedicated set of graph query tools added to a retrieval-augmented generation pipeline improves answers to complex questions drawn from the MoNaCo benchmark. It constructs an agentic system that combines vector search with graph traversals over a curated English Wikipedia subset organized by a minimal schema. The graph tools raise precision and recall of factual claims, cut the rate of hallucinated answers by half, and produce the highest fine-grained truthfulness scores among the three conditions tested. These gains occur with only a modest rise in token consumption.

Core claim

The introduction of graph-based tools can significantly increase the precision and recall of factual correctness, can halve the number of hallucinated answers, and achieves the highest fine-grained truthfulness score among the three evaluated scenarios, all with a modest increase in token usage.

What carries the argument

An agentic toolset of vector search and graph query tools operating over a lightweight graph with a simple schema built from curated Wikipedia articles.
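
A minimal sketch of what such a toolset could look like, assuming a toy in-memory corpus and a single hypothetical edge type. The tool names 'vector_search_article' and 'get_backlinks' appear in the paper's agent prompt; the data, the LINKS_TO edge, the word-count embedding, and the function bodies are illustrative assumptions, not the authors' implementation.

```python
from math import sqrt

# Toy corpus (hypothetical): article id -> article text.
ARTICLES = {
    "a1": "Marie Curie physicist chemist Nobel Prize radioactivity",
    "a2": "Pierre Curie physicist Nobel Prize magnetism",
    "a3": "Warsaw capital city of Poland",
}

# Hypothetical minimal schema: one edge type, LINKS_TO, between articles.
LINKS = {"a1": ["a2", "a3"], "a2": ["a1"], "a3": []}


def embed(text):
    """Stand-in embedding: lowercase word counts (a real system would use a model)."""
    counts = {}
    for word in text.lower().split():
        counts[word] = counts.get(word, 0) + 1
    return counts


def cosine(u, v):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(u[w] * v.get(w, 0) for w in u)
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def vector_search_article(query, k=1):
    """Vector tool: return the k article ids most similar to the query."""
    q = embed(query)
    ranked = sorted(ARTICLES, key=lambda a: cosine(q, embed(ARTICLES[a])), reverse=True)
    return ranked[:k]


def get_backlinks(article_id):
    """Graph tool: return the articles that link to the given article."""
    return sorted(src for src, targets in LINKS.items() if article_id in targets)


hits = vector_search_article("capital of Poland")
print(hits, get_backlinks(hits[0]))  # ['a3'] ['a1']
```

The point of the pairing is that vector search finds an entry article, while the graph query reaches related articles that share no vocabulary with the question.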

If this is right

  • Precision and recall of factual correctness rise when graph queries supplement vector search.
  • The count of hallucinated answers drops by approximately half.
  • Fine-grained truthfulness scores reach the highest level among the tested retrieval setups.
  • Token usage increases only modestly compared with the non-graph baselines.
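
For readers unfamiliar with claim-level metrics, the sketch below shows one common way to compute precision and recall over factual claims. It assumes claims have already been extracted and normalized to comparable strings; this page does not specify how the paper extracts or matches claims, so the function name and the exact-match comparison are assumptions.

```python
def claim_precision_recall(predicted_claims, reference_claims):
    """Return (precision, recall) over sets of already-normalized claims."""
    predicted = set(predicted_claims)
    reference = set(reference_claims)
    true_positives = len(predicted & reference)
    # Precision: share of predicted claims that the reference supports.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: share of reference claims that the prediction recovers.
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall


# Two of four predicted claims are supported; two of four reference claims are found.
print(claim_precision_recall(["x", "y", "z", "w"], ["x", "y", "v", "u"]))  # (0.5, 0.5)
```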

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same lightweight graph approach could be applied to proprietary document collections without retraining the underlying language model.
  • Simple graph schemas may generalize to other complex reasoning tasks that require multi-hop factual grounding.
  • Combining the graph tools with additional retrieval methods might yield further reductions in unsupported claims.

Load-bearing premise

The curated subset of English Wikipedia articles together with the chosen simple graph schema and agentic toolset are sufficient to support accurate retrieval for the complex queries in the MoNaCo benchmark.

What would settle it

Running the same agentic system on a different complex QA benchmark drawn from non-Wikipedia sources and finding no reduction in hallucination rate or truthfulness score.

Figures

Figures reproduced from arXiv: 2606.05901 by Christopher J. Wedge, Danny Dixon, Jacek Cała, Joshua Stutter.

Captions are truncated in the source.

Figure 1: A Graph-based Question Answering pipeline embedded within evaluation framework. Highlighted in black are items ...
Figure 2: The architecture of our QA system with three types ...
Figure 3: Graph schema for the English Wikipedia dataset
Figure 5: Fine-grained CRAG scores in three scenarios (left ...
Figure 4: Coarse CRAG scores in three scenarios (left-to ...
Figure 6: Distribution of tools called per question for the ...
Original abstract

Large language models (LLMs) have fundamentally transformed the landscape of Natural Language Processing. Despite these advances, LLMs and LLM-based systems remain prone to a variety of failure modes. Retrieval-augmented generation (RAG) systems have emerged as a common deployment scenario seeking to both avoid the well known risk of the LLM "hallucinating" information, and to enable reasoning and question answering over proprietary information that the LLM did not have access to during training without resorting to expensive model fine-tuning. In this work, we explore the idea of using a lightweight graph structure with a relatively simple graph schema, to support the RAG subsystem via a dedicated toolset. We design an agentic system with a variety of vector search and graph query tools operating over a structured dataset based on a curated subset of English Wikipedia articles, and evaluate its performance on questions from MoNaCo, a challenging Wikipedia QA benchmark of complex query answering tasks. Our results show that the introduction of graph-based tools can significantly increase the precision and recall of factual correctness, can halve the number of hallucinated answers, and achieves the highest fine-grained truthfulness score among the three evaluated scenarios. All this with a modest increase in token usage.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper describes an agentic RAG system that augments an LLM with both vector-search and graph-query tools over a graph constructed from a curated subset of English Wikipedia using a relatively simple schema. It evaluates the system on the MoNaCo benchmark of complex Wikipedia-based QA tasks and reports that the addition of graph-based tools raises precision and recall of factual correctness, halves the rate of hallucinated answers, and produces the highest fine-grained truthfulness score among the three evaluated configurations, at the cost of only a modest increase in token usage.

Significance. If the empirical claims are substantiated, the work would indicate that lightweight graph retrieval can materially reduce hallucinations on multi-hop factual questions without requiring elaborate schemas or fine-tuning, providing a practical, low-overhead extension to standard RAG pipelines.

major comments (2)
  1. [Abstract / system design paragraph] Abstract and system-design paragraph: the headline claim that graph tools raise precision/recall, halve hallucinations, and achieve top truthfulness on MoNaCo presupposes that the chosen simple schema plus agentic tools actually retrieve the facts required by the benchmark queries. No section demonstrates schema coverage of MoNaCo’s multi-hop or relational patterns, nor reports how many benchmark questions lie outside the schema’s expressivity; without this, measured gains cannot be attributed to graph retrieval.
  2. [Evaluation] Evaluation section (implied by abstract): the abstract states positive outcomes on precision, recall, hallucination rate, and truthfulness but supplies no experimental details, baseline definitions, statistical tests, or error analysis. The support for the central claim therefore cannot be verified from the given text.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. We address the two major comments point by point below, indicating where revisions will be made.

Point-by-point responses
  1. Referee: [Abstract / system design paragraph] Abstract and system-design paragraph: the headline claim that graph tools raise precision/recall, halve hallucinations, and achieve top truthfulness on MoNaCo presupposes that the chosen simple schema plus agentic tools actually retrieve the facts required by the benchmark queries. No section demonstrates schema coverage of MoNaCo’s multi-hop or relational patterns, nor reports how many benchmark questions lie outside the schema’s expressivity; without this, measured gains cannot be attributed to graph retrieval.

    Authors: We agree that the manuscript would be strengthened by an explicit analysis of how well the chosen schema covers the relational and multi-hop patterns present in MoNaCo. No such coverage study appears in the current version. In the revision we will add a new subsection (in System Design or Evaluation) that (a) enumerates the schema’s supported relation types and path lengths, (b) samples MoNaCo questions to assess expressivity, and (c) reports an approximate coverage fraction. This will allow readers to judge how much of the measured improvement can be attributed to graph retrieval.
    Revision: yes

  2. Referee: [Evaluation] Evaluation section (implied by abstract): the abstract states positive outcomes on precision, recall, hallucination rate, and truthfulness but supplies no experimental details, baseline definitions, statistical tests, or error analysis. The support for the central claim therefore cannot be verified from the given text.

    Authors: The full manuscript contains an Evaluation section that defines the three configurations (vector-only, graph-only, combined), the MoNaCo benchmark, the factual-correctness and truthfulness metrics, and the reported numerical results. However, the referee is correct that the current text lacks statistical significance tests and a systematic error analysis. We will add both in the revision: bootstrap confidence intervals or paired significance tests for the key deltas, plus a categorized error breakdown (e.g., retrieval failure vs. reasoning failure vs. schema limitation).
    Revision: partial
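
The two analyses promised in the simulated rebuttal, a schema coverage fraction and a paired bootstrap confidence interval for a score delta, could be sketched as follows. Everything here is an assumption made for illustration: the relation names, the hop limit, the question annotations, and the per-question scores are hypothetical and are not taken from the paper.

```python
import random

# Hypothetical schema description: supported relation types and maximum path length.
SUPPORTED_RELATIONS = {"LINKS_TO", "HAS_SECTION", "HAS_INFOBOX"}
MAX_HOPS = 2


def is_covered(question):
    """A question is covered if it needs only supported relations within the hop limit."""
    return set(question["relations"]) <= SUPPORTED_RELATIONS and question["hops"] <= MAX_HOPS


def coverage_fraction(questions):
    """Share of sampled questions the schema can express."""
    if not questions:
        return 0.0
    return sum(is_covered(q) for q in questions) / len(questions)


def paired_bootstrap_ci(scores_a, scores_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean per-question delta (b minus a)."""
    if not scores_a or len(scores_a) != len(scores_b):
        raise ValueError("need two non-empty score lists of equal length")
    rng = random.Random(seed)
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    means = []
    for _ in range(n_resamples):
        # Resample questions with replacement, keeping each pair together.
        resample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(resample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(diffs) / n, lower, upper


# Hypothetical hand-annotated sample of benchmark questions.
sample = [
    {"id": "q1", "relations": ["LINKS_TO"], "hops": 1},
    {"id": "q2", "relations": ["HAS_SECTION", "HAS_INFOBOX"], "hops": 2},
    {"id": "q3", "relations": ["BORN_IN"], "hops": 1},   # unsupported relation
    {"id": "q4", "relations": ["LINKS_TO"], "hops": 4},  # path too long
]
print(coverage_fraction(sample))  # 0.5
```

Resampling the per-question differences, rather than each configuration separately, keeps the comparison paired: both configurations are always evaluated on the same resampled questions.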

Circularity Check

0 steps flagged

No circularity: empirical evaluation on external benchmark

Full rationale

The paper reports experimental results from an agentic RAG system with graph tools evaluated on the independent MoNaCo benchmark. Claims of improved precision/recall, halved hallucinations, and highest truthfulness scores rest on direct measurement against that external test set rather than any derivation, fitted parameters, or self-referential definitions. No equations, ansatzes, or load-bearing self-citations appear in the provided text; the graph schema is presented as a design choice whose coverage is tested via benchmark performance, not assumed by construction. This is a standard self-contained empirical comparison.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work is an empirical system evaluation; the abstract introduces no mathematical free parameters, no unproved axioms beyond standard NLP assumptions, and no new invented entities.

pith-pipeline@v0.9.1-grok · 5758 in / 1054 out tokens · 26406 ms · 2026-06-28T01:13:25.190680+00:00 · methodology


Appendix fragments

The reference extractor also captured pieces of the paper's appendix as if they were citations. They appear to come from the benchmark-dataset discussion, the agent prompt, and the scoring prompt. They are kept below in source order; a passage ending in "..." is truncated in the source.

A BENCHMARK DATASET There are three main advantages to using an external knowledge base as a backend for information retrieval to an LLM. Firstly, an LLM is of a se...

Query fragment: ", t IN texts | CASE WHEN s =

Answer format:

  • A concise answer or list of answers with no additional text,
  • A concise explanation of how you arrived at the answer. If you don't know the answer to the question, return the answer as 'unknown' and in the explanation describe why you can't answer the question. This is a test so you can't ask any clarifying questions, so might need to make assumptions. If you make an assumption provide in the explanation a concise sta...

Example identifier: article:1234:s8:p2:c0

Tool-use steps:

  • start by using the 'vector_search_article' tool to find relevant articles
  • use the 'get_section_titles_and_infoboxes' tool to read the infoboxes and find the relevant sections of those articles
  • use 'get_sections' to retrieve the text and tables of the relevant sections. Sometimes the required information might be in an infobox so you might not need to read any sections. Some sections might be empty as they contained links that aren't in the snapshot, but you can use the 'get_backlinks' tool to find other articles that link to the relevant article. ...

Combined answer template: Answer: {agent_answer}\nExplanation: {explanation}

Scoring rules:

  • If the model returns 'unknown' or says that it couldn't answer the question or it doesn't have enough information to answer the question, then you must return 0
  • If the model makes a prediction, rather than saying it doesn't know, but the prediction does not match any of the provided answers from the Ground Truth Answer list then the prediction is wrong and you must return -1
  • If the model prediction matches all provided answers from the Ground Truth Answer list then the prediction is fully correct and you must return +1
  • If the model prediction matches a subset of the provided answers from the Ground Truth Answer list but some correct answers are missing, then the prediction is partially correct. Only if the prediction does not include any additional incorrect answers, then you must return +0.5
  • If the model prediction includes some correct answers from the Ground Truth Answer list but also includes any incorrect answers (answers not in the Ground Truth Answer list), then model is incorrect, and you must return -0.5.

Scoring input: The question is {user_input}, the model prediction and explanation are {combined_answer}, and the Ground Truth answers are {reference}
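
The five scoring rules quoted above map each prediction to a value in {-1, -0.5, 0, +0.5, +1}. The sketch below encodes them as a deterministic function over sets of answer strings. This is an approximation for illustration only: in the quoted prompt a language model applies the rules to free text, and the function name, the case-insensitive exact matching, and the treatment of "all correct answers plus an extra wrong one" as -0.5 are assumptions about how to read the rules.

```python
def rubric_score(predicted, ground_truth):
    """Score a list of predicted answers against the ground-truth list."""
    pred = {p.strip().lower() for p in predicted}
    truth = {t.strip().lower() for t in ground_truth}

    # Abstention: no answer, or an explicit 'unknown'.
    if not pred or pred == {"unknown"}:
        return 0.0

    correct = pred & truth
    incorrect = pred - truth

    if not correct:
        return -1.0   # a prediction was made but nothing matches
    if incorrect:
        return -0.5   # some correct answers mixed with incorrect ones
    if correct == truth:
        return 1.0    # every ground-truth answer, nothing extra
    return 0.5        # a clean subset of the ground truth


print(rubric_score(["Paris"], ["paris", "Lyon"]))  # 0.5
```

Under this scheme an abstention (0) scores higher than any answer containing an error, which is what makes the score sensitive to hallucinated answers rather than to accuracy alone.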