{"total":12,"items":[{"citing_arxiv_id":"2606.06748","ref_index":25,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Evidence Graph Consistency in Retrieval-Augmented Generation: A Model-Dependent Analysis of Hallucination Detection","primary_cat":"cs.CL","submitted_at":"2026-06-04T22:19:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"EGC reveals that graph consistency measures align with hallucinations in Llama-2 but reverse direction in GPT-4, GPT-3.5 and Mistral-7B on the RAGTruth QA split, indicating model-family-specific hallucination patterns.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.27494","ref_index":23,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Grounded Cache Routing for Retrieval-Augmented Generation: When Is It Safe to Reuse an Answer?","primary_cat":"cs.CR","submitted_at":"2026-05-26T16:50:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"GroundedCache reduces unsafe-served rate in RAG answer caching to 0-1.5% (vs 15-51.5% naive) via four validation gates while keeping p50 latency within 1.07x of no-cache baseline.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.00093","ref_index":33,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Agreement Metrics for LLM-as-Judge Evaluation: What to Report and Why","primary_cat":"cs.CL","submitted_at":"2026-05-25T07:31:44+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"For binary LLM judge validation, Pearson's r, Spearman's ρ, Kendall's τ_b, phi, and Matthews correlation all equal a single number on non-degenerate data, Cohen's κ supplies the extra signal on 
label-rate drift, and a reporting checklist is provided.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.05245","ref_index":25,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"AdaGATE: Adaptive Gap-Aware Token-Efficient Evidence Assembly for Multi-Hop Retrieval-Augmented Generation","primary_cat":"cs.CL","submitted_at":"2026-05-04T14:45:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"AdaGATE improves evidence F1 scores on HotpotQA for multi-hop RAG under clean, redundant, and noisy conditions by framing selection as gap-aware token-constrained repair, outperforming baselines while using 2.6x fewer tokens.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.00957","ref_index":37,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"\"I Don't Know\" -- Towards Appropriate Trust with Certainty-Aware Retrieval Augmented Generation","primary_cat":"cs.IR","submitted_at":"2026-05-01T12:49:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"CERTA adds relevance-based certainty estimation to RAG so LLMs can better signal uncertainty on non-objective questions, reducing overconfidence.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.18234","ref_index":23,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Evaluating Multi-Hop Reasoning in RAG Systems: A Comparison of LLM-Based Retriever Evaluation Strategies","primary_cat":"cs.IR","submitted_at":"2026-04-20T13:20:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CARE, a context-aware LLM judge, outperforms standard methods when 
evaluating multi-hop retrieval quality in RAG systems.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"LLM to process the retrieved data together with the input query to generate the final response. Evaluating all three components is essential for assessing the performance of a RAG system and ensuring it produces accurate, well-grounded responses: While the generator component has been extensively studied-with approaches such as RAGAS [11] and Ares [23] proposing methodologies to assess the quality of RAG-generated responses-evaluation of the retrieval and indexing components has received comparatively less attention [6]. In particular, assessment of indexing methods often relies on system-level performance metrics such as throughput and latency [14]. Our focus is on retriever evaluation, which assesses the relevance of the con-"},{"citing_arxiv_id":"2604.16310","ref_index":28,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"RAG-DIVE: A Dynamic Approach for Multi-Turn Dialogue Evaluation in Retrieval-Augmented Generation","primary_cat":"cs.IR","submitted_at":"2026-01-30T15:32:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"RAG-DIVE uses an LLM to dynamically generate, validate, and evaluate multi-turn dialogues for assessing RAG system performance in interactive settings.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2504.07738","ref_index":28,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Automated Construction of a Knowledge Graph of Nuclear Fusion Energy for Effective Elicitation and Retrieval of Information","primary_cat":"cs.CL","submitted_at":"2025-04-10T13:29:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"UNKNOWN","novelty_score":4.0,"formal_verification":"none","one_line_summary":"A multi-step 
LLM-based pipeline constructs the first knowledge graph for nuclear fusion energy and enables RAG for multi-hop queries.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2412.05579","ref_index":194,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods","primary_cat":"cs.CL","submitted_at":"2024-12-07T08:07:24+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"A survey that organizes LLMs-as-judges research into functionality, methodology, applications, meta-evaluation, and limitations.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"[86], Tessler et al. [214], Sotopia [298] META-EVALUATION(§6) Benchmarks (§6.1) Code Generation(§6.1.1) HumanEval [30], SWEBench [101], DevAI [303], CrossCodeEval [50], CodeUltraFeedback [243] Machine Translation(§6.1.2) Freitag et al. [66], Literary Translation Comparisons [104], MQM [65] Text Summarization(§6.1.3) SummEval [59], FRANK [169], OpinsummEval [194] Dialogue Generation(§6.1.4) Topical-Chat [72], PERSONA-CHAT [283], Mehri et al. [156], DSTC10 Track 5 [272, 279] Automatic StoryGeneration (§6.1.5)HANNA [34], MANS [73], OpenMEVA [73], StoryER [26], PERSER [229] Values Alignment(§6.1.6) PKU-SafeRLHF [96], HHH [6], CVALUES [255] Recommendation(§6.1.7) MovieLens [80], Zhang et al. 
[284], Yelp [4] Search (§6."},{"citing_arxiv_id":"2404.10981","ref_index":120,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"A Survey on Retrieval-Augmented Text Generation for Large Language Models","primary_cat":"cs.IR","submitted_at":"2024-04-17T01:27:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":2.0,"formal_verification":"none","one_line_summary":"A survey that categorizes RAG methods for LLMs into four retrieval-centric stages, reviews their evolution and evaluation, and outlines challenges and future directions.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"across different modalities are required to produce outputs that are both contextually relevant and visually coherent. Recent advancements in multimodal RAG, such as MuRAG [17], REVEAL [49], and Re-ViLM [152], have shown potential in incorporating multimodal retrieval and generation into real-world applications like visual question answering [18], image captioning [120], and text-to-audio generation [158]. Moving forward, research will likely focus on refining these techniques, especially in scaling multimodal retrieval to handle larger datasets and more complex queries. 
Extending retrieval capabilities to include more diverse media types, such as video and speech, also represents a promising direction for the continued evolution of RAG systems."},{"citing_arxiv_id":"2401.15391","ref_index":19,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MultiHop-RAG: Benchmarking Retrieval-Augmented Generation for Multi-Hop Queries","primary_cat":"cs.CL","submitted_at":"2024-01-27T11:41:48+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"MultiHop-RAG is a new benchmark dataset demonstrating that existing retrieval-augmented generation systems perform poorly on multi-hop queries requiring retrieval and reasoning over multiple evidence pieces.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2312.10997","ref_index":165,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Retrieval-Augmented Generation for Large Language Models: A Survey","primary_cat":"cs.CL","submitted_at":"2023-12-18T07:47:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"A survey of RAG paradigms, components, benchmarks, and challenges for improving LLMs on knowledge-intensive tasks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"quantitative metrics that not only gauge RAG model performance but also enhance comprehension of the model's capabilities across various evaluation aspects. Prominent benchmarks such as RGB, RECALL and CRUD [167]-[169] focus on appraising the essential abilities of RAG models. Concurrently, state-of-the-art automated tools like RAGAS [164], ARES [165], and TruLens 8 employ LLMs to adjudicate the quality scores. 
These tools and benchmarks collectively form a robust framework for the systematic evaluation of RAG models, as summarized in Table IV. VII. DISCUSSION AND FUTURE PROSPECTS Despite the considerable progress in RAG technology, several challenges persist that warrant in-depth research. This"}],"limit":50,"offset":0}