hub

Ares: An automated evaluation framework for retrieval-augmented generation systems

· 2023 · arXiv 2311.09476

12 Pith papers cite this work. Polarity classification is still indexing.

12 Pith papers citing it

read on arXiv browse 12 citing papers

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 2 dataset 1 method 1

citation-polarity summary

background 2 use dataset 1 use method 1

representative citing papers

AdaGATE: Adaptive Gap-Aware Token-Efficient Evidence Assembly for Multi-Hop Retrieval-Augmented Generation

cs.CL · 2026-05-04 · unverdicted · novelty 7.0

AdaGATE improves evidence F1 scores on HotpotQA for multi-hop RAG under clean, redundant, and noisy conditions by framing selection as gap-aware token-constrained repair, outperforming baselines while using 2.6x fewer tokens.

MultiHop-RAG: Benchmarking Retrieval-Augmented Generation for Multi-Hop Queries

cs.CL · 2024-01-27 · accept · novelty 7.0

MultiHop-RAG is a new benchmark dataset demonstrating that existing retrieval-augmented generation systems perform poorly on multi-hop queries requiring retrieval and reasoning over multiple evidence pieces.

Evidence Graph Consistency in Retrieval-Augmented Generation: A Model-Dependent Analysis of Hallucination Detection

cs.CL · 2026-06-04 · unverdicted · novelty 6.0

EGC reveals that graph consistency measures align with hallucinations in Llama-2 but reverse direction in GPT-4, GPT-3.5 and Mistral-7B on the RAGTruth QA split, indicating model-family-specific hallucination patterns.

Grounded Cache Routing for Retrieval-Augmented Generation: When Is It Safe to Reuse an Answer?

cs.CR · 2026-05-26 · unverdicted · novelty 6.0

GroundedCache reduces unsafe-served rate in RAG answer caching to 0-1.5% (vs 15-51.5% naive) via four validation gates while keeping p50 latency within 1.07x of no-cache baseline.

Agreement Metrics for LLM-as-Judge Evaluation: What to Report and Why

cs.CL · 2026-05-25 · conditional · novelty 6.0

For binary LLM judge validation, Pearson's r, Spearman's ρ, Kendall's τ_b, phi, and Matthews correlation all equal a single number on non-degenerate data, Cohen's κ supplies the extra signal on label-rate drift, and a reporting checklist is provided.

Evaluating Multi-Hop Reasoning in RAG Systems: A Comparison of LLM-Based Retriever Evaluation Strategies

cs.IR · 2026-04-20 · unverdicted · novelty 6.0

CARE, a context-aware LLM judge, outperforms standard methods when evaluating multi-hop retrieval quality in RAG systems.

RAG-DIVE: A Dynamic Approach for Multi-Turn Dialogue Evaluation in Retrieval-Augmented Generation

cs.IR · 2026-01-30 · unverdicted · novelty 5.0

RAG-DIVE uses an LLM to dynamically generate, validate, and evaluate multi-turn dialogues for assessing RAG system performance in interactive settings.

"I Don't Know" -- Towards Appropriate Trust with Certainty-Aware Retrieval Augmented Generation

cs.IR · 2026-05-01 · unverdicted · novelty 4.0

CERTA adds relevance-based certainty estimation to RAG so LLMs can better signal uncertainty on non-objective questions, reducing overconfidence.

Automated Construction of a Knowledge Graph of Nuclear Fusion Energy for Effective Elicitation and Retrieval of Information

cs.CL · 2025-04-10 · unverdicted · novelty 4.0

A multi-step LLM-based pipeline constructs the first knowledge graph for nuclear fusion energy and enables RAG for multi-hop queries.

LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods

cs.CL · 2024-12-07 · accept · novelty 3.0

A survey that organizes LLMs-as-judges research into functionality, methodology, applications, meta-evaluation, and limitations.

Retrieval-Augmented Generation for Large Language Models: A Survey

cs.CL · 2023-12-18 · unverdicted · novelty 3.0

A survey of RAG paradigms, components, benchmarks, and challenges for improving LLMs on knowledge-intensive tasks.

A Survey on Retrieval-Augmented Text Generation for Large Language Models

cs.IR · 2024-04-17 · unverdicted · novelty 2.0

A survey that categorizes RAG methods for LLMs into four retrieval-centric stages, reviews their evolution and evaluation, and outlines challenges and future directions.

citing papers explorer

Showing 12 of 12 citing papers.

AdaGATE: Adaptive Gap-Aware Token-Efficient Evidence Assembly for Multi-Hop Retrieval-Augmented Generation cs.CL · 2026-05-04 · unverdicted · none · ref 25
AdaGATE improves evidence F1 scores on HotpotQA for multi-hop RAG under clean, redundant, and noisy conditions by framing selection as gap-aware token-constrained repair, outperforming baselines while using 2.6x fewer tokens.
MultiHop-RAG: Benchmarking Retrieval-Augmented Generation for Multi-Hop Queries cs.CL · 2024-01-27 · accept · none · ref 19
MultiHop-RAG is a new benchmark dataset demonstrating that existing retrieval-augmented generation systems perform poorly on multi-hop queries requiring retrieval and reasoning over multiple evidence pieces.
Evidence Graph Consistency in Retrieval-Augmented Generation: A Model-Dependent Analysis of Hallucination Detection cs.CL · 2026-06-04 · unverdicted · none · ref 25
EGC reveals that graph consistency measures align with hallucinations in Llama-2 but reverse direction in GPT-4, GPT-3.5 and Mistral-7B on the RAGTruth QA split, indicating model-family-specific hallucination patterns.
Grounded Cache Routing for Retrieval-Augmented Generation: When Is It Safe to Reuse an Answer? cs.CR · 2026-05-26 · unverdicted · none · ref 23
GroundedCache reduces unsafe-served rate in RAG answer caching to 0-1.5% (vs 15-51.5% naive) via four validation gates while keeping p50 latency within 1.07x of no-cache baseline.
Agreement Metrics for LLM-as-Judge Evaluation: What to Report and Why cs.CL · 2026-05-25 · conditional · none · ref 33
For binary LLM judge validation, Pearson's r, Spearman's ρ, Kendall's τ_b, phi, and Matthews correlation all equal a single number on non-degenerate data, Cohen's κ supplies the extra signal on label-rate drift, and a reporting checklist is provided.
Evaluating Multi-Hop Reasoning in RAG Systems: A Comparison of LLM-Based Retriever Evaluation Strategies cs.IR · 2026-04-20 · unverdicted · none · ref 23
CARE, a context-aware LLM judge, outperforms standard methods when evaluating multi-hop retrieval quality in RAG systems.
RAG-DIVE: A Dynamic Approach for Multi-Turn Dialogue Evaluation in Retrieval-Augmented Generation cs.IR · 2026-01-30 · unverdicted · none · ref 28
RAG-DIVE uses an LLM to dynamically generate, validate, and evaluate multi-turn dialogues for assessing RAG system performance in interactive settings.
"I Don't Know" -- Towards Appropriate Trust with Certainty-Aware Retrieval Augmented Generation cs.IR · 2026-05-01 · unverdicted · none · ref 37
CERTA adds relevance-based certainty estimation to RAG so LLMs can better signal uncertainty on non-objective questions, reducing overconfidence.
Automated Construction of a Knowledge Graph of Nuclear Fusion Energy for Effective Elicitation and Retrieval of Information cs.CL · 2025-04-10 · unverdicted · none · ref 28
A multi-step LLM-based pipeline constructs the first knowledge graph for nuclear fusion energy and enables RAG for multi-hop queries.
LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods cs.CL · 2024-12-07 · accept · none · ref 194
A survey that organizes LLMs-as-judges research into functionality, methodology, applications, meta-evaluation, and limitations.
Retrieval-Augmented Generation for Large Language Models: A Survey cs.CL · 2023-12-18 · unverdicted · none · ref 165
A survey of RAG paradigms, components, benchmarks, and challenges for improving LLMs on knowledge-intensive tasks.
A Survey on Retrieval-Augmented Text Generation for Large Language Models cs.IR · 2024-04-17 · unverdicted · none · ref 120
A survey that categorizes RAG methods for LLMs into four retrieval-centric stages, reviews their evolution and evaluation, and outlines challenges and future directions.

Ares: An automated evaluation framework for retrieval-augmented generation systems

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer