EnterpriseDocBench shows hybrid retrieval edges out BM25 and dense embeddings in end-to-end document pipelines, with weak inter-stage correlations and a gap between 85.5% factual accuracy and 0.40 average completeness.
hub Mixed citations
Ragas: Automated Evaluation of Retrieval Augmented Generation
Mixed citation behavior. Most common role is background (50%).
abstract
We introduce Ragas (Retrieval Augmented Generation Assessment), a framework for reference-free evaluation of Retrieval Augmented Generation (RAG) pipelines. RAG systems are composed of a retrieval and an LLM based generation module, and provide LLMs with knowledge from a reference textual database, which enables them to act as a natural language layer between a user and textual databases, reducing the risk of hallucinations. Evaluating RAG architectures is, however, challenging because there are several dimensions to consider: the ability of the retrieval system to identify relevant and focused context passages, the ability of the LLM to exploit such passages in a faithful way, or the quality of the generation itself. With Ragas, we put forward a suite of metrics which can be used to evaluate these different dimensions \textit{without having to rely on ground truth human annotations}. We posit that such a framework can crucially contribute to faster evaluation cycles of RAG architectures, which is especially important given the fast adoption of LLMs.
hub tools
citation-role summary
citation-polarity summary
roles
background 5representative citing papers
GSAR is a grounding-evaluation framework for multi-agent LLMs that uses a four-way claim typology, evidence-weighted asymmetric scoring, and tiered recovery decisions to detect and mitigate hallucinations.
Long-horizon enterprise AI agents' decisions decompose into four measurable axes, with benchmark experiments on six memory architectures revealing distinct weaknesses and reversing a pre-registered prediction on summarization.
AgentProp-Bench shows substring judging agrees with humans at kappa=0.049, LLM ensemble at 0.432, bad-parameter injection propagates with ~0.62 probability, rejection and recovery are independent, and a runtime fix cuts hallucinations 23pp on GPT-4o-mini but not Gemini-2.0-Flash.
RAGognizer adds a detection head to LLMs for joint training on generation and token-level hallucination detection, yielding SOTA detection and fewer hallucinations in RAG while preserving output quality.
DotRAG reformulates graph retrieval as query-guided path reasoning with Division of Thought, reporting SOTA results on MetaQA and UltraDomain for multi-hop tasks.
StratRAG is a new benchmark dataset for multi-hop retrieval in RAG systems with noisy document pools, where hybrid retrieval reaches Recall@2 of 0.70 but bridge questions remain harder at 0.67.
MultiHop-RAG is a new benchmark dataset demonstrating that existing retrieval-augmented generation systems perform poorly on multi-hop queries requiring retrieval and reasoning over multiple evidence pieces.
A framework using activation-based features from small open-weight proxy models detects LLM hallucinations with higher AUC than ReDeEP on RAGTruth, performing consistently across seven analyzer architectures.
CARE, a context-aware LLM judge, outperforms standard methods when evaluating multi-hop retrieval quality in RAG systems.
W-RAC decouples extraction from semantic planning via structured units and LLM grouping to match traditional retrieval performance at roughly 10x lower LLM token cost.
Introduces a 93-question multimodal RAG benchmark with phrase-level recall and embedding-based hallucination metrics, finding closed-source pipelines outperform open-source ones especially on cross-modal and cross-document tasks.
GraphRAG improves comprehensiveness and diversity of answers to global questions over million-token document sets by constructing entity graphs and hierarchical community summaries before combining partial responses.
CRAFT introduces a query-conditioned pipeline with dynamic keyframe selection, ASR, and a hybrid critic loop that achieves top scores on MAGMaR 2026 for grounded multi-video question answering.
RAG-DIVE uses an LLM to dynamically generate, validate, and evaluate multi-turn dialogues for assessing RAG system performance in interactive settings.
IoDResearch is a private data-centric Deep Research framework that uses FAIR digital objects, atomic knowledge units, heterogeneous graph indexes, and a multi-agent system to outperform standard RAG baselines on retrieval, QA, and report generation tasks.
Proposes an intent-RAG framework that combines RAG, machine reasoning, and generative AI to interpret application intents and generate network intents, outperforming LLMs and vanilla RAG in translation tasks.
Deepchecks is a new multi-faceted evaluation framework for RAG that incorporates root cause analysis and production monitoring to assess reliability, relevance, and user satisfaction.
Argues for a denoising-first paradigm in LLM-oriented information retrieval, framing challenges via a four-stage progression and providing a taxonomy of signal-to-noise optimization techniques across the pipeline.
ragR provides a unified R-native workflow for constructing retrieval-augmented generation systems and evaluating them with LLM-scored RAGAS metrics.
The survey unifies LLM augmentation techniques along the single axis of structured context supplied at inference time and supplies a literature screening protocol plus deployment decision framework.
A survey of RAG paradigms, components, benchmarks, and challenges for improving LLMs on knowledge-intensive tasks.
A survey that categorizes RAG methods for LLMs into four retrieval-centric stages, reviews their evolution and evaluation, and outlines challenges and future directions.
citing papers explorer
-
ragR: Retrieval-Augmented Generation and RAG Assessment in R
ragR provides a unified R-native workflow for constructing retrieval-augmented generation systems and evaluating them with LLM-scored RAGAS metrics.