pith. machine review for the scientific record.

arxiv: 2604.23515 · v1 · submitted 2026-04-26 · 📊 stat.CO

Recognition: unknown

ragR: Retrieval-Augmented Generation and RAG Assessment in R

Chi-Kuang Yeh, Muhammad Aimal Rehman, Zhili Lu

Pith reviewed 2026-05-08 04:58 UTC · model grok-4.3

classification 📊 stat.CO
keywords ragR · RAG · RAGAS · retrieval-augmented generation · R package · LLM evaluation · context precision · faithfulness

The pith

The ragR package unifies RAG construction and evaluation in R with metric behavior matching Python RAGAS.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces ragR, an R package that handles the full retrieval-augmented generation process from document ingestion through embedding, retrieval, generation, and structured logging. It adds LLM-based scoring for the four core RAGAS metrics of context precision, context recall, faithfulness, and answer relevance. Controlled validation experiments show that the R implementation produces metric values similar to the established Python RAGAS workflow across several use cases. A reader would care because this setup lets R users build, run, and assess RAG systems without switching languages, supporting reproducible work in statistical computing.

Core claim

The authors create ragR to combine document ingestion, embedding and vector storage, similarity-based retrieval, grounded generation, question-answer logging, and RAGAS-style evaluation, with LLM scoring for context precision, context recall, faithfulness, and answer relevance, all inside R. Validation under controlled settings demonstrates that ragR captures metric behavior similar to the reference Python RAGAS workflow across multiple use cases.
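
A minimal sketch of what such a workflow could look like in practice. The function names below (ragr_ingest, ragr_query, ragr_ragas) are hypothetical placeholders, not confirmed ragR exports; the argument names and defaults mirror the interface excerpt reproduced with Figure 4.

```r
# Hypothetical end-to-end sketch; function names are placeholders,
# NOT confirmed ragR exports. Argument names and defaults follow the
# interface excerpt shown with Figure 4.
library(ragR)

# 1. Ingest: chunk, embed, and store documents in an RDS-backed vector store.
ragr_ingest(path = "docs/", collection = "syllabus")

# 2. Retrieve and generate: embed the query, fetch similar chunks,
#    and produce a grounded answer that is written to the QA log.
ans <- ragr_query(
  question          = "When is the final exam?",
  collection        = "syllabus",
  temperature       = 0,
  max_output_tokens = 2000L,
  score_threshold   = 0,
  system_prompt     = "You are a helpful assistant."
)

# 3. Evaluate: LLM-based scoring of the four RAGAS metrics on the logged pair.
scores <- ragr_ragas(ans)
print(scores)
```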

What carries the argument

The ragR package, which supplies an R-native workflow for the complete RAG pipeline plus LLM-based scoring of the four RAGAS metrics.
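
For orientation, the four metrics ragR re-implements reduce, under the published RAGAS definitions, to simple ratios and similarity averages over pieces extracted by an LLM. A toy illustration of two of them, not ragR's internal code:

```r
# Toy arithmetic behind two of the four RAGAS metrics, following the
# published RAGAS definitions; this is NOT ragR source code.

# Faithfulness: fraction of answer claims supported by the retrieved
# context (claim extraction and verification are delegated to an LLM).
supported_claims <- 7
total_claims     <- 8
faithfulness     <- supported_claims / total_claims  # 0.875

# Answer relevance: mean cosine similarity between the embedding of the
# original question and embeddings of questions an LLM generates back
# from the answer.
cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
q  <- c(0.12, 0.80, 0.35)            # toy embedding of the user question
qs <- list(c(0.10, 0.78, 0.40),      # toy embeddings of generated questions
           c(0.15, 0.82, 0.30))
answer_relevance <- mean(vapply(qs, cosine, numeric(1), b = q))
```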

If this is right

  • R users can perform end-to-end RAG work without Python dependencies.
  • Reproducible RAG experiments and teaching become possible inside the R ecosystem.
  • Moderate-scale RAG testing can use only R tools and libraries.
  • Structured logging of question-answer pairs supports further analysis in R (see the sketch below).
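
As a sketch of that last point, a QA log exposed as an R data frame could be summarized with ordinary base R. The column names here are assumptions about the log schema, not documented ragR fields.

```r
# Hedged sketch: summarizing a hypothetical QA log with base R.
# Column names (question, faithfulness, answer_relevance) are assumed,
# not documented ragR fields.
qa_log <- data.frame(
  question         = c("Q1", "Q2", "Q3"),
  faithfulness     = c(0.90, 0.75, 1.00),
  answer_relevance = c(0.88, 0.81, 0.95)
)

# Per-metric summaries: the kind of downstream analysis R users
# would run on logged RAG interactions.
summary(qa_log[, c("faithfulness", "answer_relevance")])
colMeans(qa_log[, c("faithfulness", "answer_relevance")])
```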

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • R's statistical tools could now be applied directly to analyze RAG performance distributions.
  • The package could be extended to compare different embedding or retrieval methods using R's existing data analysis functions.
  • This lowers the barrier for statisticians to test RAG ideas without adopting a second programming language.

Load-bearing premise

That the R implementation of LLM-based scoring for the four RAGAS metrics produces results comparable to the Python reference without systematic differences from language-specific libraries or random seeds.

What would settle it

A side-by-side run of the same documents, queries, models, and prompts in both ragR and Python RAGAS that checks whether the four metric scores differ beyond ordinary random variation.
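
Concretely, such a check could take paired per-item scores from both implementations and test whether their differences exceed noise. A sketch under assumed data, not the paper's actual analysis:

```r
# Hedged sketch of the proposed settling experiment: identical inputs run
# through both ragR and Python RAGAS, then a paired comparison per metric.
# The score vectors below are illustrative, not results from the paper.
ragr_scores  <- c(0.91, 0.78, 0.85, 0.80, 0.93)  # e.g., faithfulness via ragR
ragas_scores <- c(0.89, 0.80, 0.84, 0.82, 0.95)  # same items via Python RAGAS

# Paired t-test on per-item differences: does the mean difference
# exceed ordinary random variation?
t.test(ragr_scores, ragas_scores, paired = TRUE)

# Mean absolute difference as a plain-language agreement measure.
mean(abs(ragr_scores - ragas_scores))
```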

Figures

Figures reproduced from arXiv: 2604.23515 by Chi-Kuang Yeh, Muhammad Aimal Rehman, Zhili Lu.

Figure 1. Generic RAG pipeline: a user query is embedded, relevant context is retrieved from a vector store, and the retrieved context is added to the prompt for answer generation.

Figure 2. Complete architecture of the ragR framework. Documents are processed during ingestion, embedded, and stored in an RDS-backed vector store. User queries are embedded and used for similarity-based retrieval over a selected collection. The retrieved context is incorporated into the prompt for language model generation. Interactions are stored in QA logs and subsequently evaluated by the RAGAS module.

Figure 3. Ingestion workflow and storage structure. In the upper part, each input document is segmented into chunks, embedded, and written to the vector store. In the lower part, the figure shows the main fields stored for each record, including the collection name, chunk identifier, text content, embedding vector, and source metadata.

Figure 4. RAG module workflow in ragR: query embedding, context retrieval from the vector store, prompt construction, grounded generation, and interaction logging. The accompanying interface excerpt shows defaults such as temperature = 0, max_output_tokens = 2000L, score_threshold = 0, and system_prompt = "You are a helpful assistant."; the question argument supplies the user query, and collection selects the collection used for retrieval.

Figure 5. Course Syllabus Application: comparison of ragR and Python RAGAS across chunk sizes for Context Precision, Context Recall, Faithfulness, and Answer Relevance.

Figure 6. Course Syllabus Application: comparison of ragR and Python RAGAS across chunk sizes for RAGAS Overall.

Figure 7. USMLE Anatomy Assistant: comparison of ragR and Python RAGAS across chunk sizes for RAGAS Overall.

Figure 8. Policy Brief Assistant: comparison of ragR and Python RAGAS across chunk sizes for RAGAS Overall.
Original abstract

Retrieval-augmented generation (RAG) combines document retrieval with large language models to produce responses grounded in external evidence. While several R packages support core components of RAG workflows, integrated evaluation of RAG systems in R remains limited and is often conducted through Python-based tools, most notably the RAG assessment (RAGAS) framework. To address this gap, we introduce ragR, an R package that unifies document ingestion, embedding and vector storage, similarity-based retrieval, grounded generation, structured question-answer logging, and RAGAS-style evaluation within a single R-native workflow. The current implementation provides LLM-based scoring for four core RAGAS metrics: context precision, context recall, faithfulness, and answer relevance. Validation experiments under controlled settings show that ragR captures similar metric behavior to the reference Python RAGAS workflow across multiple use cases. By integrating RAG construction and evaluation within a reproducible workflow in R, ragR provides a practical framework for research, teaching, and moderate-scale experimentation on RAG systems entirely within the R ecosystem.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The manuscript introduces the ragR R package, which unifies document ingestion, embedding, vector storage, similarity-based retrieval, grounded generation, question-answer logging, and LLM-based scoring for four RAGAS metrics (context precision, context recall, faithfulness, and answer relevance) into a single R-native workflow. The central claim is that validation experiments under controlled settings demonstrate that ragR captures similar metric behavior to the reference Python RAGAS implementation across multiple use cases.

Significance. If the equivalence claim holds, ragR would provide a practical contribution by enabling fully reproducible RAG construction and evaluation within the R ecosystem, reducing the need to switch to Python for assessment and supporting teaching and moderate-scale research. The integration of all components in one package is a clear strength for workflow reproducibility.

major comments (2)
  1. [Abstract] The claim that 'validation experiments under controlled settings show that ragR captures similar metric behavior to the reference Python RAGAS workflow' lacks any details on datasets, LLMs/endpoints, temperature/seed settings, prompt templates, or quantitative agreement measures (e.g., correlations or mean differences). This directly undermines the ability to confirm that observed similarity is not due to language-specific differences in prompt serialization, API parsing, or sampling, as the central equivalence claim requires matched conditions.
  2. [Validation experiments] No experimental setup, results tables, or statistical comparisons are described, leaving the supporting evidence for the key claim of metric equivalence unsubstantiated and preventing assessment of whether R-specific libraries introduce systematic shifts relative to Python RAGAS.
minor comments (3)
  1. [Abstract] The abstract should include a citation to the original RAGAS framework paper to properly attribute the four metrics being re-implemented.
  2. [Introduction] Provide explicit installation instructions, GitHub/CRAN link, and at least one complete reproducible example workflow in the main text or supplementary material to support the claim of a practical R-native framework.
  3. [Methods] Clarify in the methods whether the LLM calls use a specific R package (e.g., httr, reticulate) and how prompt templates are ensured to match the Python reference exactly.
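
On that last point: the paper excerpt does not say which HTTP client ragR uses, but an LLM call issued from R via httr typically looks like the following sketch (placeholder endpoint and model id; not ragR's actual code).

```r
# Hedged illustration of an LLM API call from R via httr; the paper does
# not state which HTTP client ragR uses, so this is NOT ragR's code.
library(httr)

resp <- POST(
  url = "https://api.example.com/v1/chat/completions",  # placeholder endpoint
  add_headers(Authorization = paste("Bearer", Sys.getenv("LLM_API_KEY"))),
  body = list(
    model       = "some-model",  # placeholder model id
    temperature = 0,             # deterministic scoring
    messages = list(
      list(role = "system", content = "You are a helpful assistant."),
      list(role = "user",   content = "prompt with retrieved context")
    )
  ),
  encode = "json"
)
answer <- content(resp)  # parsed JSON response
```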

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and the recommendation for minor revision. We agree that the validation experiments require substantially more detail to substantiate the central claim of metric equivalence between ragR and the Python RAGAS reference implementation. We will revise the manuscript to address both comments by expanding the abstract and adding a dedicated validation section with full experimental specifications, quantitative results, and statistical comparisons.

Point-by-point responses
  1. Referee: [Abstract] The claim that 'validation experiments under controlled settings show that ragR captures similar metric behavior to the reference Python RAGAS workflow' lacks any details on datasets, LLMs/endpoints, temperature/seed settings, prompt templates, or quantitative agreement measures (e.g., correlations or mean differences). This directly undermines the ability to confirm that observed similarity is not due to language-specific differences in prompt serialization, API parsing, or sampling, as the central equivalence claim requires matched conditions.

    Authors: We acknowledge that the abstract statement is insufficiently supported. In the revised version we will modify the abstract to include a concise summary of the controlled experimental conditions: the specific datasets (standard RAGAS benchmark QA pairs drawn from public sources), the LLMs and API endpoints used for both generation and scoring, temperature set to 0 with fixed seeds for reproducibility, the prompt templates employed, and quantitative agreement statistics (Pearson and Spearman correlations plus mean absolute differences between ragR and Python RAGAS metric scores). This will allow readers to evaluate whether any observed similarity could be attributable to implementation differences. (revision: yes)

  2. Referee: [Validation experiments] No experimental setup, results tables, or statistical comparisons are described, leaving the supporting evidence for the key claim of metric equivalence unsubstantiated and preventing assessment of whether R-specific libraries introduce systematic shifts relative to Python RAGAS.

    Authors: We agree that the manuscript currently lacks a dedicated validation section with the required methodological transparency. We will add a new section titled 'Validation Experiments' that details: (i) the datasets and number of QA pairs, (ii) the exact LLMs, endpoints, and temperature/seed settings for both ragR and the reference Python implementation, (iii) the prompt templates used for each RAGAS metric, (iv) side-by-side results tables reporting context precision, context recall, faithfulness, and answer relevance scores from both packages, and (v) statistical comparisons including correlation coefficients and mean differences. This addition will directly address concerns about potential systematic shifts introduced by R-specific libraries and will make the equivalence claim verifiable. (revision: yes)
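
The agreement statistics this response promises reduce to a few lines of R. A sketch with placeholder scores, not results from the paper:

```r
# Hedged sketch of the promised agreement statistics (Pearson and
# Spearman correlations plus mean absolute differences) per metric.
# Scores are illustrative placeholders, not results from the paper.
ragr <- data.frame(
  context_precision = c(0.92, 0.85, 0.78),
  faithfulness      = c(0.90, 0.88, 0.95)
)
ragas <- data.frame(
  context_precision = c(0.90, 0.86, 0.80),
  faithfulness      = c(0.91, 0.85, 0.96)
)

agreement <- function(x, y) c(
  pearson  = cor(x, y, method = "pearson"),
  spearman = cor(x, y, method = "spearman"),
  mad      = mean(abs(x - y))
)

# One column of agreement statistics per metric.
sapply(names(ragr), function(m) agreement(ragr[[m]], ragas[[m]]))
```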

Circularity Check

0 steps flagged

No circularity: software implementation paper with empirical comparison to external reference

Full rationale

The paper introduces an R package for RAG workflows and RAGAS-style evaluation. Its central claim is that validation experiments show similar metric behavior to the Python RAGAS reference under controlled settings. No derivations, equations, fitted parameters, predictions, or self-citations are present in the provided text. The validation is a direct empirical comparison to an independent external implementation (Python RAGAS), not a reduction to the paper's own inputs or fitted quantities. This matches the default expectation for non-circular software/implementation papers.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a software package introduction paper; the central claim rests on standard RAG components and the existing RAGAS metric definitions rather than any new theoretical constructs.

pith-pipeline@v0.9.0 · 5484 in / 1036 out tokens · 37939 ms · 2026-05-08T04:58:55.001043+00:00 · methodology


Reference graph

Works this paper leans on

3 extracted references · 2 canonical work pages · 1 internal anchor

  1. [1]

    RAGAS: Automated evalu- ation of retrieval-augmented generation.arXiv preprint arXiv:2309.15217,

    URL https://www.rplumber.io/. R package version. [p7] S. Es, J. James, L. Espinosa-Anke, and S. Schockaert. Automated evaluation of retrieval augmented generation.arXiv preprint arXiv:2309.15217,

  2. [2]

    Retrieval-Augmented Generation for Large Language Models: A Survey

    URL https://docs.ragas.io/en/ stable/. Accessed 2026-02-22. [p1, 2, 7, 9, 12, 13] Y. Gao, Y. Xiong, X. Gao, K. Jia, J. Pan, Y. Bi, Y. Dai, J. Sun, M. Wang, and H. Wang. Retrieval-augmented generation for large language models: A survey.arXiv preprint arXiv:2312.10997,

  3. [3]

    R package version 0.1.7

    URL https://CRAN.R-project.org/package= RAGFlowChainR. R package version 0.1.7. [p1, 2, 13] Package source code.The development version of ragR is available at https://github. com/aimalrehman92/ragR. Muhammad Aimal Rehman Department of Mathematics and Statistics Georgia State University 25 Park Place, Atlanta, GA 30303 United States of America mrehman3@st...