mamabench and mamaretrieval: Benchmarks for Evaluating Medical Retrieval-Augmented Generation in Maternal, Neonatal, and Reproductive Health

Yi Ren

arxiv: 2606.29467 · v1 · pith:VXCN43UCnew · submitted 2026-06-28 · 💻 cs.CL · cs.IR

Pith reviewed 2026-06-30 07:36 UTC · model grok-4.3

keywords: maternal health, retrieval-augmented generation, medical benchmarks, question answering, relevance grading, clinical guidelines, nurse-midwife queries

The pith

Two benchmarks assembled from expert sources enable evaluation of retrieval-augmented generation for maternal, neonatal, and reproductive health.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper releases mamabench, a set of 25,949 scope-filtered questions drawn from seven existing expert sources, and mamaretrieval, a collection of 3,185 queries paired with 0-6 graded relevance labels over 63,650 guideline chunks. These resources address the absence of public benchmarks covering the specific questions nurse-midwives ask and the lack of chunk-level graded relevance data for maternal-health retrieval. The construction choices are to filter expert material rather than write new items, to use graded relevance instead of binary labels, and to report label-quality checks such as scope-classifier agreement and pooling audits. A sympathetic reader would see this as supplying the missing test beds needed to measure how well language models and retrieval systems perform in this clinical domain.
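As an illustration of what a scope-classifier agreement figure might look like, the sketch below computes a chance-corrected agreement score between a classifier's in-scope decisions and a reference labeller. This is a minimal sketch, not the paper's procedure: the summary does not say which statistic the author reports, so the use of Cohen's kappa is an assumption, and the function name and toy labels are hypothetical.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(labels_a)
    # Observed agreement: share of items where both labellers agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement under independence, from each labeller's marginals.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both labellers used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical toy data: classifier decisions vs. a reference labeller.
classifier = ["in", "in", "out", "out"]
reference = ["in", "out", "out", "out"]
kappa = cohens_kappa(classifier, reference)
```

Raw percentage agreement would be 75% on this toy data; kappa is lower because some of that agreement is expected by chance when most items fall in one class.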

Core claim

By assembling and scope-filtering existing expert-authored sources into mamabench for question answering, and by building mamaretrieval with decomposed graded relevance labels over a maternal-health guideline corpus, the paper provides what it presents as the first public resources for evaluating retrieval-augmented generation on the maternal, neonatal, and reproductive health questions that nurse-midwives encounter. It also explicitly discloses the limits of those labels.

What carries the argument

mamabench, a 25,949-item QA set filtered from seven expert sources across multiple-choice, short-answer, and rubric tracks, together with mamaretrieval, 3,185 queries paired with 0-6 graded relevance labels over 63,650 chunks using a rubric that separates answer-providing chunks from merely topical ones.

If this is right

LLM judges for rubric-graded answers can be calibrated using the re-scoped physician-labelled meta-evaluation from HealthBench.
Deployed on-device maternal-health assistants can be evaluated end-to-end against both QA accuracy and retrieval quality.
Retrieval systems can be trained and measured on graded (0-6) relevance labels rather than binary decisions.
Future benchmark creators can adopt the same practice of reporting scope-classifier agreement, frontier-judge checks, and pooling-completeness audits.
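To make the graded-versus-binary point concrete, the sketch below computes nDCG@k from 0-6 grades and from a binarised copy of the same labels. It is an illustration under stated assumptions, not the paper's evaluation code: the linear gain, the log2 discount, the binarisation threshold, normalising against the retrieved list only, and all function names are choices made here.

```python
import math


def dcg(grades, k):
    """Discounted cumulative gain over the first k grades (linear gain)."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(grades[:k]))


def ndcg_at_k(ranked_grades, k=10):
    """nDCG@k, normalised by the best ordering of the same grades.

    Simplification: the ideal ranking is built from the retrieved list only,
    not from every judged chunk for the query.
    """
    ideal = dcg(sorted(ranked_grades, reverse=True), k)
    return dcg(ranked_grades, k) / ideal if ideal > 0 else 0.0


def binarise(grades, threshold=1):
    """Collapse graded labels to 0/1 (threshold is an assumption)."""
    return [1 if g >= threshold else 0 for g in grades]


# Hypothetical ranking: a merely topical chunk (grade 3) is placed above
# the chunk that actually answers the query (grade 6).
ranking = [3, 6]
graded_score = ndcg_at_k(ranking)
binary_score = ndcg_at_k(binarise(ranking))
```

With binarised labels both chunks count as equally relevant, so the misordered ranking scores as perfect; the graded score penalises it.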

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The graded-relevance approach could be extended to other medical guideline corpora where topic overlap is common but direct answers are rare.
Similar assembly from expert sources might reduce the cost of creating domain-specific benchmarks in other clinical areas that already have published guidelines.
The explicit disclosure of label limits provides a template that could raise standards for how medical AI evaluation resources are released.

Load-bearing premise

The seven existing expert-authored sources, after scope filtering, adequately represent the range of maternal, neonatal, and reproductive-health questions that nurse-midwives actually ask in practice.

What would settle it

A survey or log of real nurse-midwife clinical queries in which more than a small fraction fall outside the topics covered by the seven filtered sources.

Original abstract

Medical question-answering benchmarks rarely cover the maternal, neonatal, child, and reproductive-health questions a nurse-midwife asks, and, to our knowledge, no public chunk-level relevance benchmark exists for maternal-health guideline retrieval. We release two benchmarks that fill these gaps. mamabench is a scope-filtered QA set of 25,949 items assembled from seven existing expert-authored sources across multiple-choice, short-answer, and rubric-graded tracks; to help users calibrate the LLM judge that scores the rubric track, we re-scope HealthBench's physician-labelled meta-evaluation to the domain. mamaretrieval pairs 3,185 clinical queries with graded (0-6) relevance labels over a 63,650-chunk maternal-health guideline corpus, using a decomposed rubric that distinguishes a chunk that answers a query from one merely on its topic. Three decisions shape both: assemble and filter expert sources rather than author questions, grade relevance rather than binarise it, and measure and disclose the limits of the labels -- scope-classifier agreement, a frontier-judge check, and a pooling-completeness audit -- rather than treat them as an oracle. A companion paper uses the benchmarks to evaluate a deployed on-device assistant; both are released openly for research.
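One simple way to probe pooling completeness is to measure how many of a new system's top-k results have no relevance judgement at all. The sketch below is an assumption about what such a check could look like, not the audit the paper ran: the abstract does not describe its procedure, and the data layout (`qrels` as query id to chunk id to grade, `run` as query id to ranked chunk ids) and function name are hypothetical.

```python
def unjudged_at_k(run, qrels, k=10):
    """Fraction of top-k retrieved chunks that have no relevance label."""
    total = 0
    unjudged = 0
    for query_id, ranked_chunk_ids in run.items():
        judged = qrels.get(query_id, {})
        for chunk_id in ranked_chunk_ids[:k]:
            total += 1
            if chunk_id not in judged:
                unjudged += 1
    return unjudged / total if total else 0.0


# Hypothetical toy data: two judged chunks, four retrieved chunks.
qrels = {"q1": {"c1": 6, "c2": 0}}
run = {"q1": ["c1", "c3", "c2", "c4"]}
rate = unjudged_at_k(run, qrels, k=4)
```

A high rate for a system that did not contribute to the pool suggests that its scores on the benchmark may be understated, since unjudged chunks are usually treated as non-relevant.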

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper releases two filtered benchmarks for maternal health QA and retrieval using existing expert sources with graded labels and disclosed quality metrics, but leaves the key representativeness claim untested.

The core thing to know is that this work releases mamabench (25,949 scope-filtered QA items from seven expert sources) and mamaretrieval (3,185 queries with 0-6 graded relevance over a 63,650-chunk guideline corpus). The author chose to filter and reuse existing materials rather than write new questions, kept relevance graded instead of binary, and reported label-quality checks: scope-classifier agreement, a frontier-judge check, and a pooling-completeness audit.

What the paper does cleanly is make the construction choices explicit and add a domain-scoped version of HealthBench's meta-evaluation to help users calibrate LLM judges on the rubric track. Releasing both benchmarks openly, alongside a companion paper that applies them, is also straightforward and useful for anyone who wants to test RAG systems in this area.

The soft spot is the representativeness assumption. The claim that these resources fill the gap for nurse-midwife questions rests on the filtered expert sources being a good match for real practice, yet the paper gives no comparison to actual query logs, usage data, or practitioner surveys. They disclose other limits, but this one is load-bearing and unaddressed. Minor issues like the abstract-only view of the audits do not change the picture.

This is for researchers building or evaluating medical RAG systems who need domain-specific test sets in maternal health. A reader working on retrieval benchmarks or clinical QA could extract value by running their models on the released data. The work shows clear thinking about benchmark construction trade-offs and honest disclosure of limits, so it deserves peer review even if downstream users will ultimately decide how well the sources match their needs.

Referee Report

1 major / 0 minor

Summary. The paper releases two benchmarks for medical retrieval-augmented generation focused on maternal, neonatal, and reproductive health. mamabench is a scope-filtered QA dataset of 25,949 items drawn from seven existing expert-authored sources across multiple-choice, short-answer, and rubric-graded tracks (with a re-scoped HealthBench meta-evaluation for the rubric track). mamaretrieval pairs 3,185 clinical queries with graded (0-6) relevance labels over a 63,650-chunk guideline corpus. The work emphasizes assembling and filtering expert sources rather than authoring new questions, using graded rather than binary relevance, and disclosing label limits such as scope-classifier agreement, frontier-judge checks, and pooling-completeness audits.

Significance. If the filtered sources prove representative of real clinical questions in the target setting, the benchmarks would address a documented gap in domain-specific resources for maternal-health QA and chunk-level retrieval evaluation. The transparent disclosure of label-quality metrics and the reuse of expert sources are methodological strengths that increase the resources' utility for downstream LLM and RAG assessment; the companion paper on a deployed on-device assistant further demonstrates practical relevance.

major comments (1)

[Abstract and source-selection description] The claim that the benchmarks fill the stated gaps in coverage for the questions a nurse-midwife asks depends on the representativeness of the seven scope-filtered expert sources. The manuscript reports no comparison of the resulting 25,949 QA items or 3,185 queries against actual clinical query logs, practitioner surveys, or usage data from maternal-health settings (Abstract; source-selection description). This assumption is load-bearing for the central contribution and remains untested.

Simulated Author's Rebuttal

1 response · 1 unresolved

We thank the referee for the review and for identifying the load-bearing assumption regarding source representativeness. We respond to the single major comment below.

Referee: [Abstract and source-selection description] The claim that the benchmarks fill the stated gaps in coverage for the questions a nurse-midwife asks depends on the representativeness of the seven scope-filtered expert sources. The manuscript reports no comparison of the resulting 25,949 QA items or 3,185 queries against actual clinical query logs, practitioner surveys, or usage data from maternal-health settings (Abstract; source-selection description). This assumption is load-bearing for the central contribution and remains untested.

Authors: We agree that a direct comparison against clinical query logs, practitioner surveys, or usage data would provide stronger evidence. No such public logs or surveys exist for this narrow domain, and obtaining them would require resources and approvals outside the scope of a benchmark-release paper. Source selection was instead driven by the established authority of the seven expert-authored collections (WHO guidelines, standard midwifery texts, and similar resources) that are routinely used in training and reference for nurse-midwives and equivalent practitioners. The scope classifier was trained and audited specifically to retain only items within the target maternal, neonatal, and reproductive-health scope. We have revised the source-selection description and limitations sections to state more explicitly that the benchmarks constitute a high-quality proxy derived from authoritative expert sources rather than a validated sample of real-world query distributions.

Revision: partial

standing simulated objections not resolved

Direct empirical validation of representativeness against private clinical query logs or new practitioner surveys cannot be performed with available resources.

Circularity Check

0 steps flagged

No circularity: benchmark release relies on external sources without self-referential derivations

The paper is a data-resource release that assembles and scope-filters existing expert-authored QA sources and guideline corpora into mamabench and mamaretrieval. No equations, fitted parameters, predictions, or derivations appear in the manuscript. The central claims rest on the external provenance of the seven sources and disclosed label-quality audits rather than any internal reduction to the paper's own inputs. Self-citation is limited to a companion paper that applies the benchmarks; it is not load-bearing for the construction or validity claims here. This matches the default non-circular case for resource papers.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical model, free parameters, axioms, or invented entities; the contribution is curation and release of evaluation data.

pith-pipeline@v0.9.1-grok · 5755 in / 1116 out tokens · 31821 ms · 2026-06-30T07:36:37.493131+00:00 · methodology

Reference graph

Works this paper leans on

13 extracted references · 5 canonical work pages · 1 internal anchor

[1]

HealthBench: Evaluating Large Language Models Towards Improved Human Health. arXiv:2505.08775 [cs.CL]. https://arxiv.org/abs/2505.08775

[2]

Naghmeh Farzi and Laura Dietz. Criteria-Based LLM Relevance Judgments. arXiv:2507.09488 [cs.IR]. https://arxiv.org/abs/2507.09488

[3]

Di Jin, Eileen Pan, Nassim Oufattole, Wei-Hung Weng, Hanyi Fang, and Peter Szolovits. What Disease Does This Patient Have? A Large-Scale Open Domain Question Answering Dataset from Medical Exams. Applied Sciences 11, 14 (2021). https://github.com/jind11/MedQA

[4]

Qiao Jin, Won Kim, Qingyu Chen, Donald C. Comeau, Lana Yeganova, W. John Wilbur, and Zhiyong Lu. MedCPT: Contrastive Pre-trained Transformers with large-scale PubMed search logs for zero-shot biomedical information retrieval. Bioinformatics 39, 11 (2023), btad651.

[5]

Gecko: Versatile Text Embeddings Distilled from Large Language Models. arXiv:2403.20327 [cs.CL]. https://arxiv.org/abs/2403.20327

[6]

Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2023.

[7]

Paul Mwaniki, Wycliffe Musau, Lynda Isaaka, Conrad Wanyama, Vinod Menon, Alastair K. Denniston, Xiaoxuan Liu, Mphatso Emmanual-Fabula, Gerald Williams, Bilal A. Mateen, and Ambrose Agweyu. Benchmarking Large Language Models and Clinicians Using Locally Generated Primary Healthcare Vignettes in Kenya. medRxiv preprint 2025.10.25.25338798. https://www.medrxiv.org/content/10.1101/2025.10.25.25338798v1

[8]

Charles Nimo, Tobi Olatunji, et al. AfriMed-QA: A Pan-African, Multi-Specialty, Medical Question-Answering Benchmark Dataset. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL). 1948–1973. https://aclanthology.org/2025.acl-long.96/

[9]

Overview of the TREC 2017 Precision Medicine Track. In Proceedings of the Twenty-Sixth Text REtrieval Conference (TREC). 2017.

[10]

Stephen Robertson and Hugo Zaragoza. The Probabilistic Relevance Framework: BM25 and Beyond. Foundations and Trends in Information Retrieval 3, 4 (2009), 333–389.

[11]

Tefko Saracevic. Relevance: A Review of the Literature and a Framework for Thinking on the Notion in Information Science. Part II. Journal of the American Society for Information Science and Technology 58, 13 (2007), 1915–1933.

[12]

UMBRELA: The Open-Source Reproduction of the Bing Relevance Assessor. arXiv:2406.06519 [cs.IR]. https://arxiv.org/abs/2406.06519

[13]

how often to check BP

What fraction of the chunk text is directly useful for answering the specific query? (Not the broader topic -- the specific query.) 0 -- useful content is < 25% of the chunk: long chunk with one buried relevant sentence; mostly off-topic surrounding text; the answer exis...
