mamabench and mamaretrieval: Benchmarks for Evaluating Medical Retrieval-Augmented Generation in Maternal, Neonatal, and Reproductive Health
Pith reviewed 2026-06-30 07:36 UTC · model grok-4.3
The pith
Two benchmarks assembled from expert sources enable evaluation of retrieval-augmented generation for maternal, neonatal, and reproductive health.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
mamabench assembles and scope-filters existing expert-authored sources into a question-answering set, and mamaretrieval adds decomposed graded relevance labels over a maternal-health guideline corpus. Together they give public resources for evaluating retrieval-augmented generation on the maternal, neonatal, and reproductive health questions that nurse-midwives encounter, including what the authors describe as, to their knowledge, the first public chunk-level relevance benchmark for maternal-health guideline retrieval. The limits of the labels are explicitly disclosed.
What carries the argument
mamabench is a 25,949-item QA set filtered from seven expert sources across multiple-choice, short-answer, and rubric-graded tracks. mamaretrieval pairs 3,185 queries with graded (0-6) relevance labels over 63,650 chunks, using a rubric that separates chunks that answer a query from chunks that are merely on its topic.
If this is right
- LLM judges for rubric-graded answers can be calibrated using the re-scoped physician-labelled meta-evaluation from HealthBench.
- Deployed on-device maternal-health assistants can be evaluated end-to-end against both QA accuracy and retrieval quality.
- Retrieval systems can be trained and measured on continuous relevance grades rather than binary decisions.
- Future benchmark creators can adopt the same practice of reporting scope-classifier agreement, frontier-judge checks, and pooling-completeness audits.
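To make the point about graded versus binary relevance concrete, here is a minimal sketch of nDCG computed over 0-6 grades and over a binarised copy of the same grades. The function names, the exponential gain, and the example grades are illustrative assumptions; the paper's own metric choices are not stated on this page.

```python
import math

def dcg(grades, k=10):
    # Discounted cumulative gain with exponential gain (2^grade - 1),
    # one common convention; the paper may use a different gain.
    return sum((2 ** g - 1) / math.log2(rank + 2)
               for rank, g in enumerate(grades[:k]))

def ndcg(ranked_grades, judged_grades, k=10):
    # ranked_grades: grades of the retrieved chunks, in ranked order.
    # judged_grades: grades of all judged chunks for the query.
    ideal = dcg(sorted(judged_grades, reverse=True), k)
    return dcg(ranked_grades, k) / ideal if ideal > 0 else 0.0

def binarise(grades, threshold=1):
    # Collapse graded labels to relevant (1) / not relevant (0).
    return [1 if g >= threshold else 0 for g in grades]

# Hypothetical query with three judged chunks:
# grade 6 answers the query, grade 2 is merely on topic, grade 0 is off topic.
judged = [6, 2, 0]
answer_first = [6, 2, 0]   # system A ranks the answering chunk first
topical_first = [2, 6, 0]  # system B ranks the topical chunk first

graded_a = ndcg(answer_first, judged)
graded_b = ndcg(topical_first, judged)
binary_a = ndcg(binarise(answer_first), binarise(judged))
binary_b = ndcg(binarise(topical_first), binarise(judged))
```

With graded labels, system A scores higher than system B. After binarisation the two rankings become indistinguishable, which is the information a graded scale preserves.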
Where Pith is reading between the lines
- The graded-relevance approach could be extended to other medical guideline corpora where topic overlap is common but direct answers are rare.
- Similar assembly from expert sources might reduce the cost of creating domain-specific benchmarks in other clinical areas that already have published guidelines.
- The explicit disclosure of label limits provides a template that could raise standards for how medical AI evaluation resources are released.
Load-bearing premise
The seven existing expert-authored sources, after scope filtering, adequately represent the range of maternal, neonatal, and reproductive-health questions that nurse-midwives actually ask in practice.
What would settle it
A survey or log of real nurse-midwife clinical queries in which more than a small fraction fall outside the topics covered by the seven filtered sources.
Original abstract
Medical question-answering benchmarks rarely cover the maternal, neonatal, child, and reproductive-health questions a nurse-midwife asks, and, to our knowledge, no public chunk-level relevance benchmark exists for maternal-health guideline retrieval. We release two benchmarks that fill these gaps. mamabench is a scope-filtered QA set of 25,949 items assembled from seven existing expert-authored sources across multiple-choice, short-answer, and rubric-graded tracks; to help users calibrate the LLM judge that scores the rubric track, we re-scope HealthBench's physician-labelled meta-evaluation to the domain. mamaretrieval pairs 3,185 clinical queries with graded (0-6) relevance labels over a 63,650-chunk maternal-health guideline corpus, using a decomposed rubric that distinguishes a chunk that answers a query from one merely on its topic. Three decisions shape both: assemble and filter expert sources rather than author questions, grade relevance rather than binarise it, and measure and disclose the limits of the labels -- scope-classifier agreement, a frontier-judge check, and a pooling-completeness audit -- rather than treat them as an oracle. A companion paper uses the benchmarks to evaluate a deployed on-device assistant; both are released openly for research.
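The abstract does not say which statistic is used for scope-classifier agreement. As a hedged illustration, the sketch below computes Cohen's kappa between a classifier's in-scope decisions and a human annotator's; the statistic choice, the function name, and the example labels are assumptions, not the authors' procedure.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    # Chance-corrected agreement between two label sequences.
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected by chance from each rater's label frequencies.
    expected = sum(count_a[c] * count_b[c]
                   for c in set(count_a) | set(count_b)) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)

# Hypothetical audit sample: 1 = in scope, 0 = out of scope.
classifier = [1, 1, 0, 0, 1, 0, 1, 1, 0, 0]
human      = [1, 1, 0, 0, 1, 0, 0, 1, 0, 1]
kappa = cohen_kappa(classifier, human)
```

Here the raters agree on 8 of 10 items and chance agreement is 0.5, so kappa is about 0.6. Reporting kappa alongside raw agreement shows how much of the agreement exceeds what label frequencies alone would produce.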
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper releases two benchmarks for medical retrieval-augmented generation focused on maternal, neonatal, and reproductive health. mamabench is a scope-filtered QA dataset of 25,949 items drawn from seven existing expert-authored sources across multiple-choice, short-answer, and rubric-graded tracks (with a re-scoped HealthBench meta-evaluation for the rubric track). mamaretrieval pairs 3,185 clinical queries with graded (0-6) relevance labels over a 63,650-chunk guideline corpus. The work emphasizes assembling and filtering expert sources rather than authoring new questions, using graded rather than binary relevance, and disclosing label limits such as scope-classifier agreement, frontier-judge checks, and pooling-completeness audits.
Significance. If the filtered sources prove representative of real clinical questions in the target setting, the benchmarks would address a documented gap in domain-specific resources for maternal-health QA and chunk-level retrieval evaluation. The transparent disclosure of label-quality metrics and the reuse of expert sources are methodological strengths that increase the resources' utility for downstream LLM and RAG assessment; the companion paper on a deployed on-device assistant further demonstrates practical relevance.
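The pooling-completeness audit named in the summary is not described in detail on this page. One common way to probe completeness is to measure how many of a system's top-k retrieved chunks carry any relevance label at all; the sketch below does that under assumed data shapes (`run` and `qrels` are hypothetical names and structures, not the released file format).

```python
def judged_at_k(run, qrels, k=10):
    # run:   {query_id: [chunk_id, ...]} in ranked order
    # qrels: {query_id: {chunk_id: grade}} for judged chunks only
    # Returns the mean fraction of top-k retrieved chunks that are judged.
    fractions = []
    for query_id, ranked in run.items():
        top = ranked[:k]
        if not top:
            continue  # skip queries with no retrieved chunks
        judged = sum(1 for chunk_id in top
                     if chunk_id in qrels.get(query_id, {}))
        fractions.append(judged / len(top))
    return sum(fractions) / len(fractions) if fractions else 0.0

# Hypothetical run and labels.
run = {"q1": ["c1", "c2", "c3", "c4"], "q2": ["c5", "c6"]}
qrels = {"q1": {"c1": 6, "c2": 0, "c9": 3}, "q2": {"c5": 2, "c6": 4}}
coverage = judged_at_k(run, qrels, k=4)
```

In this example q1 has 2 of 4 retrieved chunks judged and q2 has 2 of 2, giving a mean of 0.75. A low value for a new retriever suggests its scores on the benchmark understate quality, because unjudged chunks are typically treated as non-relevant.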
major comments (1)
- [Abstract and source-selection description] The claim that the benchmarks fill the stated gaps in coverage for the questions a nurse-midwife asks depends on the representativeness of the seven scope-filtered expert sources. The manuscript reports no comparison of the resulting 25,949 QA items or 3,185 queries against actual clinical query logs, practitioner surveys, or usage data from maternal-health settings (Abstract; source-selection description). This assumption is load-bearing for the central contribution and remains untested.
Simulated Author's Rebuttal
We thank the referee for the review and for identifying the load-bearing assumption regarding source representativeness. We respond to the single major comment below.
Point-by-point responses
Referee: [Abstract and source-selection description] The claim that the benchmarks fill the stated gaps in coverage for the questions a nurse-midwife asks depends on the representativeness of the seven scope-filtered expert sources. The manuscript reports no comparison of the resulting 25,949 QA items or 3,185 queries against actual clinical query logs, practitioner surveys, or usage data from maternal-health settings (Abstract; source-selection description). This assumption is load-bearing for the central contribution and remains untested.
Authors: We agree that a direct comparison against clinical query logs, practitioner surveys, or usage data would provide stronger evidence. No such public logs or surveys exist for this narrow domain, and obtaining them would require resources and approvals outside the scope of a benchmark-release paper. Source selection was instead driven by the established authority of the seven expert-authored collections (WHO guidelines, standard midwifery texts, and similar resources) that are routinely used in training and reference for nurse-midwives and equivalent practitioners. The scope classifier was trained and audited specifically to retain only items within the target maternal, neonatal, and reproductive-health scope. We have revised the source-selection description and limitations sections to state more explicitly that the benchmarks constitute a high-quality proxy derived from authoritative expert sources rather than a validated sample of real-world query distributions.
Revision: partial
- Direct empirical validation of representativeness against private clinical query logs or new practitioner surveys cannot be performed with available resources.
Circularity Check
No circularity: benchmark release relies on external sources without self-referential derivations
The paper is a data-resource release that assembles and scope-filters existing expert-authored QA sources and guideline corpora into mamabench and mamaretrieval. No equations, fitted parameters, predictions, or derivations appear in the manuscript. The central claims rest on the external provenance of the seven sources and disclosed label-quality audits rather than any internal reduction to the paper's own inputs. Self-citation is limited to a companion paper that applies the benchmarks; it is not load-bearing for the construction or validity claims here. This matches the default non-circular case for resource papers.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
HealthBench: Evaluating Large Language Models Towards Improved Human Health. arXiv:2505.08775 [cs.CL]. https://arxiv.org/abs/2505.08775
-
[2]
Naghmeh Farzi and Laura Dietz. Criteria-Based LLM Relevance Judgments. arXiv:2507.09488 [cs.IR]. https://arxiv.org/abs/2507.09488
-
[3]
Di Jin, Eileen Pan, Nassim Oufattole, Wei-Hung Weng, Hanyi Fang, and Peter Szolovits. What Disease Does This Patient Have? A Large-Scale Open Domain Question Answering Dataset from Medical Exams. Applied Sciences 11, 14 (2021). https://github.com/jind11/MedQA
-
[4]
Qiao Jin, Won Kim, Qingyu Chen, Donald C. Comeau, Lana Yeganova, W. John Wilbur, and Zhiyong Lu. MedCPT: Contrastive Pre-trained Transformers with large-scale PubMed search logs for zero-shot biomedical information retrieval. Bioinformatics 39, 11 (2023), btad651.
-
[5]
Gecko: Versatile Text Embeddings Distilled from Large Language Models. arXiv:2403.20327 [cs.CL]. https://arxiv.org/abs/2403.20327
-
[6]
Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP).
-
[7]
Paul Mwaniki, Wycliffe Musau, Lynda Isaaka, Conrad Wanyama, Vinod Menon, Alastair K. Denniston, Xiaoxuan Liu, Mphatso Emmanual-Fabula, Gerald Williams, Bilal A. Mateen, and Ambrose Agweyu. Benchmarking Large Language Models and Clinicians Using Locally Generated Primary Healthcare Vignettes in Kenya. medRxiv preprint 2025.10.25.25338798. https://www.medrxiv.org/content/10.1101/2025.10.25.25338798v1
-
[8]
Charles Nimo, Tobi Olatunji, et al. AfriMed-QA: A Pan-African, Multi-Specialty, Medical Question-Answering Benchmark Dataset. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL). 1948–1973. https://aclanthology.org/2025.acl-long.96/
-
[9]
Overview of the TREC 2017 Precision Medicine Track. In Proceedings of the Twenty-Sixth Text REtrieval Conference (TREC).
-
[10]
Stephen Robertson and Hugo Zaragoza. The Probabilistic Relevance Framework: BM25 and Beyond. Foundations and Trends in Information Retrieval 3, 4 (2009), 333–389.
-
[11]
Tefko Saracevic. Relevance: A Review of the Literature and a Framework for Thinking on the Notion in Information Science. Part II. Journal of the American Society for Information Science and Technology 58, 13 (2007), 1915–1933.
-
[12]
UMBRELA: The Open-Source Reproduction of the Bing Relevance Assessor. arXiv:2406.06519 [cs.IR]. https://arxiv.org/abs/2406.06519
-
[13]
how often to check BP
What fraction of the chunk text is directly useful for answering the specific query? (Not the broader topic -- the specific query.) 0 -- useful content is < 25% of the chunk: long chunk with one buried relevant sentence; mostly off-topic surrounding text; the answer exis...
2016