pith. machine review for the scientific record.

arxiv: 2605.12361 · v1 · submitted 2026-05-12 · 💻 cs.CL · cs.AI · cs.IR

Recognition: 2 theorem links · Lean Theorem

MedHopQA: A Disease-Centered Multi-Hop Reasoning Benchmark and Evaluation Framework for LLM-Based Biomedical Question Answering

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 04:16 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.IR
keywords biomedical question answering · multi-hop reasoning · LLM evaluation · benchmark dataset · disease-centered QA · contamination resistance · compositional reasoning · open-ended QA

The pith

MedHopQA provides a benchmark of 1,000 expert questions that each require synthesizing information from two distinct Wikipedia articles to answer open-ended biomedical queries about diseases.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that existing biomedical QA benchmarks let models succeed through pattern matching, answer elimination, or memorized data rather than genuine reasoning across sources. It introduces MedHopQA as a set of 1,000 disease-centered questions built so that each one needs facts from two separate articles, presented in free-text form with ontology synonym sets for flexible scoring. The questions sit inside a larger public collection of 10,000 items whose answers are withheld on a leaderboard, and the dataset was assembled through human curation plus LLM-assisted validation. This design is offered both as a ready benchmark and as a reusable construction process meant to keep future datasets resistant to saturation and contamination. If the approach works, evaluation of LLMs for tasks such as diagnostic support or literature discovery will rest on harder-to-game tests of compositional reasoning.

Core claim

MedHopQA is a disease-centered multi-hop reasoning benchmark consisting of 1,000 expert-curated question-answer pairs. Each question requires synthesis of information across two distinct Wikipedia articles and is supplied in open-ended free-text format, augmented with ontology-grounded synonym sets from MONDO, NCBI Gene, and NCBI Taxonomy for lexical and concept-level evaluation. The dataset was constructed through a structured process of human annotation, triage, iterative verification, and LLM-as-a-judge validation, then embedded within a publicly available collection of 10,000 questions with answers withheld on a CodaBench leaderboard to limit contamination.
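
For concreteness, here is a minimal sketch, not taken from the paper, of what a MedHopQA-style item and the two-level scoring it implies could look like: a lexical match against the gold free-text answer and a concept-level match against the ontology-grounded synonym set. The field names, the normalization, and the example item are illustrative assumptions, not the authors' schema or evaluation code.

```python
# Minimal sketch of a MedHopQA-style record and two-level scoring.
# Field names, normalization, and matching rules are illustrative
# assumptions, not the paper's actual schema or evaluation harness.
from dataclasses import dataclass, field

@dataclass
class MedHopItem:
    question: str
    gold_answer: str                                     # free-text gold answer
    synonym_set: set[str] = field(default_factory=set)   # MONDO / NCBI Gene / NCBI Taxonomy synonyms
    source_articles: tuple[str, str] = ("", "")          # two distinct Wikipedia titles

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace; a stand-in for the benchmark's normalization."""
    return " ".join(text.lower().split())

def score(item: MedHopItem, prediction: str) -> dict[str, bool]:
    pred = normalize(prediction)
    lexical = pred == normalize(item.gold_answer)
    concept = lexical or pred in {normalize(s) for s in item.synonym_set}
    return {"lexical_match": lexical, "concept_match": concept}

# Fabricated example, for illustration only:
item = MedHopItem(
    question="Which gene is mutated in the hereditary disease characterized by thick mucus secretions?",
    gold_answer="CFTR",
    synonym_set={"cystic fibrosis transmembrane conductance regulator", "CFTR gene"},
    source_articles=("Cystic fibrosis", "CFTR"),
)
print(score(item, "cystic fibrosis transmembrane conductance regulator"))
# -> {'lexical_match': False, 'concept_match': True}
```

The synonym sets are what let an open-ended answer be credited at the concept level even when the surface string differs from the gold annotation.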

What carries the argument

The requirement that each question integrates facts from exactly two distinct Wikipedia articles, enforced by expert curation and a hidden-answer leaderboard structure that separates the scored items from the larger public set.

If this is right

  • Models must demonstrate cross-article inference instead of single-document lookup or elimination strategies to perform well.
  • The construction process can be reused to generate additional biomedical or domain-specific multi-hop datasets that maintain discriminative power.
  • Evaluation at both surface and concept levels is supported by the provided ontology synonym sets.
  • The benchmark can serve as a more durable test for clinically relevant capabilities such as literature-based discovery and hypothesis generation.
  • Embedding scored items in a larger withheld-answer set reduces the risk that high performance stems from training-data contamination (a rough construction of this release design is sketched below).
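
As a rough illustration of that release design, and only under assumptions (the pool sizes, field names, and function are hypothetical, not the authors' release code), the scored items can be mixed into a much larger public pool with all answers stripped, so that downloading the pool reveals neither which items are scored nor what their answers are:

```python
# Sketch of the withheld-answer release: hide the scored items in a larger
# public pool of questions with answers stripped. Pool sizes and field names
# are illustrative assumptions, not the authors' release code.
import random

def build_public_pool(scored_items, decoy_items, seed=1234):
    """Return a shuffled question-only pool; which items are scored stays hidden."""
    pool = [{"id": it["id"], "question": it["question"]}  # answers dropped
            for it in scored_items + decoy_items]
    random.Random(seed).shuffle(pool)
    return pool

# e.g. 1,000 scored items mixed with 9,000 unscored ones -> 10,000 public questions
```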

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the two-article requirement holds, training objectives that reward explicit cross-document chaining could close the gap between current model performance and the benchmark's demands.
  • The same curation-plus-hidden-set method could be applied to create multi-hop tests in non-biomedical domains where contamination is also a concern.
  • Persistent low performance on MedHopQA would point to specific limits in how current LLMs integrate distributed factual knowledge rather than isolated recall.
  • Extending the framework to questions that draw on more than two sources would create a natural next test of deeper compositional reasoning.

Load-bearing premise

The expert-curated questions genuinely require synthesis across two distinct sources rather than being solvable by single-article lookup or surface pattern matching.

What would settle it

A model that reaches high accuracy on the scored questions when shown only one of the two source articles per question, or top leaderboard scores that saturate shortly after release, would undercut the claim that the benchmark demands cross-article synthesis.
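
One way to run the first of those tests, sketched here under assumed interfaces (the ask_model function, the is_correct scorer, and the item fields are hypothetical stand-ins; no such harness ships with the paper), is to score each question under three evidence conditions and compare:

```python
# Sketch of the single-article ablation described above. ask_model, is_correct,
# and the item fields ("question", "article_1", "article_2") are hypothetical
# stand-ins; the paper does not provide this harness.
def single_article_ablation(items, ask_model, is_correct):
    conditions = {
        "article_1_only": lambda it: [it["article_1"]],
        "article_2_only": lambda it: [it["article_2"]],
        "both_articles":  lambda it: [it["article_1"], it["article_2"]],
    }
    accuracy = {}
    for name, evidence in conditions.items():
        correct = sum(is_correct(it, ask_model(it["question"], evidence(it))) for it in items)
        accuracy[name] = correct / len(items)
    return accuracy

# If either single-article accuracy approaches the both-articles accuracy,
# the claimed two-hop requirement is not doing much work.
```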

read the original abstract

Evaluating large language models (LLMs) in the biomedical domain requires benchmarks that can distinguish reasoning from pattern matching and remain discriminative as model capabilities improve. Existing biomedical question answering (QA) benchmarks are limited in this respect. Multiple-choice formats can allow models to succeed through answer elimination rather than inference, while widely circulated exam-style datasets are increasingly vulnerable to performance saturation and training data contamination. Multi-hop reasoning, defined as the ability to integrate information across multiple sources to derive an answer, is central to clinically meaningful tasks such as diagnostic support, literature-based discovery, and hypothesis generation, yet remains underrepresented in current biomedical QA benchmarks. MedHopQA is a disease-centered multi-hop reasoning benchmark consisting of 1,000 expert-curated question-answer pairs introduced as a shared task at BioCreative IX. Each question requires synthesis of information across two distinct Wikipedia articles, and answers are provided in an open-ended free-text format. Gold annotations are augmented with ontology-grounded synonym sets from MONDO, NCBI Gene, and NCBI Taxonomy to support both lexical and concept-level evaluation. MedHopQA was constructed through a structured process combining human annotation, triage, iterative verification, and LLM-as-a-judge validation. To reduce leaderboard gaming and contamination risk, the 1,000 scored questions are embedded within a publicly downloadable set of 10,000 questions, with answers withheld, on a CodaBench leaderboard. MedHopQA provides both a benchmark and a reusable framework for constructing future biomedical QA datasets that prioritize compositional reasoning, saturation resistance, and contamination resistance as core design constraints.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces MedHopQA, a benchmark of 1,000 expert-curated open-ended QA pairs for disease-centered multi-hop reasoning in biomedicine. Each question is constructed to require synthesis of information from two distinct Wikipedia articles, with gold answers augmented by ontology-grounded synonym sets from MONDO, NCBI Gene, and NCBI Taxonomy. The dataset is built via a process of human annotation, triage, iterative verification, and LLM-as-a-judge validation, and is embedded within a public set of 10,000 questions (answers withheld) on a CodaBench leaderboard to reduce contamination risk. The work also presents a reusable framework for future biomedical QA datasets that emphasize compositional reasoning, saturation resistance, and contamination resistance.

Significance. If the multi-hop property holds and the questions cannot be solved via single-article lookup or surface patterns, MedHopQA would address a clear gap in existing biomedical QA benchmarks, which often permit success through elimination or memorization rather than inference. The ontology-augmented evaluation and the 10k-withheld-answers design are concrete strengths that support lexical/concept-level scoring and long-term leaderboard utility. The reusable framework could help future dataset creators enforce similar constraints. These elements would be valuable for advancing LLM evaluation in clinically relevant tasks such as diagnostic support and literature-based discovery.

major comments (2)
  1. [Abstract] The assertion that 'each question requires synthesis of information across two distinct Wikipedia articles' is presented without any quantitative validation, such as single-article retrieval accuracy, inter-annotator agreement on source necessity, or ablation results showing performance degradation when one article is withheld. This directly undermines the central claim that the benchmark tests multi-hop reasoning rather than single-source lookup or pattern matching.
  2. [Construction process] No inter-annotator agreement statistics, example question breakdowns, or empirical checks (e.g., human or model performance on single vs. dual articles) are reported to confirm that the human annotation plus LLM-as-a-judge pipeline produces genuinely compositional items. This is load-bearing for both the benchmark's validity and the reusable framework's claimed properties.
minor comments (2)
  1. [Abstract] The abstract would benefit from one or two concrete example questions to illustrate the multi-hop requirement and the open-ended answer format.
  2. [Dataset release] Clarify how the 1,000 scored questions are selected from the 10,000 public set and whether any leakage-prevention measures (e.g., temporal or source filtering) are applied beyond answer withholding.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which correctly identify the need for stronger empirical support of the multi-hop claims. We address each major comment below. Where the manuscript lacks quantitative validation, we agree that revisions are required and will incorporate the suggested analyses.

read point-by-point responses
  1. Referee: [Abstract] The assertion that 'each question requires synthesis of information across two distinct Wikipedia articles' is presented without any quantitative validation, such as single-article retrieval accuracy, inter-annotator agreement on source necessity, or ablation results showing performance degradation when one article is withheld. This directly undermines the central claim that the benchmark tests multi-hop reasoning rather than single-source lookup or pattern matching.

    Authors: We agree that the current manuscript presents the multi-hop requirement as a design property without accompanying quantitative evidence. The claim originates from the annotation guidelines, which explicitly required questions to draw non-redundant information from two distinct articles, followed by human triage and LLM-as-a-judge consistency checks. However, we did not report single-article ablations, retrieval accuracy, or source-necessity agreement. In the revision we will add a dedicated validation subsection that includes: (i) LLM performance on each question when provided with only the first article, only the second article, or both; (ii) retrieval accuracy of the two source articles given the question; and (iii) inter-annotator agreement on whether both articles are necessary, measured on a 100-question subset. These results will be reported in the main text and will directly test whether single-article lookup suffices. revision: yes

  2. Referee: [Construction process] No inter-annotator agreement statistics, example question breakdowns, or empirical checks (e.g., human or model performance on single vs. dual articles) are reported to confirm that the human annotation plus LLM-as-a-judge pipeline produces genuinely compositional items. This is load-bearing for both the benchmark's validity and the reusable framework's claimed properties.

    Authors: We acknowledge that the construction section currently lacks these supporting statistics and examples. The process consisted of expert biomedical annotators, iterative triage, and LLM-as-a-judge validation, but specific inter-annotator agreement figures and per-question breakdowns were omitted. In the revised manuscript we will: (1) provide 2–3 fully worked example questions with explicit mapping of required facts to each Wikipedia article; (2) report inter-annotator agreement (Cohen’s kappa) on question validity and source necessity for a randomly sampled subset of 200 items; and (3) include the single-vs-dual article model performance results described in the response to the first comment. These additions will also illustrate how the reusable framework can enforce compositional checks in future datasets. revision: yes
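
For the agreement statistic the rebuttal proposes, a minimal two-rater Cohen's kappa over binary "both articles necessary" judgments could look like the sketch below; the labels are illustrative and no such annotations exist in the current manuscript.

```python
# Minimal two-rater Cohen's kappa over binary "both articles necessary"
# judgments, the statistic the simulated rebuttal proposes to report.
# The example labels are illustrative, not real annotation data.
def cohens_kappa(rater_a, rater_b):
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    p_a = sum(rater_a) / n          # rater A's rate of "necessary"
    p_b = sum(rater_b) / n          # rater B's rate of "necessary"
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    return 1.0 if expected == 1 else (observed - expected) / (1 - expected)

# e.g. cohens_kappa([1, 1, 0, 1, 0], [1, 0, 0, 1, 0]) -> 0.615...
```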

Circularity Check

0 steps flagged

No circularity: dataset construction with no derivations or self-referential reductions

full rationale

The paper describes the manual curation, triage, verification, and LLM-assisted validation of a 1,000-question multi-hop QA benchmark drawn from Wikipedia articles. No equations, fitted parameters, or predictive derivations appear in the provided text or abstract. The central claim—that questions require synthesis across two distinct sources—is asserted via the construction protocol itself rather than being defined in terms of any output or prior self-result. No self-citation load-bearing steps, ansatz smuggling, or renaming of known results are present. The work is therefore self-contained as a descriptive benchmark-creation effort.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the assumptions that Wikipedia articles form reliable sources for biomedical multi-hop reasoning, that the described human-plus-LLM curation process yields high-quality questions, and that embedding the test set inside a larger public collection sufficiently mitigates contamination.

axioms (2)
  • domain assumption Expert human annotation combined with LLM-as-a-judge validation produces questions that require genuine cross-article synthesis.
    Invoked in the abstract's description of the construction process.
  • domain assumption Ontology-grounded synonym sets from MONDO, NCBI Gene, and NCBI Taxonomy enable reliable concept-level evaluation.
    Stated as part of the gold annotation augmentation.

pith-pipeline@v0.9.0 · 5645 in / 1398 out tokens · 77998 ms · 2026-05-13T04:16:50.656949+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.


Reference graph

Works this paper leans on

47 extracted references · 47 canonical work pages

  [1] Islamaj R, Leaman R, Kim S, Kwon D, Wei CH, Comeau DC, Peng Y, Cissel D, Coss C, Fisher C et al: NLM-Chem, a new resource for chemical entity recognition in PubMed full text literature. Sci Data 2021, 8(1):91
  [2] Islamaj R, Wei CH, Lai PT, Luo L, Coss C, Gokal Kochar P, Miliaras N, Rodionov O, Sekiya K, Trinh D et al: The biomedical relationship corpus of the BioRED track at the BioCreative VIII challenge and workshop. Database (Oxford) 2024, 2024
  [3] Dogan RI, Leaman R, Lu Z: NCBI disease corpus: a resource for disease name recognition and concept normalization. J Biomed Inform 2014, 47:1-10
  [4] Hirschman L, Yeh A, Blaschke C, Valencia A: Overview of BioCreAtIvE: critical assessment of information extraction for biology. BMC Bioinformatics 2005, 6 Suppl 1(Suppl 1):S1
  [5] Krallinger M, Morgan A, Smith L, Leitner F, Tanabe L, Wilbur J, Hirschman L, Valencia A: Evaluation of text-mining systems for biology: overview of the Second BioCreative community challenge. Genome Biol 2008, 9 Suppl 2(Suppl 2):S1
  [6] Krallinger M, Vazquez M, Leitner F, Salgado D, Chatr-Aryamontri A, Winter A, Perfetto L, Briganti L, Licata L, Iannuccelli M et al: The Protein-Protein Interaction tasks of BioCreative III: classification/ranking of articles and linking bio-ontology concepts to full text. BMC Bioinformatics 2011, 12 Suppl 8(Suppl 8):S3
  [7] Kim J-D, Ohta T, Pyysalo S, Kano Y, Tsujii Ji: Overview of BioNLP'09 Shared Task on Event Extraction. In: June 2009; Boulder, Colorado. Association for Computational Linguistics: 1-9
  [8] Li J, Sun Y, Johnson R, Sciaky D, Wei CH, Leaman R, Davis AP, Mattingly CJ, Wiegers TC, Lu Z: Annotating chemicals, diseases, and their interactions in biomedical literature. Proceedings of the fifth BioCreative challenge evaluation workshop 2015:173-182
  [9] Bommasani R, Hudson DA, Adeli E, Altman R, Arora S, von Arx S, Bernstein MS, Bohg J, Bosselut A, Brunskill E: On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258 2021
  [10] Singhal K, Azizi S, Tu T, Mahdavi SS, Wei J, Chung HW, Scales N, Tanwani A, Cole-Lewis H, Pfohl S et al: Large language models encode clinical knowledge. Nature 2023, 620(7972):172-180
  [11] Thirunavukarasu AJ, Ting DSJ, Elangovan K, Gutierrez L, Tan TF, Ting DSW: Large language models in medicine. Nature Medicine 2023, 29(8):1930-1940
  [12] Jin D, Pan E, Oufattole N, Weng W-H, Fang H, Szolovits P: What Disease Does This Patient Have? A Large-Scale Open Domain Question Answering Dataset from Medical Exams. Applied Sciences 2021, 11(14):6421
  [13] Pal A, Umapathi LK, Sankarasubbu M: MedMCQA: A Large-scale Multi-Subject Multi-Choice Dataset for Medical domain Question Answering. In: Proceedings of the Conference on Health, Inference, and Learning; Proceedings of Machine Learning Research. Edited by Gerardo F, George HC, Tom P, Joyce CH, Tristan N. PMLR 2022: 248-260
  [14] Jin Q, Dhingra B, Liu Z, Cohen W, Lu X: PubMedQA: A Dataset for Biomedical Research Question Answering. In: November 2019; Hong Kong, China. Association for Computational Linguistics: 2567-2577
  [15] Hendrycks D, Burns C, Basart S, Zou A, Mazeika M, Song D, Steinhardt J: Measuring massive multitask language understanding. In: International Conference on Learning Representations: 2021
  [16] MedQA: Evaluating language model bias in medical questions. [https://www.vals.ai/benchmarks/medqa]
  [17] Massive Multitask Language Understanding (MMLU) on HELM [https://crfm.stanford.edu/helm/mmlu/latest/]
  [18] Liang P, Bommasani R, Lee T, Tsipras D, Soylu D, Yasunaga M, Zhang Y, Narayanan D, Wu Y, Kumar A: Holistic evaluation of language models. Transactions on Machine Learning Research 2023
  [19] Justen L: LLMs outperform experts on challenging biology benchmarks. arXiv preprint arXiv:2505.06108 2025
  [20] Golchin S, Surdeanu M: Time Travel in LLMs: Tracing Data Contamination in Large Language Models. In: The Twelfth International Conference on Learning Representations (ICLR): 2024
  [21] Sainz O, Campos J, García-Ferrero I, Etxaniz J, de Lacalle OL, Agirre E: NLP Evaluation in trouble: On the Need to Measure LLM Data Contamination for each Benchmark. In: December 2023; Singapore. Association for Computational Linguistics: 10776-10787
  [22] Islamaj R, Chan J, Leaman R, Lu Z: Overview of the MedHopQA track at BioCreative IX: track description, participation and evaluation of systems for multi-hop medical question answering. In: Proceedings of the BioCreative IX Challenge and Workshop (BC9): Large Language Models for Clinical and Biomedical NLP at the International Joint Conference on Artifici...
  [23] Tsatsaronis G, Balikas G, Malakasiotis P, Partalas I, Zschunke M, Alvers MR, Weissenborn D, Krithara A, Petridis S, Polychronopoulos D et al: An overview of the BIOASQ large-scale biomedical semantic indexing and question answering competition. BMC Bioinformatics 2015, 16:138
  [24] Pappas D, Androutsopoulos I, Papageorgiou H: BioRead: A New Dataset for Biomedical Reading Comprehension. In: May 2018; Miyazaki, Japan. European Language Resources Association (ELRA)
  [25] Pampari A, Raghavan P, Liang J, Peng J: emrQA: A large corpus for question answering on electronic medical records. In: Proceedings of the 2018 conference on empirical methods in natural language processing: 2018. 2357-2368
  [26] Romanov A, Shivade C: Lessons from Natural Language Inference in the Clinical Domain. In: October-November 2018; Brussels, Belgium. Association for Computational Linguistics: 1586-1596
  [27] Soni S, Gudala M, Pajouhi A, Roberts K: RadQA: A Question Answering Dataset to Improve Comprehension of Radiology Reports. In: June 2022; Marseille, France. European Language Resources Association: 6250-6259
  [28] Nimo C, Olatunji T, Owodunni AT, Abdullahi T, Ayodele E, Sanni M, Aka EC, Omofoye F, Yuehgoh F, Faniran T et al: AfriMed-QA: A Pan-African, Multi-Specialty, Medical Question-Answering Benchmark Dataset. In: July 2025; Vienna, Austria. Association for Computational Linguistics: 1948-1973
  [29] Liu J, Liu S: HealthBench: Advancing AI evaluation in healthcare, but not yet clinically ready. Digit Health 2025, 11:20552076251390447
  [30] Manes I, Ronn N, Cohen D, Ilan Ber R, Horowitz-Kugler Z, Stanovsky G: K-QA: A Real-World Medical Q&A Benchmark. In: August 2024; Bangkok, Thailand. Association for Computational Linguistics: 277-294
  [31] Vladika J, Schneider P, Matthes F: MedREQAL: Examining Medical Knowledge Recall of Large Language Models via Question Answering. In: August 2024; Bangkok, Thailand. Association for Computational Linguistics: 14459-14469
  [32] Adams L, Busch F, Han T, Excoffier J-B, Ortala M, Löser A, Aerts HJWL, Kather JN, Truhn D, Bressem K: LongHealth: A Question Answering Benchmark with Long Clinical Documents. Journal of Healthcare Informatics Research 2025, 9(3):280-296
  [33] Colelough B, Bartels D, Demner-Fushman D: Overview of the ClinIQLink 2025 Shared Task on Medical Question-Answering. In: August 2025; Vienna, Austria. Association for Computational Linguistics: 378-387
  [34] Yang Z, Qi P, Zhang S, Bengio Y, Cohen W, Salakhutdinov R, Manning CD: HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering. In: October-November 2018; Brussels, Belgium. Association for Computational Linguistics: 2369-2380
  [35] Trivedi H, Balasubramanian N, Khot T, Sabharwal A: ♫ MuSiQue: Multihop Questions via Single-hop Question Composition. Transactions of the Association for Computational Linguistics 2022, 10:539-554
  [36] Ho X, Duong Nguyen A-K, Sugawara S, Aizawa A: Constructing A Multi-hop QA Dataset for Comprehensive Evaluation of Reasoning Steps. In: December 2020; Barcelona, Spain (Online). International Committee on Computational Linguistics: 6609-6625
  [37] Welbl J, Stenetorp P, Riedel S: Constructing Datasets for Multi-hop Reading Comprehension Across Documents. Transactions of the Association for Computational Linguistics 2018, 6:287-302
  [38] Kim Y, Abdulle Y, Wu H: BioHopR: A Benchmark for Multi-Hop, Multi-Answer Reasoning in Biomedical Domain. In: July 2025; Vienna, Austria. Association for Computational Linguistics: 12894-12908
  [39] Ben Abacha A, Mrabet Y, Zhang Y, Shivade C, Langlotz C, Demner-Fushman D: Overview of the MEDIQA 2021 Shared Task on Summarization in the Medical Domain. In: June 2021; Online. Association for Computational Linguistics: 74-85
  [40] Ben Abacha A, Shivade C, Demner-Fushman D: Overview of the MEDIQA 2019 Shared Task on Textual Inference, Question Entailment and Question Answering. In: August 2019; Florence, Italy. Association for Computational Linguistics: 370-379
  [41] Möller T, Reina A, Jayakumar R, Pietsch M: COVID-QA: A Question Answering Dataset for COVID-19. In: July 2020; Online. Association for Computational Linguistics
  [42] Zhu M, Ahuja A, Juan D-C, Wei W, Reddy CK: Question Answering with Long Multiple-Span Answers. In: November 2020; Online. Association for Computational Linguistics: 3840-3849
  [43] Abacha AB, Agichtein E, Pinter Y, Demner-Fushman D: Overview of the Medical Question Answering Task at TREC 2017 LiveQA. In: Text Retrieval Conference: 2017
  [44] Kell G, Roberts A, Umansky S, Khare Y, Ahmed N, Patel N, Simela C, Coumbe J, Rozario J, Griffiths R-R: RealMedQA: A pilot biomedical question answering dataset containing realistic clinical questions. In: AMIA Annual Symposium Proceedings: 2025. 590
  [45] Kim Y, Wu J, Abdulle Y, Wu H: MedExQA: Medical Question Answering Benchmark with Multiple Explanations. In: August 2024; Bangkok, Thailand. Association for Computational Linguistics: 167-181
  [46] Rogoz AC, Ionescu RT, Anghel AV, Antone-Iordache IL, Coniac S, Ionescu AI: A large-scale benchmark for evaluating large language models on medical question answering in Romanian. NPJ Digit Med 2026, 9(1)
  [47] Jin Q, Kim W, Chen Q, Comeau DC, Yeganova L, Wilbur WJ, Lu Z: MedCPT: Contrastive Pre-trained Transformers with large-scale PubMed search logs for zero-shot biomedical information retrieval. Bioinformatics 2023, 39(11)