pith. machine review for the scientific record.

arxiv: 2605.12361 · v1 · submitted 2026-05-12 · 💻 cs.CL · cs.AI · cs.IR

Recognition: 2 theorem links · Lean Theorem

MedHopQA: A Disease-Centered Multi-Hop Reasoning Benchmark and Evaluation Framework for LLM-Based Biomedical Question Answering

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 04:16 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.IR
keywords biomedical question answering · multi-hop reasoning · LLM evaluation · benchmark dataset · disease-centered QA · contamination resistance · compositional reasoning · open-ended QA

The pith

MedHopQA provides a benchmark of 1,000 expert questions that each require synthesizing information from two distinct Wikipedia articles to answer open-ended biomedical queries about diseases.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that existing biomedical QA benchmarks let models succeed through pattern matching, answer elimination, or memorized data rather than genuine reasoning across sources. It introduces MedHopQA as a set of 1,000 disease-centered questions built so that each one needs facts from two separate articles, presented in free-text form with ontology synonym sets for flexible scoring. The questions sit inside a larger public collection of 10,000 items whose answers are withheld on a leaderboard, and the dataset was assembled through human curation plus LLM-assisted validation. This design is offered both as a ready benchmark and as a reusable construction process meant to keep future datasets resistant to saturation and contamination. If the approach works, evaluation of LLMs for tasks such as diagnostic support or literature discovery will rest on harder-to-game tests of compositional reasoning.

Core claim

MedHopQA is a disease-centered multi-hop reasoning benchmark consisting of 1,000 expert-curated question-answer pairs. Each question requires synthesis of information across two distinct Wikipedia articles and is supplied in open-ended free-text format, augmented with ontology-grounded synonym sets from MONDO, NCBI Gene, and NCBI Taxonomy for lexical and concept-level evaluation. The dataset was constructed through a structured process of human annotation, triage, iterative verification, and LLM-as-a-judge validation, then embedded within a publicly available collection of 10,000 questions with answers withheld on a CodaBench leaderboard to limit contamination.
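
For concreteness, here is a minimal sketch, not taken from the paper, of what a MedHopQA-style item and the two-level scoring it implies could look like: a lexical match against the gold free-text answer and a concept-level match against the ontology-grounded synonym set. The field names, the normalization, and the example item are illustrative assumptions, not the authors' schema or evaluation code.

```python
# Minimal sketch of a MedHopQA-style record and two-level scoring.
# Field names, normalization, and matching rules are illustrative
# assumptions, not the paper's actual schema or evaluation harness.
from dataclasses import dataclass, field

@dataclass
class MedHopItem:
    question: str
    gold_answer: str                                     # free-text gold answer
    synonym_set: set[str] = field(default_factory=set)   # MONDO / NCBI Gene / NCBI Taxonomy synonyms
    source_articles: tuple[str, str] = ("", "")          # two distinct Wikipedia titles

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace; a stand-in for the benchmark's normalization."""
    return " ".join(text.lower().split())

def score(item: MedHopItem, prediction: str) -> dict[str, bool]:
    pred = normalize(prediction)
    lexical = pred == normalize(item.gold_answer)
    concept = lexical or pred in {normalize(s) for s in item.synonym_set}
    return {"lexical_match": lexical, "concept_match": concept}

# Fabricated example, for illustration only:
item = MedHopItem(
    question="Which gene is mutated in the hereditary disease characterized by thick mucus secretions?",
    gold_answer="CFTR",
    synonym_set={"cystic fibrosis transmembrane conductance regulator", "CFTR gene"},
    source_articles=("Cystic fibrosis", "CFTR"),
)
print(score(item, "cystic fibrosis transmembrane conductance regulator"))
# -> {'lexical_match': False, 'concept_match': True}
```

The synonym sets are what let an open-ended answer be credited at the concept level even when the surface string differs from the gold annotation.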

What carries the argument

The requirement that each question integrates facts from exactly two distinct Wikipedia articles, enforced by expert curation and a hidden-answer leaderboard structure that separates the scored items from the larger public set.

If this is right

  • Models must demonstrate cross-article inference instead of single-document lookup or elimination strategies to perform well.
  • The construction process can be reused to generate additional biomedical or domain-specific multi-hop datasets that maintain discriminative power.
  • Evaluation at both surface and concept levels is supported by the provided ontology synonym sets.
  • The benchmark can serve as a more durable test for clinically relevant capabilities such as literature-based discovery and hypothesis generation.
  • Embedding scored items in a larger withheld-answer set reduces the risk that high performance stems from training-data contamination (a rough construction of this release design is sketched below).
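
As a rough illustration of that release design, and only under assumptions (the pool sizes, field names, and function are hypothetical, not the authors' release code), the scored items can be mixed into a much larger public pool with all answers stripped, so that downloading the pool reveals neither which items are scored nor what their answers are:

```python
# Sketch of the withheld-answer release: hide the scored items in a larger
# public pool of questions with answers stripped. Pool sizes and field names
# are illustrative assumptions, not the authors' release code.
import random

def build_public_pool(scored_items, decoy_items, seed=1234):
    """Return a shuffled question-only pool; which items are scored stays hidden."""
    pool = [{"id": it["id"], "question": it["question"]}  # answers dropped
            for it in scored_items + decoy_items]
    random.Random(seed).shuffle(pool)
    return pool

# e.g. 1,000 scored items mixed with 9,000 unscored ones -> 10,000 public questions
```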

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the two-article requirement holds, training objectives that reward explicit cross-document chaining could close the gap between current model performance and the benchmark's demands.
  • The same curation-plus-hidden-set method could be applied to create multi-hop tests in non-biomedical domains where contamination is also a concern.
  • Persistent low performance on MedHopQA would point to specific limits in how current LLMs integrate distributed factual knowledge rather than isolated recall.
  • Extending the framework to questions that draw on more than two sources would create a natural next test of deeper compositional reasoning.

Load-bearing premise

The expert-curated questions genuinely require synthesis across two distinct sources rather than being solvable by single-article lookup or surface pattern matching.

What would settle it

A model that reaches high accuracy on the scored questions when shown only one of the two source articles per question, or top leaderboard scores that saturate shortly after release, would undercut the claim that the benchmark demands cross-article synthesis.
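
One way to run the first of those tests, sketched here under assumed interfaces (the ask_model function, the is_correct scorer, and the item fields are hypothetical stand-ins; no such harness ships with the paper), is to score each question under three evidence conditions and compare:

```python
# Sketch of the single-article ablation described above. ask_model, is_correct,
# and the item fields ("question", "article_1", "article_2") are hypothetical
# stand-ins; the paper does not provide this harness.
def single_article_ablation(items, ask_model, is_correct):
    conditions = {
        "article_1_only": lambda it: [it["article_1"]],
        "article_2_only": lambda it: [it["article_2"]],
        "both_articles":  lambda it: [it["article_1"], it["article_2"]],
    }
    accuracy = {}
    for name, evidence in conditions.items():
        correct = sum(is_correct(it, ask_model(it["question"], evidence(it))) for it in items)
        accuracy[name] = correct / len(items)
    return accuracy

# If either single-article accuracy approaches the both-articles accuracy,
# the claimed two-hop requirement is not doing much work.
```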

read the original abstract

Evaluating large language models (LLMs) in the biomedical domain requires benchmarks that can distinguish reasoning from pattern matching and remain discriminative as model capabilities improve. Existing biomedical question answering (QA) benchmarks are limited in this respect. Multiple-choice formats can allow models to succeed through answer elimination rather than inference, while widely circulated exam-style datasets are increasingly vulnerable to performance saturation and training data contamination. Multi-hop reasoning, defined as the ability to integrate information across multiple sources to derive an answer, is central to clinically meaningful tasks such as diagnostic support, literature-based discovery, and hypothesis generation, yet remains underrepresented in current biomedical QA benchmarks. MedHopQA is a disease-centered multi-hop reasoning benchmark consisting of 1,000 expert-curated question-answer pairs introduced as a shared task at BioCreative IX. Each question requires synthesis of information across two distinct Wikipedia articles, and answers are provided in an open-ended free-text format. Gold annotations are augmented with ontology-grounded synonym sets from MONDO, NCBI Gene, and NCBI Taxonomy to support both lexical and concept-level evaluation. MedHopQA was constructed through a structured process combining human annotation, triage, iterative verification, and LLM-as-a-judge validation. To reduce leaderboard gaming and contamination risk, the 1,000 scored questions are embedded within a publicly downloadable set of 10,000 questions, with answers withheld, on a CodaBench leaderboard. MedHopQA provides both a benchmark and a reusable framework for constructing future biomedical QA datasets that prioritize compositional reasoning, saturation resistance, and contamination resistance as core design constraints.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces MedHopQA, a benchmark of 1,000 expert-curated open-ended QA pairs for disease-centered multi-hop reasoning in biomedicine. Each question is constructed to require synthesis of information from two distinct Wikipedia articles, with gold answers augmented by ontology-grounded synonym sets from MONDO, NCBI Gene, and NCBI Taxonomy. The dataset is built via a process of human annotation, triage, iterative verification, and LLM-as-a-judge validation, and is embedded within a public set of 10,000 questions (answers withheld) on a CodaBench leaderboard to reduce contamination risk. The work also presents a reusable framework for future biomedical QA datasets that emphasize compositional reasoning, saturation resistance, and contamination resistance.

Significance. If the multi-hop property holds and the questions cannot be solved via single-article lookup or surface patterns, MedHopQA would address a clear gap in existing biomedical QA benchmarks, which often permit success through elimination or memorization rather than inference. The ontology-augmented evaluation and the 10k-withheld-answers design are concrete strengths that support lexical/concept-level scoring and long-term leaderboard utility. The reusable framework could help future dataset creators enforce similar constraints. These elements would be valuable for advancing LLM evaluation in clinically relevant tasks such as diagnostic support and literature-based discovery.

major comments (2)
  1. [Abstract] The assertion that 'each question requires synthesis of information across two distinct Wikipedia articles' is presented without any quantitative validation, such as single-article retrieval accuracy, inter-annotator agreement on source necessity, or ablation results showing performance degradation when one article is withheld. This directly undermines the central claim that the benchmark tests multi-hop reasoning rather than single-source lookup or pattern matching.
  2. [Construction process] No inter-annotator agreement statistics, example question breakdowns, or empirical checks (e.g., human or model performance on single vs. dual articles) are reported to confirm that the human annotation plus LLM-as-a-judge pipeline produces genuinely compositional items. This is load-bearing for both the benchmark's validity and the reusable framework's claimed properties.
minor comments (2)
  1. [Abstract] The abstract would benefit from one or two concrete example questions to illustrate the multi-hop requirement and the open-ended answer format.
  2. [Dataset release] Clarify how the 1,000 scored questions are selected from the 10,000 public set and whether any leakage-prevention measures (e.g., temporal or source filtering) are applied beyond answer withholding.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which correctly identify the need for stronger empirical support of the multi-hop claims. We address each major comment below. Where the manuscript lacks quantitative validation, we agree that revisions are required and will incorporate the suggested analyses.

read point-by-point responses
  1. Referee: [Abstract] The assertion that 'each question requires synthesis of information across two distinct Wikipedia articles' is presented without any quantitative validation, such as single-article retrieval accuracy, inter-annotator agreement on source necessity, or ablation results showing performance degradation when one article is withheld. This directly undermines the central claim that the benchmark tests multi-hop reasoning rather than single-source lookup or pattern matching.

    Authors: We agree that the current manuscript presents the multi-hop requirement as a design property without accompanying quantitative evidence. The claim originates from the annotation guidelines, which explicitly required questions to draw non-redundant information from two distinct articles, followed by human triage and LLM-as-a-judge consistency checks. However, we did not report single-article ablations, retrieval accuracy, or source-necessity agreement. In the revision we will add a dedicated validation subsection that includes: (i) LLM performance on each question when provided with only the first article, only the second article, or both; (ii) retrieval accuracy of the two source articles given the question; and (iii) inter-annotator agreement on whether both articles are necessary, measured on a 100-question subset. These results will be reported in the main text and will directly test whether single-article lookup suffices. revision: yes

  2. Referee: [Construction process] No inter-annotator agreement statistics, example question breakdowns, or empirical checks (e.g., human or model performance on single vs. dual articles) are reported to confirm that the human annotation plus LLM-as-a-judge pipeline produces genuinely compositional items. This is load-bearing for both the benchmark's validity and the reusable framework's claimed properties.

    Authors: We acknowledge that the construction section currently lacks these supporting statistics and examples. The process consisted of expert biomedical annotators, iterative triage, and LLM-as-a-judge validation, but specific inter-annotator agreement figures and per-question breakdowns were omitted. In the revised manuscript we will: (1) provide 2–3 fully worked example questions with explicit mapping of required facts to each Wikipedia article; (2) report inter-annotator agreement (Cohen’s kappa) on question validity and source necessity for a randomly sampled subset of 200 items; and (3) include the single-vs-dual article model performance results described in the response to the first comment. These additions will also illustrate how the reusable framework can enforce compositional checks in future datasets. revision: yes
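
For the agreement statistic the rebuttal proposes, a minimal two-rater Cohen's kappa over binary "both articles necessary" judgments could look like the sketch below; the labels are illustrative and no such annotations exist in the current manuscript.

```python
# Minimal two-rater Cohen's kappa over binary "both articles necessary"
# judgments, the statistic the simulated rebuttal proposes to report.
# The example labels are illustrative, not real annotation data.
def cohens_kappa(rater_a, rater_b):
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    p_a = sum(rater_a) / n          # rater A's rate of "necessary"
    p_b = sum(rater_b) / n          # rater B's rate of "necessary"
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    return 1.0 if expected == 1 else (observed - expected) / (1 - expected)

# e.g. cohens_kappa([1, 1, 0, 1, 0], [1, 0, 0, 1, 0]) -> 0.615...
```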

Circularity Check

0 steps flagged

No circularity: dataset construction with no derivations or self-referential reductions

full rationale

The paper describes the manual curation, triage, verification, and LLM-assisted validation of a 1,000-question multi-hop QA benchmark drawn from Wikipedia articles. No equations, fitted parameters, or predictive derivations appear in the provided text or abstract. The central claim—that questions require synthesis across two distinct sources—is asserted via the construction protocol itself rather than being defined in terms of any output or prior self-result. No self-citation load-bearing steps, ansatz smuggling, or renaming of known results are present. The work is therefore self-contained as a descriptive benchmark-creation effort.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the assumptions that Wikipedia articles form reliable sources for biomedical multi-hop reasoning, that the described human-plus-LLM curation process yields high-quality questions, and that embedding the test set inside a larger public collection sufficiently mitigates contamination.

axioms (2)
  • domain assumption Expert human annotation combined with LLM-as-a-judge validation produces questions that require genuine cross-article synthesis.
    Invoked in the abstract's description of the construction process.
  • domain assumption Ontology-grounded synonym sets from MONDO, NCBI Gene, and NCBI Taxonomy enable reliable concept-level evaluation.
    Stated as part of the gold annotation augmentation.

pith-pipeline@v0.9.0 · 5645 in / 1398 out tokens · 77998 ms · 2026-05-13T04:16:50.656949+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.


Reference graph

Works this paper leans on

47 extracted references · 47 canonical work pages

  [1] Islamaj R, Leaman R, Kim S, Kwon D, Wei CH, Comeau DC, Peng Y, Cissel D, Coss C, Fisher C et al: NLM-Chem, a new resource for chemical entity recognition in PubMed full text literature. Sci Data 2021, 8(1):91
  [2] Islamaj R, Wei CH, Lai PT, Luo L, Coss C, Gokal Kochar P, Miliaras N, Rodionov O, Sekiya K, Trinh D et al: The biomedical relationship corpus of the BioRED track at the BioCreative VIII challenge and workshop. Database (Oxford) 2024, 2024
  [3] Dogan RI, Leaman R, Lu Z: NCBI disease corpus: a resource for disease name recognition and concept normalization. J Biomed Inform 2014, 47:1-10
  [4] Hirschman L, Yeh A, Blaschke C, Valencia A: Overview of BioCreAtIvE: critical assessment of information extraction for biology. BMC Bioinformatics 2005, 6 Suppl 1(Suppl 1):S1
  [5] Krallinger M, Morgan A, Smith L, Leitner F, Tanabe L, Wilbur J, Hirschman L, Valencia A: Evaluation of text-mining systems for biology: overview of the Second BioCreative community challenge. Genome Biol 2008, 9 Suppl 2(Suppl 2):S1
  [6] Krallinger M, Vazquez M, Leitner F, Salgado D, Chatr-Aryamontri A, Winter A, Perfetto L, Briganti L, Licata L, Iannuccelli M et al: The Protein-Protein Interaction tasks of BioCreative III: classification/ranking of articles and linking bio-ontology concepts to full text. BMC Bioinformatics 2011, 12 Suppl 8(Suppl 8):S3
  [7] Kim J-D, Ohta T, Pyysalo S, Kano Y, Tsujii Ji: Overview of BioNLP'09 Shared Task on Event Extraction. In: June 2009; Boulder, Colorado. Association for Computational Linguistics: 1-9
  [8] Li J, Sun Y, Johnson R, Sciaky D, Wei CH, Leaman R, Davis AP, Mattingly CJ, Wiegers TC, Lu Z: Annotating chemicals, diseases, and their interactions in biomedical literature. Proceedings of the fifth BioCreative challenge evaluation workshop 2015:173-182
  [9] Bommasani R, Hudson DA, Adeli E, Altman R, Arora S, von Arx S, Bernstein MS, Bohg J, Bosselut A, Brunskill E: On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258 2021
  [10] Singhal K, Azizi S, Tu T, Mahdavi SS, Wei J, Chung HW, Scales N, Tanwani A, Cole-Lewis H, Pfohl S et al: Large language models encode clinical knowledge. Nature 2023, 620(7972):172-180
  [11] Thirunavukarasu AJ, Ting DSJ, Elangovan K, Gutierrez L, Tan TF, Ting DSW: Large language models in medicine. Nature Medicine 2023, 29(8):1930-1940
  [12] Jin D, Pan E, Oufattole N, Weng W-H, Fang H, Szolovits P: What Disease Does This Patient Have? A Large-Scale Open Domain Question Answering Dataset from Medical Exams. Applied Sciences 2021, 11(14):6421
  [13] Pal A, Umapathi LK, Sankarasubbu M: MedMCQA: A Large-scale Multi-Subject Multi-Choice Dataset for Medical domain Question Answering. In: Proceedings of the Conference on Health, Inference, and Learning; Proceedings of Machine Learning Research. Edited by Gerardo F, George HC, Tom P, Joyce CH, Tristan N. PMLR 2022: 248-260
  [14] Jin Q, Dhingra B, Liu Z, Cohen W, Lu X: PubMedQA: A Dataset for Biomedical Research Question Answering. In: November 2019; Hong Kong, China. Association for Computational Linguistics: 2567-2577
  [15] Hendrycks D, Burns C, Basart S, Zou A, Mazeika M, Song D, Steinhardt J: Measuring massive multitask language understanding. In: International Conference on Learning Representations: 2021
  [16] MedQA: Evaluating language model bias in medical questions. [https://www.vals.ai/benchmarks/medqa]
  [17] Massive Multitask Language Understanding (MMLU) on HELM [https://crfm.stanford.edu/helm/mmlu/latest/]
  [18] Liang P, Bommasani R, Lee T, Tsipras D, Soylu D, Yasunaga M, Zhang Y, Narayanan D, Wu Y, Kumar A: Holistic evaluation of language models. Transactions on Machine Learning Research 2023
  [19] Justen L: LLMs outperform experts on challenging biology benchmarks. arXiv preprint arXiv:2505.06108 2025
  [20] Golchin S, Surdeanu M: Time Travel in LLMs: Tracing Data Contamination in Large Language Models. In: The Twelfth International Conference on Learning Representations (ICLR): 2024
  [21] Sainz O, Campos J, García-Ferrero I, Etxaniz J, de Lacalle OL, Agirre E: NLP Evaluation in trouble: On the Need to Measure LLM Data Contamination for each Benchmark. In: December 2023; Singapore. Association for Computational Linguistics: 10776-10787
  [22] Islamaj R, Chan J, Leaman R, Lu Z: Overview of the MedHopQA track at BioCreative IX: track description, participation and evaluation of systems for multi-hop medical question answering. In: Proceedings of the BioCreative IX Challenge and Workshop (BC9): Large Language Models for Clinical and Biomedical NLP at the International Joint Conference on Artifici...
  [23] Tsatsaronis G, Balikas G, Malakasiotis P, Partalas I, Zschunke M, Alvers MR, Weissenborn D, Krithara A, Petridis S, Polychronopoulos D et al: An overview of the BIOASQ large-scale biomedical semantic indexing and question answering competition. BMC Bioinformatics 2015, 16:138
  [24] Pappas D, Androutsopoulos I, Papageorgiou H: BioRead: A New Dataset for Biomedical Reading Comprehension. In: May 2018; Miyazaki, Japan. European Language Resources Association (ELRA)
  [25] Pampari A, Raghavan P, Liang J, Peng J: emrQA: A large corpus for question answering on electronic medical records. In: Proceedings of the 2018 conference on empirical methods in natural language processing: 2018. 2357-2368
  [26] Romanov A, Shivade C: Lessons from Natural Language Inference in the Clinical Domain. In: October-November 2018; Brussels, Belgium. Association for Computational Linguistics: 1586-1596
  [27] Soni S, Gudala M, Pajouhi A, Roberts K: RadQA: A Question Answering Dataset to Improve Comprehension of Radiology Reports. In: June 2022; Marseille, France. European Language Resources Association: 6250-6259
  [28] Nimo C, Olatunji T, Owodunni AT, Abdullahi T, Ayodele E, Sanni M, Aka EC, Omofoye F, Yuehgoh F, Faniran T et al: AfriMed-QA: A Pan-African, Multi-Specialty, Medical Question-Answering Benchmark Dataset. In: July 2025; Vienna, Austria. Association for Computational Linguistics: 1948-1973
  [29] Liu J, Liu S: HealthBench: Advancing AI evaluation in healthcare, but not yet clinically ready. Digit Health 2025, 11:20552076251390447
  [30] Manes I, Ronn N, Cohen D, Ilan Ber R, Horowitz-Kugler Z, Stanovsky G: K-QA: A Real-World Medical Q&A Benchmark. In: August 2024; Bangkok, Thailand. Association for Computational Linguistics: 277-294
  [31] Vladika J, Schneider P, Matthes F: MedREQAL: Examining Medical Knowledge Recall of Large Language Models via Question Answering. In: August 2024; Bangkok, Thailand. Association for Computational Linguistics: 14459-14469
  [32] Adams L, Busch F, Han T, Excoffier J-B, Ortala M, Löser A, Aerts HJWL, Kather JN, Truhn D, Bressem K: LongHealth: A Question Answering Benchmark with Long Clinical Documents. Journal of Healthcare Informatics Research 2025, 9(3):280-296
  [33] Colelough B, Bartels D, Demner-Fushman D: Overview of the ClinIQLink 2025 Shared Task on Medical Question-Answering. In: August 2025; Vienna, Austria. Association for Computational Linguistics: 378-387
  [34] Yang Z, Qi P, Zhang S, Bengio Y, Cohen W, Salakhutdinov R, Manning CD: HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering. In: October-November 2018; Brussels, Belgium. Association for Computational Linguistics: 2369-2380
  [35] Trivedi H, Balasubramanian N, Khot T, Sabharwal A: ♫ MuSiQue: Multihop Questions via Single-hop Question Composition. Transactions of the Association for Computational Linguistics 2022, 10:539-554
  [36] Ho X, Duong Nguyen A-K, Sugawara S, Aizawa A: Constructing A Multi-hop QA Dataset for Comprehensive Evaluation of Reasoning Steps. In: December 2020; Barcelona, Spain (Online). International Committee on Computational Linguistics: 6609-6625
  [37] Welbl J, Stenetorp P, Riedel S: Constructing Datasets for Multi-hop Reading Comprehension Across Documents. Transactions of the Association for Computational Linguistics 2018, 6:287-302
  [38] Kim Y, Abdulle Y, Wu H: BioHopR: A Benchmark for Multi-Hop, Multi-Answer Reasoning in Biomedical Domain. In: July 2025; Vienna, Austria. Association for Computational Linguistics: 12894-12908
  [39] Ben Abacha A, Mrabet Y, Zhang Y, Shivade C, Langlotz C, Demner-Fushman D: Overview of the MEDIQA 2021 Shared Task on Summarization in the Medical Domain. In: June 2021; Online. Association for Computational Linguistics: 74-85
  [40] Ben Abacha A, Shivade C, Demner-Fushman D: Overview of the MEDIQA 2019 Shared Task on Textual Inference, Question Entailment and Question Answering. In: August 2019; Florence, Italy. Association for Computational Linguistics: 370-379
  [41] Möller T, Reina A, Jayakumar R, Pietsch M: COVID-QA: A Question Answering Dataset for COVID-19. In: July 2020; Online. Association for Computational Linguistics
  [42] Zhu M, Ahuja A, Juan D-C, Wei W, Reddy CK: Question Answering with Long Multiple-Span Answers. In: November 2020; Online. Association for Computational Linguistics: 3840-3849
  [43] Abacha AB, Agichtein E, Pinter Y, Demner-Fushman D: Overview of the Medical Question Answering Task at TREC 2017 LiveQA. In: Text Retrieval Conference: 2017
  [44] Kell G, Roberts A, Umansky S, Khare Y, Ahmed N, Patel N, Simela C, Coumbe J, Rozario J, Griffiths R-R: RealMedQA: A pilot biomedical question answering dataset containing realistic clinical questions. In: AMIA Annual Symposium Proceedings: 2025. 590
  [45] Kim Y, Wu J, Abdulle Y, Wu H: MedExQA: Medical Question Answering Benchmark with Multiple Explanations. In: August 2024; Bangkok, Thailand. Association for Computational Linguistics: 167-181
  [46] Rogoz AC, Ionescu RT, Anghel AV, Antone-Iordache IL, Coniac S, Ionescu AI: A large-scale benchmark for evaluating large language models on medical question answering in Romanian. NPJ Digit Med 2026, 9(1)
  [47] Jin Q, Kim W, Chen Q, Comeau DC, Yeganova L, Wilbur WJ, Lu Z: MedCPT: Contrastive Pre-trained Transformers with large-scale PubMed search logs for zero-shot biomedical information retrieval. Bioinformatics 2023, 39(11)