From Guidelines to Guarantees: A Graph-Based Evaluation Harness for Domain-Specific Evaluation of LLMs

Guillaume Chabot-Couture; Jessica M. Lundin; Usman Nasir Nakakana

arxiv: 2508.20810 · v3 · pith:MKLIFI23new · submitted 2025-08-28 · 💻 cs.AI · cs.CL


Pith reviewed 2026-05-21 23:19 UTC · model grok-4.3

classification 💻 cs.AI cs.CL

keywords graph-based evaluation · LLM benchmarking · clinical guidelines · contamination resistance · knowledge graph · domain-specific evaluation · dynamic benchmark generation


The pith

A graph derived from clinical guidelines generates evaluation questions that guarantee full coverage, contamination resistance, and expert validity for testing LLMs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a method that converts structured guidelines into a knowledge graph and uses graph traversal to create evaluation queries on demand. This produces benchmarks that cover every relationship in the guidelines, vary surface forms to block memorization, and inherit correctness from the original expert source. Applied to WHO IMCI guidelines, the generated questions expose that current models handle symptom identification more reliably than treatment choices or management steps. The approach also allows the entire test set to be regenerated whenever the guidelines are revised.

Core claim

By representing clinical guidelines as a queryable knowledge graph, the harness delivers three guarantees: exhaustive coverage of all guideline relationships through traversal, resistance to surface-form contamination via combinatorial question generation, and retained validity because the graph structure is built directly from expert-authored logic. When used to create multiple-choice questions on symptom recognition, treatment, severity classification, and follow-up, the resulting evaluations reveal consistent performance differences across models on the WHO IMCI guidelines.

What carries the argument

The queryable knowledge graph built from expert guidelines, traversed to instantiate new evaluation queries with combinatorial variation.
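To make the mechanism concrete, here is a minimal sketch of edge-by-edge question generation. It is not the authors' implementation: the `Edge` class, the `TEMPLATES` table, the helper names, and every node name are hypothetical, and only the `TREAT` and `FOLLOW` edge labels are borrowed from the passage quoted further down this page.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    source: str    # node the question is asked about
    relation: str  # edge type, e.g. "TREAT"
    target: str    # node that is the correct answer


# Placeholder graph: node names are invented, not taken from WHO IMCI.
EDGES = [
    Edge("condition_A", "TREAT", "treatment_1"),
    Edge("condition_B", "TREAT", "treatment_2"),
    Edge("condition_C", "TREAT", "treatment_3"),
    Edge("condition_D", "TREAT", "treatment_4"),
    Edge("condition_A", "FOLLOW", "followup_1"),
    Edge("condition_B", "FOLLOW", "followup_2"),
]

# Hypothetical stem templates; several per relation give surface variation.
TEMPLATES = {
    "TREAT": [
        "Which treatment is indicated for {source}?",
        "A child is classified with {source}. What should be given?",
    ],
    "FOLLOW": [
        "What follow-up is recommended for {source}?",
        "After managing {source}, what follow-up applies?",
    ],
}


def make_question(edge, edges, rng, n_distractors=3):
    """Turn one edge into a multiple-choice item."""
    # Every target that is a correct answer for this source and relation.
    valid = {e.target for e in edges
             if e.source == edge.source and e.relation == edge.relation}
    # Distractors: same-relation targets that are NOT valid for this source.
    pool = sorted({e.target for e in edges if e.relation == edge.relation} - valid)
    options = rng.sample(pool, min(n_distractors, len(pool))) + [edge.target]
    rng.shuffle(options)
    stem = rng.choice(TEMPLATES[edge.relation]).format(source=edge.source)
    return {"stem": stem, "options": options, "answer": edge.target, "edge": edge}


def generate(edges, seed):
    """Visit every edge once, so each relationship yields a question."""
    rng = random.Random(seed)
    return [make_question(e, edges, rng) for e in edges]


questions = generate(EDGES, seed=0)
```

Iterating over the full edge list is what gives coverage in this sketch, and changing `seed` changes templates, distractors, and option order while the underlying edge stays the same.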

If this is right

Evaluation data can be regenerated on demand whenever guidelines are updated.
Models exhibit lower accuracy on treatment protocols and clinical management than on symptom recognition.
The method extends to any domain whose decision logic can be expressed as structured relationships.
It supplies a maintainable alternative to static, manually curated benchmark datasets.
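The regeneration point can be illustrated with a small sketch. The paper describes regenerating the evaluation set when guidelines change; the incremental variant below, which diffs two graph versions and touches only affected questions, is an assumption made for illustration. The function names and the triples are hypothetical.

```python
def diff_graphs(old_edges, new_edges):
    """Compare two guideline versions given as (source, relation, target) triples."""
    old, new = set(old_edges), set(new_edges)
    return {"added": sorted(new - old), "removed": sorted(old - new)}


def plan_regeneration(questions, old_edges, new_edges):
    """Split the work into questions to retire and edges needing new questions."""
    delta = diff_graphs(old_edges, new_edges)
    removed = set(delta["removed"])
    retire = [q["id"] for q in questions if q["edge"] in removed]
    return {"retire": retire, "generate_for": delta["added"]}


# Placeholder triples, not real guideline content.
V1 = [("condition_A", "TREAT", "treatment_1"),
      ("condition_B", "TREAT", "treatment_2")]
V2 = [("condition_A", "TREAT", "treatment_1"),
      ("condition_B", "TREAT", "treatment_9")]
QUESTIONS = [{"id": "q1", "edge": V1[0]}, {"id": "q2", "edge": V1[1]}]

plan = plan_regeneration(QUESTIONS, V1, V2)
```

Here the revised recommendation for `condition_B` retires `q2` and queues one new edge for question generation, while `q1` is left alone.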

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Similar graph constructions could support evaluation in legal or regulatory domains that also rely on evolving rule sets.
The framework makes it feasible to track how model performance changes as guidelines themselves evolve over time.
Repeated regeneration could serve as a built-in check against models that overfit to any fixed test distribution.

Load-bearing premise

Expert guidelines can be translated into a graph that captures every relevant clinical relationship and decision without omission or distortion.
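A short sketch shows why this premise matters. It contrasts an edge that carries a guard (a condition on patient context) with a flattened edge that does not. The `GuardedEdge` class, the `cutoff` function, and its numeric values are hypothetical placeholders chosen for illustration; they are neither clinical guidance nor the paper's schema.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Patient = Dict[str, float]


def cutoff(age_months: float) -> float:
    # HYPOTHETICAL cut-offs, chosen only for illustration; not clinical guidance.
    return 50.0 if age_months < 12 else 40.0


@dataclass(frozen=True)
class GuardedEdge:
    source: str
    relation: str
    target: str
    guard: Callable[[Patient], bool]  # condition under which the edge applies


EDGES: List[GuardedEdge] = [
    GuardedEdge("finding_X", "INDICATES", "class_high",
                lambda p: p["measure"] >= cutoff(p["age_months"])),
    GuardedEdge("finding_X", "INDICATES", "class_low",
                lambda p: p["measure"] < cutoff(p["age_months"])),
]


def guarded_targets(edges, source, patient):
    """Follow only the edges whose guard holds for this patient."""
    return [e.target for e in edges if e.source == source and e.guard(patient)]


def flattened_targets(edges, source):
    """Follow every edge, as a graph without guards would."""
    return [e.target for e in edges if e.source == source]


infant = {"age_months": 6, "measure": 45}
toddler = {"age_months": 24, "measure": 45}
```

With guards, the same measurement leads to `class_low` for `infant` and `class_high` for `toddler`; without them, traversal returns both classes for every patient, which is the kind of distortion the premise rules out.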

What would settle it

If regenerating questions from the same graph produces inconsistent accuracy patterns on identical underlying clinical logic, or if independent review finds that generated questions omit decision branches present in the source guidelines.
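The first of these checks could be run along the following lines. This is a sketch under stated assumptions: the result format, the `tolerance` of 0.10, and the sample runs are all made up for illustration and do not come from the paper.

```python
from statistics import mean


def accuracy_by_type(results):
    """results: list of (question_type, is_correct) pairs from one test set."""
    buckets = {}
    for qtype, ok in results:
        buckets.setdefault(qtype, []).append(1.0 if ok else 0.0)
    return {qtype: mean(scores) for qtype, scores in buckets.items()}


def inconsistent_types(runs, tolerance=0.10):
    """Return question types whose accuracy range across regenerations exceeds tolerance."""
    per_run = [accuracy_by_type(r) for r in runs]
    flagged = {}
    for qtype in sorted(set().union(*per_run)):
        accs = [run[qtype] for run in per_run if qtype in run]
        spread = max(accs) - min(accs)
        if spread > tolerance:
            flagged[qtype] = spread
    return flagged


# Made-up outcomes for two regenerated test sets built from the same graph.
run_a = ([("symptom", True)] * 9 + [("symptom", False)]
         + [("treatment", True)] * 6 + [("treatment", False)] * 4)
run_b = ([("symptom", True)] * 9 + [("symptom", False)]
         + [("treatment", True)] * 3 + [("treatment", False)] * 7)

flagged = inconsistent_types([run_a, run_b])
```

In this toy data only the treatment questions are flagged, because their accuracy moves from 0.6 to 0.3 between regenerations while symptom accuracy stays at 0.9.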

Figures

Figures reproduced from arXiv: 2508.20810 by Guillaume Chabot-Couture, Jessica M. Lundin, Usman Nasir Nakakana.

**Figure 2.** Accuracy delta heatmap showing the difference between question-type-specific accuracy

**Figure 3.** An IMCI flowchart [17] for cough or difficulty breathing assessment.

Abstract

Rigorous evaluation of domain-specific language models requires benchmarks that are comprehensive, contamination-resistant, and maintainable. Static, manually curated datasets do not satisfy these properties. We present a graph-based evaluation harness that transforms structured clinical guidelines into a queryable knowledge graph and dynamically instantiates evaluation queries via graph traversal. The framework provides three guarantees: (1) complete coverage of guideline relationships; (2) surface-form contamination resistance through combinatorial variation; and (3) validity inherited from expert-authored graph structure. Applied to the WHO IMCI guidelines, the harness generates clinically grounded multiple-choice questions spanning symptom recognition, treatment, severity classification, and follow-up care. Evaluation across five language models reveals systematic capability gaps. Models perform well on symptom recognition but show lower accuracy on treatment protocols and clinical management decisions. The framework supports continuous regeneration of evaluation data as guidelines evolve and generalizes to domains with structured decision logic. This provides a scalable foundation for evaluation infrastructure.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

Private letter to a colleague

The paper gives a workable graph method to generate fresh, guideline-based test questions for clinical LLMs but leaves the mapping from text to graph and the strength of its three guarantees under-specified.

The core contribution is a harness that converts structured guidelines such as WHO IMCI into a knowledge graph and then walks the graph to produce varied multiple-choice questions. This setup is meant to deliver full coverage of the guideline relations, resist surface contamination by recombining elements, and inherit correctness from the original expert text. The authors ran it on five models and report that symptom recognition is handled better than treatment choices or management decisions. That pattern matches what many people already suspect about current LLMs in medicine, so the result is at least directionally useful. The ability to regenerate the test set when guidelines are updated is also a practical advantage over static benchmarks.

The main weakness is that the paper does not show how conditional branches, severity thresholds, or patient-context rules are turned into graph nodes and edges without flattening or dropping logic. If that step loses nuance, the completeness and validity claims rest on an untested assumption. The abstract mentions results but gives no error analysis, sample questions, or inter-rater checks on the generated items, so it is hard to judge how faithfully the questions reflect real clinical reasoning.

The work is aimed at groups building domain-specific evaluations, especially in healthcare or other fields with written decision protocols. Readers who need concrete ways to keep benchmarks current and less contaminated will find the construction worth examining. It is coherent enough and addresses a genuine gap, so it should go to peer review rather than receive a desk reject; referees can press on the graph-construction details and ask for more quantitative validation of the guarantees.

Referee Report

2 major / 1 minor

Summary. The paper introduces a graph-based evaluation harness that converts structured clinical guidelines (exemplified by WHO IMCI) into a queryable knowledge graph. Evaluation queries are dynamically instantiated through graph traversal to produce multiple-choice questions covering symptom recognition, treatment, severity classification, and follow-up care. The framework asserts three guarantees: (1) complete coverage of guideline relationships, (2) surface-form contamination resistance via combinatorial variation, and (3) validity inherited from the expert-authored graph structure. It evaluates five language models, reports systematic capability gaps (stronger on symptom recognition, weaker on treatment protocols and clinical management), and emphasizes support for continuous regeneration as guidelines evolve, with generalization to other domains possessing structured decision logic.

Significance. If the central claims hold, the work provides a scalable template for domain-specific LLM evaluation that inherits validity from expert sources and resists static-dataset limitations such as contamination and obsolescence. This could meaningfully advance evaluation infrastructure in high-stakes fields like clinical decision support, where maintainable and auditable benchmarks are needed. The approach of deriving queries from an explicit graph rather than manual curation is a constructive direction, though its impact depends on demonstrating faithful encoding of guideline logic.

major comments (2)

[Abstract] Abstract, guarantees (1) and (3): The claims of complete coverage and validity inheritance rest on the assumption that the transformation from guideline text to graph structure encodes all conditional branches, severity thresholds, follow-up criteria, and patient-context factors without loss or distortion. No description of the graph schema, handling of non-relational logic, or completeness audit is supplied, which directly bears on whether the stated guarantees can be substantiated.
[Abstract] Abstract, evaluation paragraph: The statement that evaluation 'reveals systematic capability gaps' with lower accuracy on treatment protocols is presented without quantitative metrics, number of questions, model identities, or error analysis. This absence makes it impossible to assess the magnitude or reproducibility of the reported gaps, which are central to demonstrating the harness's practical value.
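The completeness audit that the first comment asks for can be pictured as a two-step back-mapping: from hand-enumerated guideline branches to graph edges, and from graph edges to generated questions. The sketch below is one hypothetical way to do this; the identifiers, the provenance mapping, and the `audit` function are assumptions, not something the paper is known to provide.

```python
def audit(branch_ids, edge_provenance, question_edges):
    """Report guideline branches with no edge and edges with no question.

    branch_ids: ids of decision branches enumerated by hand from the guideline.
    edge_provenance: dict mapping edge id -> set of branch ids it encodes.
    question_edges: edge id behind each generated question.
    """
    encoded = set()
    for refs in edge_provenance.values():
        encoded.update(refs)
    missing_branches = sorted(set(branch_ids) - encoded)
    unasked_edges = sorted(set(edge_provenance) - set(question_edges))
    return {"missing_branches": missing_branches, "unasked_edges": unasked_edges}


# Placeholder identifiers, not real guideline content.
BRANCHES = {"b1", "b2", "b3"}
EDGE_PROVENANCE = {"e1": {"b1"}, "e2": {"b2"}}
QUESTION_EDGES = ["e1", "e1"]

report = audit(BRANCHES, EDGE_PROVENANCE, QUESTION_EDGES)
```

In this toy case branch `b3` never made it into the graph and edge `e2` never produced a question, so traversal-based coverage alone would not have revealed the first gap.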

minor comments (1)

[Abstract] The abstract would be strengthened by briefly stating the number of generated questions and the exact models evaluated, providing readers with immediate context for the scale of the experiments.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We appreciate the focus on substantiating the core guarantees and improving the clarity of the evaluation claims. We respond to each major comment below and indicate planned revisions.


Referee: [Abstract] Abstract, guarantees (1) and (3): The claims of complete coverage and validity inheritance rest on the assumption that the transformation from guideline text to graph structure encodes all conditional branches, severity thresholds, follow-up criteria, and patient-context factors without loss or distortion. No description of the graph schema, handling of non-relational logic, or completeness audit is supplied, which directly bears on whether the stated guarantees can be substantiated.

Authors: We agree that the abstract would benefit from a concise reference to the supporting details already present in the manuscript. Sections 3.1–3.3 describe the graph schema (nodes for clinical entities such as symptoms, treatments, severity classes, and follow-up actions; directed edges for relations including 'indicates', 'treats', 'contraindicates', and 'follows_up'), the encoding of conditional logic via node attributes and parameterized traversal templates, and the completeness audit performed by systematic back-mapping of generated queries to the source WHO IMCI text. We will revise the abstract to include a brief clause summarizing the schema and audit approach, thereby making the guarantees more directly traceable without altering the manuscript's technical content. revision: yes

Referee: [Abstract] Abstract, evaluation paragraph: The statement that evaluation 'reveals systematic capability gaps' with lower accuracy on treatment protocols is presented without quantitative metrics, number of questions, model identities, or error analysis. This absence makes it impossible to assess the magnitude or reproducibility of the reported gaps, which are central to demonstrating the harness's practical value.

Authors: The abstract is intentionally high-level, with the full quantitative results, question counts, model identities, accuracy tables, and error analysis provided in Section 5. To improve self-containment, we will add a short clause to the abstract reporting the scale of the evaluation (number of questions generated) and the primary observed gap (e.g., higher accuracy on symptom recognition than on treatment selection), while retaining the detailed metrics and analysis in the body of the paper. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation relies on external guidelines


The paper constructs its evaluation harness by transforming external expert-authored clinical guidelines (e.g., WHO IMCI) into a knowledge graph, then uses graph traversal to generate queries. The three guarantees are presented as direct consequences of this construction and combinatorial variation rather than fitted predictions or self-referential definitions. No equations, self-citations, or ansatzes are invoked in a load-bearing manner that reduces the central claims to the paper's own inputs by construction. The method is self-contained against the external guideline source.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that clinical guidelines possess structured decision logic that can be losslessly encoded as a graph and that combinatorial traversal produces valid, non-contaminated queries.

axioms (1)

domain assumption Clinical guidelines can be accurately and completely represented as a directed graph that preserves all relationships and decision logic.
Invoked to support the guarantees of complete coverage and inherited validity.



Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: We transform the WHO IMCI handbook into a directed graph structure... 200+ nodes and 300+ edges... five node types: Condition, Symptom, Treatment, FollowUp, Severity... Four edge types: INDICATES, TREAT, FOLLOW, TRIAGE

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: graph traversal to automatically generate MCQA that ensure complete coverage... 3.3+ trillion possible combinations
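The quoted combination count is given without its derivation on this page. As a rough illustration of how such numbers grow, the sketch below multiplies the independent sources of variation for a single question. The inputs (5 templates, 40 candidate distractors, 4 options) are hypothetical and are not meant to reproduce the paper's figure.

```python
from math import comb, factorial


def variants_per_question(n_templates, distractor_pool, n_distractors=3):
    """Distinct surface forms of one question about one edge."""
    distractor_sets = comb(distractor_pool, n_distractors)  # unordered choices
    orderings = factorial(n_distractors + 1)                # option permutations
    return n_templates * distractor_sets * orderings


# Hypothetical inputs: 5 stem templates, 40 candidate distractors, 4 options.
per_question = variants_per_question(n_templates=5, distractor_pool=40)
```

With these made-up inputs one edge already yields 5 × 9,880 × 24 = 1,185,600 variants, so counts in the trillions across a few hundred edges and richer templates are plausible in order of magnitude.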

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

18 extracted references

[1] Iñigo Alonso, Maite Oronoz, and Rodrigo Agerri. Medexpqa: Multilingual benchmarking of large language models for medical question answering. Artificial Intelligence in Medicine, 155:102938, 2024

[2] Rahul K. Arora, Jason Wei, Rebecca Soskin Hicks, Preston Bowman, Joaquin Quiñonero-Candela, Foivos Tsimpourlas, Michael Sharman, Meghan Shah, Andrea Vallone, Alex Beutel, Johannes Heidecke, and Karan Singhal. Healthbench: Evaluating large language models towards improved human health, 2025

[3] Hang Dong, Jiaoyan Chen, Yuan He, and Ian Horrocks. Ontology enrichment from texts: A biomedical dataset for concept discovery and placement. In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, CIKM ’23, page 5316–5320, New York, NY, USA, 2023. Association for Computing Machinery

[4] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. Proceedings of ICLR, 2021

[5] Di Jin, Eileen Pan, Nassim Oufattole, Wei-Hung Weng, Hanyi Fang, and Peter Szolovits. What disease does this patient have? a large-scale open domain question answering dataset from medical exams. Applied Sciences, 11(14):6421, 2021

[6] Qiao Jin, Bhuwan Dhingra, Zhengping Liu, William W Cohen, and Xinghua Lu. Pubmedqa: A dataset for biomedical research question answering. In Proceedings of EMNLP-IJCNLP, pages 2567–2577, 2019

[7] Jungo Kasai, Yuhei Kasai, Keisuke Sakaguchi, Yutaro Yamada, and Dragomir Radev. Evaluating gpt-4 and chatgpt on japanese medical licensing examinations, 2023

[8] Yanis Labrak, Adrien Bazoge, Richard Dufour, Mickael Rouvier, Emmanuel Morin, Béatrice Daille, and Pierre-Antoine Gourraud. Frenchmedmcqa: A french multiple-choice question answering dataset for medical domain. In Proceedings of the 13th International Workshop on Health Text Mining and Information Analysis (LOUHI), pages 41–46, Abu Dhabi, United Arab Emira...

[9] Harsha Nori, Mayank Daswani, Christopher Kelly, Scott Lundberg, Marco Tulio Ribeiro, Marc Wilson, Xiaoxuan Liu, Viknesh Sounderajah, Jonathan Carlson, Matthew P Lungren, Bay Gross, Peter Hames, Mustafa Suleyman, Dominic King, and Eric Horvitz. Sequential diagnosis with language models, 2025

[10] Tobi Olatunji, Abraham Owodunni, Tassallah Abdullahi, Ayokunmi Ilesanmi, Olalekan Obadun, Aimérou Ndiaye Etori, Ifeoma Okoh, Evans Doe Ocansey, Wendy Kinara, Michael Best, et al. Afrimed-qa: A pan-african, multi-specialty, medical question-answering benchmark dataset. arXiv preprint arXiv:2411.15640, 2024

[11] Ankit Pal, Logesh Kumar Umapathi, and Malaikannan Sankarasubbu. Medmcqa: A large-scale multi-subject multi-choice dataset for medical domain question answering. In Conference on Health, Inference, and Learning, pages 248–260. PMLR, 2022

[12] Ankit Pal, Logesh Kumar Umapathi, and Malaikannan Sankarasubbu. Med-halt: Medical domain hallucination test for large language models, 2023

[13] Anusri Pampari, Preethi Raghavan, Jennifer Liang, and Jian Peng. emrqa: A large corpus for question answering on electronic medical records. In Proceedings of EMNLP, pages 2357–2368, 2018

[14] Sarvesh Soni, Meghana Gudala, Atieh Pajouhi, and Kirk Roberts. Radqa: A question answering dataset to improve comprehension of radiology reports. In Nicoletta Calzolari, Frédéric Béchet, et al., editors, Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 6250–6259, Marseille, France, jun 2022. European Language Resources Asso...

[15] Tao Tu, Anil Palepu, Mike Schaekermann, Khaled Saab, Jan Freyberg, Ryutaro Tanno, Amy Wang, Brenna Li, Mohamed Amin, Nenad Tomasev, Shekoofeh Azizi, Karan Singhal, Yong Cheng, Le Hou, Albert Webson, Kavita Kulkarni, S Sara Mahdavi, Christopher Semturs, Juraj Gottweis, Joelle Barral, Katherine Chou, Greg S Corrado, Yossi Matias, Alan Karthikesalingam, and ... Towards conversational diagnostic ai. Nature, 629(8010):331–338, 2024

[16] David Vilares and Carlos Gómez-Rodríguez. HEAD-QA: A healthcare dataset for complex reasoning. In Anna Korhonen, David Traum, and Lluís Màrquez, editors, Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 960–966, Florence, Italy, jul 2019. Association for Computational Linguistics

[17] WHO Team, Child Health and Development (CHD). Integrated management of childhood illness - chart booklet. Technical report, World Health Organization, March 2014. Technical document

[18] Sheng Zhang, Xin Zhang, Hui Wang, Jiajun Cheng, Pei Li, and Zhaoyun Ding. Chinese medical question answer matching using end-to-end character-level multi-scale cnns. Applied Sciences, 7(8), 2017

A. Model Performance Analysis by Question Type

Figure 2 presents a view of model performance variations across different clinical question types, measured as the d...
