From Guidelines to Guarantees: A Graph-Based Evaluation Harness for Domain-Specific Evaluation of LLMs
Pith reviewed 2026-05-21 23:19 UTC · model grok-4.3
The pith
A graph derived from clinical guidelines generates evaluation questions that guarantee full coverage, contamination resistance, and expert validity for testing LLMs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By representing clinical guidelines as a queryable knowledge graph, the harness delivers three guarantees: exhaustive coverage of all guideline relationships through traversal, resistance to surface-form contamination via combinatorial question generation, and retained validity because the graph structure is built directly from expert-authored logic. When used to create multiple-choice questions on symptom recognition, treatment, severity classification, and follow-up, the resulting evaluations reveal consistent performance differences across models on the WHO IMCI guidelines.
What carries the argument
The queryable knowledge graph built from expert guidelines, traversed to instantiate new evaluation queries with combinatorial variation.
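The traversal idea can be made concrete with a small Python sketch. It is an illustration only, not the paper's implementation: the placeholder entities, the relation names `INDICATES` and `TREATED_BY`, the question templates, and the function `generate_questions` are all assumptions introduced here. The sketch visits every edge once, so every relationship gets a question, and it samples distractors from other targets of the same relation type, so the surface form changes with the seed.

```python
import random
from collections import defaultdict

# Toy (source, relation, target) triples with placeholder names.
# These are hypothetical and carry no clinical meaning.
EDGES = [
    ("symptom_a", "INDICATES", "condition_w"),
    ("symptom_b", "INDICATES", "condition_x"),
    ("symptom_c", "INDICATES", "condition_y"),
    ("symptom_d", "INDICATES", "condition_z"),
    ("condition_w", "TREATED_BY", "treatment_p"),
    ("condition_x", "TREATED_BY", "treatment_q"),
    ("condition_y", "TREATED_BY", "treatment_r"),
    ("condition_z", "TREATED_BY", "treatment_s"),
]

# One hypothetical question template per relation type.
TEMPLATES = {
    "INDICATES": "Which condition is indicated by {src}?",
    "TREATED_BY": "Which treatment is recommended for {src}?",
}


def generate_questions(edges, n_distractors=3, seed=0):
    """Instantiate one multiple-choice question per edge."""
    rng = random.Random(seed)
    targets_by_rel = defaultdict(set)  # candidate answers per relation type
    correct = defaultdict(set)         # every valid answer for (source, relation)
    for src, rel, dst in edges:
        targets_by_rel[rel].add(dst)
        correct[(src, rel)].add(dst)

    questions = []
    for src, rel, dst in edges:
        # Distractors are same-type targets that are not correct for this source.
        pool = sorted(targets_by_rel[rel] - correct[(src, rel)])
        k = min(n_distractors, len(pool))
        options = rng.sample(pool, k) + [dst]
        rng.shuffle(options)
        questions.append({
            "edge": (src, rel, dst),  # provenance back to the graph
            "stem": TEMPLATES[rel].format(src=src),
            "options": options,
            "answer": options.index(dst),
        })
    return questions


for q in generate_questions(EDGES, seed=1)[:2]:
    print(q["stem"], q["options"], "answer index:", q["answer"])
```

Changing `seed` yields different distractor sets and option orderings for the same underlying edge, which is the sense in which regeneration varies surface form while the tested relationship stays fixed.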
If this is right
- Evaluation data can be regenerated on demand whenever guidelines are updated.
- Models exhibit lower accuracy on treatment protocols and clinical management than on symptom recognition.
- The method extends to any domain whose decision logic can be expressed as structured relationships.
- It supplies a maintainable alternative to static, manually curated benchmark datasets.
Where Pith is reading between the lines
- Similar graph constructions could support evaluation in legal or regulatory domains that also rely on evolving rule sets.
- The framework makes it feasible to track how model performance changes as guidelines themselves evolve over time.
- Repeated regeneration could serve as a built-in check against models that overfit to any fixed test distribution.
Load-bearing premise
Expert guidelines can be translated into a graph that captures every relevant clinical relationship and decision without omission or distortion.
What would settle it
The claim would be undermined if regenerating questions from the same graph produced inconsistent accuracy patterns on identical underlying clinical logic, or if independent review found that generated questions omit decision branches present in the source guidelines.
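The first of these checks could be run roughly as follows. This is a hedged Python sketch, assuming each model response is logged with the regeneration round it came from and the graph edge its question tests; the record layout and the name `accuracy_spread` are hypothetical, not part of the paper.

```python
from collections import defaultdict
from statistics import mean


def accuracy_spread(results):
    """Return, per edge, the gap between the best and worst round accuracy.

    results: iterable of (round_id, edge_id, is_correct) records (assumed layout).
    A large gap means the model's answer depends on surface form rather than
    on the underlying relationship.
    """
    per_edge = defaultdict(lambda: defaultdict(list))
    for round_id, edge_id, is_correct in results:
        per_edge[edge_id][round_id].append(1.0 if is_correct else 0.0)

    spread = {}
    for edge_id, rounds in per_edge.items():
        round_accuracies = [mean(scores) for scores in rounds.values()]
        spread[edge_id] = max(round_accuracies) - min(round_accuracies)
    return spread


# Toy log: two regeneration rounds over two edges.
toy_results = [
    (0, "e1", True), (0, "e1", True),
    (1, "e1", True), (1, "e1", False),
    (0, "e2", False),
    (1, "e2", False),
]
print(accuracy_spread(toy_results))
```

An edge that is answered wrongly in every round has zero spread, so this measure separates consistent capability gaps from instability across regenerations.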
Original abstract
Rigorous evaluation of domain-specific language models requires benchmarks that are comprehensive, contamination-resistant, and maintainable. Static, manually curated datasets do not satisfy these properties. We present a graph-based evaluation harness that transforms structured clinical guidelines into a queryable knowledge graph and dynamically instantiates evaluation queries via graph traversal. The framework provides three guarantees: (1) complete coverage of guideline relationships; (2) surface-form contamination resistance through combinatorial variation; and (3) validity inherited from expert-authored graph structure. Applied to the WHO IMCI guidelines, the harness generates clinically grounded multiple-choice questions spanning symptom recognition, treatment, severity classification, and follow-up care. Evaluation across five language models reveals systematic capability gaps. Models perform well on symptom recognition but show lower accuracy on treatment protocols and clinical management decisions. The framework supports continuous regeneration of evaluation data as guidelines evolve and generalizes to domains with structured decision logic. This provides a scalable foundation for evaluation infrastructure.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces a graph-based evaluation harness that converts structured clinical guidelines (exemplified by WHO IMCI) into a queryable knowledge graph. Evaluation queries are dynamically instantiated through graph traversal to produce multiple-choice questions covering symptom recognition, treatment, severity classification, and follow-up care. The framework asserts three guarantees: (1) complete coverage of guideline relationships, (2) surface-form contamination resistance via combinatorial variation, and (3) validity inherited from the expert-authored graph structure. It evaluates five language models, reports systematic capability gaps (stronger on symptom recognition, weaker on treatment protocols and clinical management), and emphasizes support for continuous regeneration as guidelines evolve, with generalization to other domains possessing structured decision logic.
Significance. If the central claims hold, the work provides a scalable template for domain-specific LLM evaluation that inherits validity from expert sources and resists static-dataset limitations such as contamination and obsolescence. This could meaningfully advance evaluation infrastructure in high-stakes fields like clinical decision support, where maintainable and auditable benchmarks are needed. The approach of deriving queries from an explicit graph rather than manual curation is a constructive direction, though its impact depends on demonstrating faithful encoding of guideline logic.
major comments (2)
- [Abstract] Abstract, guarantees (1) and (3): The claims of complete coverage and validity inheritance rest on the assumption that the transformation from guideline text to graph structure encodes all conditional branches, severity thresholds, follow-up criteria, and patient-context factors without loss or distortion. No description of the graph schema, handling of non-relational logic, or completeness audit is supplied, which directly bears on whether the stated guarantees can be substantiated.
- [Abstract] Abstract, evaluation paragraph: The statement that evaluation 'reveals systematic capability gaps' with lower accuracy on treatment protocols is presented without quantitative metrics, number of questions, model identities, or error analysis. This absence makes it impossible to assess the magnitude or reproducibility of the reported gaps, which are central to demonstrating the harness's practical value.
minor comments (1)
- [Abstract] The abstract would be strengthened by briefly stating the number of generated questions and the exact models evaluated, providing readers with immediate context for the scale of the experiments.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. We appreciate the focus on substantiating the core guarantees and improving the clarity of the evaluation claims. We respond to each major comment below and indicate planned revisions.
- Referee: [Abstract] Abstract, guarantees (1) and (3): The claims of complete coverage and validity inheritance rest on the assumption that the transformation from guideline text to graph structure encodes all conditional branches, severity thresholds, follow-up criteria, and patient-context factors without loss or distortion. No description of the graph schema, handling of non-relational logic, or completeness audit is supplied, which directly bears on whether the stated guarantees can be substantiated.
  Authors: We agree that the abstract would benefit from a concise reference to the supporting details already present in the manuscript. Sections 3.1–3.3 describe the graph schema (nodes for clinical entities such as symptoms, treatments, severity classes, and follow-up actions; directed edges for relations including 'indicates', 'treats', 'contraindicates', and 'follows_up'), the encoding of conditional logic via node attributes and parameterized traversal templates, and the completeness audit performed by systematic back-mapping of generated queries to the source WHO IMCI text. We will revise the abstract to include a brief clause summarizing the schema and audit approach, thereby making the guarantees more directly traceable without altering the manuscript's technical content.
  revision: yes
- Referee: [Abstract] Abstract, evaluation paragraph: The statement that evaluation 'reveals systematic capability gaps' with lower accuracy on treatment protocols is presented without quantitative metrics, number of questions, model identities, or error analysis. This absence makes it impossible to assess the magnitude or reproducibility of the reported gaps, which are central to demonstrating the harness's practical value.
  Authors: The abstract is intentionally high-level, with the full quantitative results, question counts, model identities, accuracy tables, and error analysis provided in Section 5. To improve self-containment, we will add a short clause to the abstract reporting the scale of the evaluation (number of questions generated) and the primary observed gap (e.g., higher accuracy on symptom recognition than on treatment selection), while retaining the detailed metrics and analysis in the body of the paper.
  revision: yes
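The audit mentioned in the first response can be illustrated at the graph level. The Python sketch below assumes each generated question stores the edge it was instantiated from; the `edge` field, the function `audit_coverage`, and the toy triples are hypothetical, and only the relation names 'indicates' and 'treats' come from the authors' description.

```python
def audit_coverage(graph_edges, questions):
    """Compare the edges in the graph with the edges that questions trace back to."""
    graph = set(graph_edges)
    covered = {q["edge"] for q in questions}
    return {
        # Fraction of graph edges tested by at least one question.
        "coverage": len(graph & covered) / len(graph) if graph else 1.0,
        # Relationships in the graph that no question exercises.
        "missing": sorted(graph - covered),
        # Questions whose provenance does not match any graph edge.
        "orphaned": sorted(covered - graph),
    }


toy_edges = [
    ("entity_a", "indicates", "entity_b"),
    ("entity_b", "treats", "entity_c"),
]
toy_questions = [
    {"edge": ("entity_a", "indicates", "entity_b")},
    {"edge": ("entity_x", "treats", "entity_y")},
]
report = audit_coverage(toy_edges, toy_questions)
print(report)
```

A check of this kind only verifies that questions cover the graph. Whether the graph itself omits branches of the guideline text, which is the referee's concern, would still need the back-mapping to the source document that the authors describe.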
Circularity Check
No significant circularity; derivation relies on external guidelines
The paper constructs its evaluation harness by transforming external expert-authored clinical guidelines (e.g., WHO IMCI) into a knowledge graph, then uses graph traversal to generate queries. The three guarantees are presented as direct consequences of this construction and combinatorial variation rather than fitted predictions or self-referential definitions. No equations, self-citations, or ansatzes are invoked in a load-bearing manner that reduces the central claims to the paper's own inputs by construction. The method is self-contained against the external guideline source.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Clinical guidelines can be accurately and completely represented as a directed graph that preserves all relationships and decision logic.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean: absolute_floor_iff_bare_distinguishability
  Tag: unclear (relation between the paper passage and the cited Recognition theorem)
  Passage: We transform the WHO IMCI handbook into a directed graph structure... 200+ nodes and 300+ edges... five node types: Condition, Symptom, Treatment, FollowUp, Severity... Four edge types: INDICATES, TREAT, FOLLOW, TRIAGE
- IndisputableMonolith/Foundation/AlexanderDuality.lean: alexander_duality_circle_linking
  Tag: unclear (relation between the paper passage and the cited Recognition theorem)
  Passage: graph traversal to automatically generate MCQA that ensure complete coverage... 3.3+ trillion possible combinations
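To see how a graph with a few hundred edges can yield a very large question space, here is a small counting sketch in Python. It does not reproduce the 3.3+ trillion figure, whose counting scheme is not given in the quoted passage; the counting rule, the function `variants_per_edge`, and all input numbers are assumptions chosen for illustration.

```python
from math import comb, factorial


def variants_per_edge(n_candidates, n_distractors=3, n_templates=1):
    """Count distinct surface forms for one edge under a simple assumed scheme.

    n_candidates: size of the pool distractors are drawn from (hypothetical)
    n_distractors: distractors shown alongside the correct answer
    n_templates: number of alternative question phrasings (hypothetical)
    """
    n_options = n_distractors + 1
    distractor_sets = comb(n_candidates, n_distractors)  # which distractors
    orderings = factorial(n_options)                     # how options are ordered
    return distractor_sets * orderings * n_templates


per_edge = variants_per_edge(n_candidates=20, n_distractors=3, n_templates=5)
total = 300 * per_edge  # assumes 300 edges with identical pool sizes
print(per_edge, total)  # 136800 41040000
```

Even this single-edge scheme grows quickly with pool size. Counts on the scale quoted in the passage would presumably come from a richer scheme, such as questions that combine several edges, but that is a guess rather than something the passage states.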
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
A Model Performance Analysis by Question Type
Figure 2 presents a view of model performance variations across different clinical question types, measured as the d...