
arxiv: 2508.20810 · v3 · pith:MKLIFI23new · submitted 2025-08-28 · 💻 cs.AI · cs.CL

From Guidelines to Guarantees: A Graph-Based Evaluation Harness for Domain-Specific Evaluation of LLMs

Pith reviewed 2026-05-21 23:19 UTC · model grok-4.3

classification 💻 cs.AI cs.CL
keywords graph-based evaluation · LLM benchmarking · clinical guidelines · contamination resistance · knowledge graph · domain-specific evaluation · dynamic benchmark generation

The pith

A graph derived from clinical guidelines generates evaluation questions that guarantee full coverage, contamination resistance, and expert validity for testing LLMs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a method that converts structured guidelines into a knowledge graph and uses graph traversal to create evaluation queries on demand. This produces benchmarks that cover every relationship in the guidelines, vary surface forms to block memorization, and inherit correctness from the original expert source. Applied to WHO IMCI guidelines, the generated questions expose that current models handle symptom identification more reliably than treatment choices or management steps. The approach also allows the entire test set to be regenerated whenever the guidelines are revised.
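The guideline-to-graph-to-question flow described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the triples are placeholder names rather than content from any guideline, and `build_graph`, `generate_questions`, and the question stem are hypothetical.

```python
import random
from collections import defaultdict

# Hypothetical placeholder triples (head, relation, tail); not taken from any guideline.
TRIPLES = [
    ("symptom_1", "indicates", "condition_A"),
    ("symptom_2", "indicates", "condition_B"),
    ("symptom_3", "indicates", "condition_C"),
    ("treatment_X", "treats", "condition_A"),
    ("treatment_Y", "treats", "condition_B"),
    ("treatment_Z", "treats", "condition_C"),
]

def build_graph(triples):
    """Index edges by relation so each relation can be traversed on its own."""
    graph = defaultdict(list)
    for head, relation, tail in triples:
        graph[relation].append((head, tail))
    return graph

def generate_questions(graph, relation, stem, rng, n_options=3):
    """Emit one multiple-choice question per edge of the given relation."""
    edges = graph[relation]
    heads = sorted({h for h, _ in edges})
    questions = []
    for head, tail in edges:
        # Any head that is also valid for this tail must not be used as a distractor.
        correct = {h for h, t in edges if t == tail}
        pool = [h for h in heads if h not in correct]
        options = rng.sample(pool, n_options - 1) + [head]
        rng.shuffle(options)
        questions.append({
            "edge": (head, relation, tail),
            "question": stem.format(tail=tail),
            "options": options,
            "answer": head,
        })
    return questions

graph = build_graph(TRIPLES)
rng = random.Random(0)  # fixed seed so a given test set can be reproduced
questions = generate_questions(
    graph, "treats", "Which treatment is recommended for {tail}?", rng
)
for q in questions:
    print(q["question"], q["options"], "->", q["answer"])
```

Because one question is emitted per edge, coverage of a relation follows from iterating over its edge list, which is the sense in which traversal is said to be exhaustive.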

Core claim

By representing clinical guidelines as a queryable knowledge graph, the harness delivers three guarantees: exhaustive coverage of all guideline relationships through traversal, resistance to surface-form contamination via combinatorial question generation, and retained validity because the graph structure is built directly from expert-authored logic. When used to create multiple-choice questions on symptom recognition, treatment, severity classification, and follow-up, the resulting evaluations reveal consistent performance differences across models on the WHO IMCI guidelines.
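To make "combinatorial question generation" concrete, the sketch below enumerates surface forms for a single graph edge by crossing question templates with option orderings. The templates, option names, and `surface_forms` helper are assumptions for illustration; the paper may vary other dimensions as well.

```python
from itertools import permutations, product

# Hypothetical templates and options for one edge; none come from the paper.
TEMPLATES = [
    "Which treatment is recommended for {condition}?",
    "A child is classified with {condition}. What should be given?",
    "Select the appropriate treatment for {condition}.",
]
CORRECT = "treatment_X"
DISTRACTORS = ["treatment_Y", "treatment_Z", "treatment_W"]

def surface_forms(condition, templates, correct, distractors):
    """Yield every (question text, ordered options) pair for one edge."""
    orderings = permutations([correct] + distractors)
    for template, order in product(templates, orderings):
        yield template.format(condition=condition), list(order)

forms = list(surface_forms("condition_A", TEMPLATES, CORRECT, DISTRACTORS))
print(len(forms))  # 3 templates * 4! option orders = 72
```

The answer key is fixed by the edge, so only wording and option order change between forms. That is why this kind of variation targets memorized surface strings rather than the underlying clinical logic.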

What carries the argument

The queryable knowledge graph built from expert guidelines, traversed to instantiate new evaluation queries with combinatorial variation.

If this is right

  • Evaluation data can be regenerated on demand whenever guidelines are updated.
  • Models exhibit lower accuracy on treatment protocols and clinical management than on symptom recognition.
  • The method extends to any domain whose decision logic can be expressed as structured relationships.
  • It supplies a maintainable alternative to static, manually curated benchmark datasets.
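Regeneration on guideline updates can be pictured as comparing two versions of the edge set. The sketch below is an assumption about how such a check might look, with hypothetical helper names and placeholder edges; it is not described in the paper.

```python
import hashlib

def graph_fingerprint(edges):
    """Stable short hash of the edge set, usable as a version tag for a test set."""
    canonical = "\n".join("|".join(edge) for edge in sorted(set(edges)))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]

def diff_edges(old_edges, new_edges):
    """Edges added to or removed from the graph between two guideline versions."""
    old, new = set(old_edges), set(new_edges)
    return {"added": sorted(new - old), "removed": sorted(old - new)}

# Hypothetical placeholder edges for two guideline versions.
V1 = [("treatment_X", "treats", "condition_A"), ("symptom_1", "indicates", "condition_A")]
V2 = [("treatment_Y", "treats", "condition_A"), ("symptom_1", "indicates", "condition_A")]

changes = diff_edges(V1, V2)
needs_regeneration = graph_fingerprint(V1) != graph_fingerprint(V2)
print(changes, needs_regeneration)
```

Questions instantiated from a removed edge are stale, and questions for added edges have to be generated. Unchanged edges can keep their existing questions or be resampled.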

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar graph constructions could support evaluation in legal or regulatory domains that also rely on evolving rule sets.
  • The framework makes it feasible to track how model performance changes as guidelines themselves evolve over time.
  • Repeated regeneration could serve as a built-in check against models that overfit to any fixed test distribution.

Load-bearing premise

Expert guidelines can be translated into a graph that captures every relevant clinical relationship and decision without omission or distortion.

What would settle it

The claim would be undermined if regenerating questions from the same graph produced inconsistent accuracy patterns on identical underlying clinical logic, or if independent review found that the generated questions omit decision branches present in the source guidelines.
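Both checks can be phrased as small computations. The sketch below assumes each question records the edge it was generated from and that accuracy is measured once per regenerated test set; the numbers are hypothetical and are not results from the paper.

```python
from statistics import mean, pstdev

def uncovered_edges(graph_edges, questions):
    """Edges of the graph that no generated question was instantiated from."""
    used = {q["edge"] for q in questions}
    return sorted(set(graph_edges) - used)

def regeneration_spread(accuracy_by_seed):
    """Mean and population standard deviation of accuracy across regenerations."""
    values = list(accuracy_by_seed.values())
    return mean(values), pstdev(values)

# Hypothetical placeholder data.
EDGES = [("a", "treats", "b"), ("c", "treats", "d"), ("e", "indicates", "b")]
QUESTIONS = [{"edge": ("a", "treats", "b")}, {"edge": ("e", "indicates", "b")}]
ACCURACY_BY_SEED = {0: 0.75, 1: 0.5, 2: 1.0}

missing = uncovered_edges(EDGES, QUESTIONS)
avg, spread = regeneration_spread(ACCURACY_BY_SEED)
print(missing, avg, spread)
```

A non-empty `missing` list would contradict the coverage guarantee for the graph as encoded. It says nothing about branches that never made it into the graph, which is why the independent review against the source guidelines is a separate check.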

Figures

Figures reproduced from arXiv: 2508.20810 by Guillaume Chabot-Couture, Jessica M. Lundin, Usman Nasir Nakakana.

Figure 1: Model accuracy across different question types with 95% confidence intervals.
Figure 2: Accuracy delta heatmap showing the difference between question-type-specific accuracy ...
Figure 3: An IMCI flowchart [17] for cough or difficulty breathing assessment.
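The Figure 1 caption mentions 95% confidence intervals on accuracy but does not say how they were computed. One common choice for a proportion is the Wilson score interval, sketched here as an assumption rather than as the paper's method; the counts in the example are hypothetical.

```python
import math

def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives about 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return centre - half, centre + half

# Hypothetical counts: 80 correct answers out of 100 questions of one type.
low, high = wilson_interval(80, 100)
print(f"{low:.3f} to {high:.3f}")
```

Unlike the normal approximation, the Wilson interval stays inside [0, 1] and behaves sensibly when a question type has few items or accuracy is near the extremes.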

Original abstract

Rigorous evaluation of domain-specific language models requires benchmarks that are comprehensive, contamination-resistant, and maintainable. Static, manually curated datasets do not satisfy these properties. We present a graph-based evaluation harness that transforms structured clinical guidelines into a queryable knowledge graph and dynamically instantiates evaluation queries via graph traversal. The framework provides three guarantees: (1) complete coverage of guideline relationships; (2) surface-form contamination resistance through combinatorial variation; and (3) validity inherited from expert-authored graph structure. Applied to the WHO IMCI guidelines, the harness generates clinically grounded multiple-choice questions spanning symptom recognition, treatment, severity classification, and follow-up care. Evaluation across five language models reveals systematic capability gaps. Models perform well on symptom recognition but show lower accuracy on treatment protocols and clinical management decisions. The framework supports continuous regeneration of evaluation data as guidelines evolve and generalizes to domains with structured decision logic. This provides a scalable foundation for evaluation infrastructure.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces a graph-based evaluation harness that converts structured clinical guidelines (exemplified by WHO IMCI) into a queryable knowledge graph. Evaluation queries are dynamically instantiated through graph traversal to produce multiple-choice questions covering symptom recognition, treatment, severity classification, and follow-up care. The framework asserts three guarantees: (1) complete coverage of guideline relationships, (2) surface-form contamination resistance via combinatorial variation, and (3) validity inherited from the expert-authored graph structure. It evaluates five language models, reports systematic capability gaps (stronger on symptom recognition, weaker on treatment protocols and clinical management), and emphasizes support for continuous regeneration as guidelines evolve, with generalization to other domains possessing structured decision logic.

Significance. If the central claims hold, the work provides a scalable template for domain-specific LLM evaluation that inherits validity from expert sources and resists static-dataset limitations such as contamination and obsolescence. This could meaningfully advance evaluation infrastructure in high-stakes fields like clinical decision support, where maintainable and auditable benchmarks are needed. The approach of deriving queries from an explicit graph rather than manual curation is a constructive direction, though its impact depends on demonstrating faithful encoding of guideline logic.

major comments (2)
  1. [Abstract] Abstract, guarantees (1) and (3): The claims of complete coverage and validity inheritance rest on the assumption that the transformation from guideline text to graph structure encodes all conditional branches, severity thresholds, follow-up criteria, and patient-context factors without loss or distortion. No description of the graph schema, handling of non-relational logic, or completeness audit is supplied, which directly bears on whether the stated guarantees can be substantiated.
  2. [Abstract] Abstract, evaluation paragraph: The statement that evaluation 'reveals systematic capability gaps' with lower accuracy on treatment protocols is presented without quantitative metrics, number of questions, model identities, or error analysis. This absence makes it impossible to assess the magnitude or reproducibility of the reported gaps, which are central to demonstrating the harness's practical value.
minor comments (1)
  1. [Abstract] The abstract would be strengthened by briefly stating the number of generated questions and the exact models evaluated, providing readers with immediate context for the scale of the experiments.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We appreciate the focus on substantiating the core guarantees and improving the clarity of the evaluation claims. We respond to each major comment below and indicate planned revisions.

  1. Referee: [Abstract] Abstract, guarantees (1) and (3): The claims of complete coverage and validity inheritance rest on the assumption that the transformation from guideline text to graph structure encodes all conditional branches, severity thresholds, follow-up criteria, and patient-context factors without loss or distortion. No description of the graph schema, handling of non-relational logic, or completeness audit is supplied, which directly bears on whether the stated guarantees can be substantiated.

    Authors: We agree that the abstract would benefit from a concise reference to the supporting details already present in the manuscript. Sections 3.1–3.3 describe the graph schema (nodes for clinical entities such as symptoms, treatments, severity classes, and follow-up actions; directed edges for relations including 'indicates', 'treats', 'contraindicates', and 'follows_up'), the encoding of conditional logic via node attributes and parameterized traversal templates, and the completeness audit performed by systematic back-mapping of generated queries to the source WHO IMCI text. We will revise the abstract to include a brief clause summarizing the schema and audit approach, thereby making the guarantees more directly traceable without altering the manuscript's technical content.

    revision: yes

  2. Referee: [Abstract] Abstract, evaluation paragraph: The statement that evaluation 'reveals systematic capability gaps' with lower accuracy on treatment protocols is presented without quantitative metrics, number of questions, model identities, or error analysis. This absence makes it impossible to assess the magnitude or reproducibility of the reported gaps, which are central to demonstrating the harness's practical value.

    Authors: The abstract is intentionally high-level, with the full quantitative results, question counts, model identities, accuracy tables, and error analysis provided in Section 5. To improve self-containment, we will add a short clause to the abstract reporting the scale of the evaluation (number of questions generated) and the primary observed gap (e.g., higher accuracy on symptom recognition than on treatment selection), while retaining the detailed metrics and analysis in the body of the paper.

    revision: yes
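The schema named in the first response can be made concrete with a small typed sketch. The node kinds and relation names are the ones quoted in the rebuttal; everything else (the `source_ref` field, `validate`, `audit`, and the placeholder entities) is an assumption for illustration, and the audit here back-maps edges rather than generated queries.

```python
from dataclasses import dataclass

# Node kinds and relation names as listed in the simulated rebuttal.
NODE_KINDS = {"symptom", "treatment", "severity_class", "follow_up_action"}
RELATIONS = {"indicates", "treats", "contraindicates", "follows_up"}

@dataclass(frozen=True)
class Node:
    name: str
    kind: str

@dataclass(frozen=True)
class Edge:
    head: Node
    relation: str
    tail: Node
    source_ref: str = ""  # hypothetical pointer back to a guideline passage

def validate(edge):
    """Return a list of schema problems for one edge (empty if none)."""
    problems = []
    for node in (edge.head, edge.tail):
        if node.kind not in NODE_KINDS:
            problems.append(f"unknown node kind: {node.kind}")
    if edge.relation not in RELATIONS:
        problems.append(f"unknown relation: {edge.relation}")
    if not edge.source_ref:
        problems.append("missing source_ref")
    return problems

def audit(edges):
    """Map each problematic edge to its problems; an empty dict means the audit passed."""
    report = {}
    for edge in edges:
        problems = validate(edge)
        if problems:
            report[edge] = problems
    return report

# Hypothetical placeholder entities and passage identifiers.
symptom = Node("symptom_1", "symptom")
severity = Node("class_A", "severity_class")
drug = Node("treatment_X", "treatment")
edges = [
    Edge(symptom, "indicates", severity, source_ref="passage_001"),
    Edge(drug, "treats", severity, source_ref="passage_002"),
    Edge(drug, "cures", severity),  # deliberately invalid: bad relation, no source
]
report = audit(edges)
print(report[edges[2]])
```

An audit of this kind only shows that every edge is well-formed and traceable. Whether every branch of the guideline has a corresponding edge, which is what the referee asks, still requires reading the source.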

Circularity Check

0 steps flagged

No significant circularity; derivation relies on external guidelines


The paper constructs its evaluation harness by transforming external expert-authored clinical guidelines (e.g., WHO IMCI) into a knowledge graph, then uses graph traversal to generate queries. The three guarantees are presented as direct consequences of this construction and combinatorial variation rather than fitted predictions or self-referential definitions. No equations, self-citations, or ansatzes are invoked in a load-bearing manner that reduces the central claims to the paper's own inputs by construction. The method is self-contained against the external guideline source.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that clinical guidelines possess structured decision logic that can be losslessly encoded as a graph and that combinatorial traversal produces valid, non-contaminated queries.

axioms (1)
  • domain assumption: Clinical guidelines can be accurately and completely represented as a directed graph that preserves all relationships and decision logic.
    Invoked to support the guarantees of complete coverage and inherited validity.

pith-pipeline@v0.9.0 · 5703 in / 1115 out tokens · 48434 ms · 2026-05-21T23:19:00.835913+00:00 · methodology



Reference graph

Works this paper leans on

18 extracted references · 18 canonical work pages

  [1] Iñigo Alonso, Maite Oronoz, and Rodrigo Agerri. Medexpqa: Multilingual benchmarking of large language models for medical question answering. Artificial Intelligence in Medicine, 155:102938, 2024.

  [2] Rahul K. Arora, Jason Wei, Rebecca Soskin Hicks, Preston Bowman, Joaquin Quiñonero-Candela, Foivos Tsimpourlas, Michael Sharman, Meghan Shah, Andrea Vallone, Alex Beutel, Johannes Heidecke, and Karan Singhal. Healthbench: Evaluating large language models towards improved human health, 2025.

  [3] Hang Dong, Jiaoyan Chen, Yuan He, and Ian Horrocks. Ontology enrichment from texts: A biomedical dataset for concept discovery and placement. In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, CIKM ’23, pages 5316–5320, New York, NY, USA, 2023. Association for Computing Machinery.

  [4] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. Proceedings of ICLR, 2021.

  [5] Di Jin, Eileen Pan, Nassim Oufattole, Wei-Hung Weng, Hanyi Fang, and Peter Szolovits. What disease does this patient have? a large-scale open domain question answering dataset from medical exams. Applied Sciences, 11(14):6421, 2021.

  [6] Qiao Jin, Bhuwan Dhingra, Zhengping Liu, William W Cohen, and Xinghua Lu. Pubmedqa: A dataset for biomedical research question answering. In Proceedings of EMNLP-IJCNLP, pages 2567–2577, 2019.

  [7] Jungo Kasai, Yuhei Kasai, Keisuke Sakaguchi, Yutaro Yamada, and Dragomir Radev. Evaluating gpt-4 and chatgpt on japanese medical licensing examinations, 2023.

  [8] Yanis Labrak, Adrien Bazoge, Richard Dufour, Mickael Rouvier, Emmanuel Morin, Béatrice Daille, and Pierre-Antoine Gourraud. Frenchmedmcqa: A french multiple-choice question answering dataset for medical domain. In Proceedings of the 13th International Workshop on Health Text Mining and Information Analysis (LOUHI), pages 41–46, Abu Dhabi, United Arab Emira...

  [9] Harsha Nori, Mayank Daswani, Christopher Kelly, Scott Lundberg, Marco Tulio Ribeiro, Marc Wilson, Xiaoxuan Liu, Viknesh Sounderajah, Jonathan Carlson, Matthew P Lungren, Bay Gross, Peter Hames, Mustafa Suleyman, Dominic King, and Eric Horvitz. Sequential diagnosis with language models, 2025.

  [10] Tobi Olatunji, Abraham Owodunni, Tassallah Abdullahi, Ayokunmi Ilesanmi, Olalekan Obadun, Aimérou Ndiaye Etori, Ifeoma Okoh, Evans Doe Ocansey, Wendy Kinara, Michael Best, et al. Afrimed-qa: A pan-african, multi-specialty, medical question-answering benchmark dataset. arXiv preprint arXiv:2411.15640, 2024.

  [11] Ankit Pal, Logesh Kumar Umapathi, and Malaikannan Sankarasubbu. Medmcqa: A large-scale multi-subject multi-choice dataset for medical domain question answering. In Conference on Health, Inference, and Learning, pages 248–260. PMLR, 2022.

  [12] Ankit Pal, Logesh Kumar Umapathi, and Malaikannan Sankarasubbu. Med-halt: Medical domain hallucination test for large language models, 2023.

  [13] Anusri Pampari, Preethi Raghavan, Jennifer Liang, and Jian Peng. emrqa: A large corpus for question answering on electronic medical records. In Proceedings of EMNLP, pages 2357–2368, 2018.

  [14] Sarvesh Soni, Meghana Gudala, Atieh Pajouhi, and Kirk Roberts. Radqa: A question answering dataset to improve comprehension of radiology reports. In Nicoletta Calzolari, Frédéric Béchet, et al., editors, Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 6250–6259, Marseille, France, jun 2022. European Language Resources Asso...

  [15] Tao Tu, Anil Palepu, Mike Schaekermann, Khaled Saab, Jan Freyberg, Ryutaro Tanno, Amy Wang, Brenna Li, Mohamed Amin, Nenad Tomasev, Shekoofeh Azizi, Karan Singhal, Yong Cheng, Le Hou, Albert Webson, Kavita Kulkarni, S Sara Mahdavi, Christopher Semturs, Juraj Gottweis, Joelle Barral, Katherine Chou, Greg S Corrado, Yossi Matias, Alan Karthikesalingam, and ... Towards conversational diagnostic ai. Nature, 629(8010):331–338, 2024.

  [16] David Vilares and Carlos Gómez-Rodríguez. HEAD-QA: A healthcare dataset for complex reasoning. In Anna Korhonen, David Traum, and Lluís Màrquez, editors, Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 960–966, Florence, Italy, jul 2019. Association for Computational Linguistics.

  [17] WHO Team, Child Health and Development (CHD). Integrated management of childhood illness - chart booklet. Technical report, World Health Organization, March 2014. Technical document.

  [18] Sheng Zhang, Xin Zhang, Hui Wang, Jiajun Cheng, Pei Li, and Zhaoyun Ding. Chinese medical question answer matching using end-to-end character-level multi-scale cnns. Applied Sciences, 7(8), 2017.

A Model Performance Analysis by Question Type

Figure 2 presents a view of model performance variations across different clinical question types, measured as the d...