ClinicalBench: Stress-Testing Assertion-Aware Retrieval for Cross-Admission Clinical QA on MIMIC-IV
Pith reviewed 2026-05-13 03:47 UTC · model grok-4.3 · recognition: 2 Lean theorem links
The pith
Assertion-aware retrieval from patient knowledge graphs improves clinical QA accuracy on MIMIC-IV notes, with a 22-percentage-point lift over the comparison arm on the physician-adjudicated primary endpoint.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
EpiKG attaches an assertion label and a temporality tag to every fact in a patient knowledge graph and routes retrieval according to question intent. Intent-aware KG-RAG improves over a Contriever dense-RAG baseline by 8.84 percentage points on the change-excluded n=362 endpoint, and by 12.43 points under oracle intent. The author-blind primary endpoint, computed on 50 unanimous-strict items rated by two external physicians, shows a +22.0 percentage point lift (95 percent Newcombe CI [+5.1, +31.5], p=0.0192). Physician review found 56 percent of auto-generated reference answers defective.
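The primary-endpoint arithmetic can be checked with a few lines of stdlib Python. The review does not report the discordant-pair counts behind the McNemar test; b=15, c=4 is one split of the 50 unanimous-strict items that reproduces both the +22.0 pp lift and p=0.0192, so treat the numbers below as an illustrative reconstruction, not the paper's data.

```python
from math import comb

def exact_mcnemar(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from discordant pair counts.

    b: pairs where only the treatment arm is correct;
    c: pairs where only the comparison arm is correct.
    Ties (both right or both wrong) do not enter the test.
    """
    n = b + c
    if n == 0:
        return 1.0
    # Exact binomial tail under H0: discordant pairs split 50/50.
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical split consistent with the reported headline numbers:
# 15 - 4 = 11 net wins out of 50 items -> +22.0 percentage points.
b, c, n_items = 15, 4, 50
lift_pp = 100 * (b - c) / n_items
p = exact_mcnemar(b, c)
print(f"lift = +{lift_pp:.1f} pp, exact McNemar p = {p:.4f}")
# -> lift = +22.0 pp, exact McNemar p = 0.0192
```

The same function run on the full 100 adjudicated pairs would give the sensitivity analysis the referee asks for, once those counts are reported.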
What carries the argument
EpiKG, the patient knowledge graph that labels each fact with an assertion and temporality tag and routes retrieval by question intent.
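A minimal sketch of what such a fact record and intent router could look like. Only the 7-value assertion taxonomy and the valid-time/transaction-time distinction come from the paper's notation summary; the field names, the `route` function, and its intent-to-label mapping are illustrative assumptions, not the paper's implementation.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import Optional

class Assertion(Enum):
    """7-value assertion taxonomy from the paper's notation summary."""
    PRESENT = "Pres."
    ABSENT = "Abs."
    POSSIBLE = "Poss."
    CONDITIONAL = "Cond."
    HYPOTHETICAL = "Hypo."
    FAMILY_HISTORY = "Fam.Hx."
    HISTORICAL = "Hist."

@dataclass
class Fact:
    subject: str                  # e.g. a patient identifier
    predicate: str                # e.g. "has_condition"
    value: str                    # e.g. "atrial fibrillation"
    assertion: Assertion          # alpha: assertion label
    valid_from: Optional[date]    # tau_v: when the fact held clinically
    valid_to: Optional[date]
    recorded_at: Optional[date]   # tau_t: transaction time (when noted)

def route(question_intent: str, facts: list[Fact]) -> list[Fact]:
    """Toy intent router (hypothetical): filter facts by assertion label."""
    if question_intent == "negation":
        keep = {Assertion.PRESENT, Assertion.ABSENT}
    elif question_intent == "family_history":
        keep = {Assertion.FAMILY_HISTORY}
    else:
        keep = set(Assertion)
    return [f for f in facts if f.assertion in keep]
```

The point of the sketch: a family-history question never sees the patient's own conditions, and a negation question sees explicit absent facts that dense retrieval over raw note text tends to drop.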
If this is right
- Retrieval that ignores assertion and temporality produces answers contradicted by negated or wrongly timed facts in the EHR notes.
- The accuracy gain is reproducible with a deterministic keyword proxy and holds directionally under three-rater majority.
- Performance improvement shrinks as the standalone LLM baseline rises, with a strong negative correlation across models.
- Over half of pipeline-generated reference answers contain defects that require physician adjudication for usable benchmarks.
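The deterministic keyword proxy mentioned above is not specified in this review; a minimal stand-in, assuming it scores an answer by content-word overlap with the reference, might look like the following. The stopword set and the 0.6 threshold are arbitrary illustration choices.

```python
def keyword_proxy_match(reference: str, answer: str, threshold: float = 0.6) -> bool:
    """Hypothetical deterministic grader: the fraction of the reference's
    content words that also appear in the answer, compared to a threshold."""
    stop = {"the", "a", "an", "of", "in", "on", "is", "was", "and", "or", "to"}
    ref_words = {w for w in reference.lower().split() if w not in stop}
    ans_words = set(answer.lower().split())
    if not ref_words:
        return False
    return len(ref_words & ans_words) / len(ref_words) >= threshold
```

Being deterministic, such a proxy is reproducible by construction, which is why it can serve as a sensitivity check even though it is far noisier than physician adjudication.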
Where Pith is reading between the lines
- Assertion-aware routing could help question answering in other domains that routinely handle negated or time-bound facts.
- The public release of ClinicalBench and the EpiKG outputs lets other groups test new retrieval methods on the same clinical questions.
- As language models continue to improve, specialized retrieval may become less decisive unless the benchmark expands to harder cases.
Load-bearing premise
The nine assertion-sensitive categories and the 43 selected MIMIC-IV patients capture the main real-world failure modes of clinical QA, and the external physician ratings on the 50-item subset are sufficiently consistent and representative.
What would settle it
A replication study on a new set of patients or a different hospital system that finds no significant accuracy difference between assertion-aware KG-RAG and dense retrieval would falsify the central result.
Original abstract
Reasoning benchmarks measure clinical performance on clean inputs. We evaluate the step before reasoning: retrieval over real EHR notes, where negation, temporality, and family-versus-patient attribution can flip a correct answer to a wrong one. EpiKG carries an assertion label and a temporality tag with every fact in a patient knowledge graph, then routes retrieval by question intent. ClinicalBench is a 400-question test over 43 MIMIC-IV patients across 9 assertion-sensitive categories. A 7-condition ablation tests each piece of EpiKG across six LLMs (Claude Opus 4.6, GPT-OSS 20B, MedGemma 27B, Gemma 4 31B, MedGemma 1.5 4B, Qwen 3.5 35B). Three physicians blindly adjudicated 100 paired items. The author-blind primary endpoint, leave-author-out paired exact McNemar on 50 unanimous-strict items rated by two external physicians, yields +22.0 percentage points (95 percent Newcombe CI [+5.1, +31.5], p=0.0192). The architectural novelty, intent-aware KG-RAG over a Contriever dense-RAG baseline (C2b to C4g_kw on the change-excluded n=362 endpoint), is +8.84 percentage points (paired McNemar p=1.79e-3); +12.43 percentage points under oracle intent. Sensitivities agree directionally: three-rater physician majority +24.0 percentage points (subject to single-author circularity); deterministic keyword reproducibility proxy +39.5 percentage points. Across the six models, the gain shrinks as the LLM-alone baseline rises (beta=-1.123, r=-0.921, p=0.009). With n=6 this looks more like regression to the mean than encoding substituting for model size. Physician adjudication identified 56 percent of auto-generated reference answers as defective, a methodological finding indicating that NLP-pipeline clinical-QA benchmarks require physician adjudication to be usable. ClinicalBench, the frozen evaluator, three-rater adjudication data, and the EpiKG output stack are publicly released.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces EpiKG, a patient knowledge graph that augments facts with assertion labels and temporality tags, and ClinicalBench, a 400-question cross-admission QA benchmark over 43 MIMIC-IV patients spanning 9 hand-selected assertion-sensitive categories. It evaluates an intent-aware KG-RAG retrieval method against a Contriever dense-RAG baseline across six LLMs, with the primary author-blind endpoint being a leave-author-out paired exact McNemar test on 50 unanimous-strict items (from 100 physician-adjudicated pairs) showing +22.0 percentage points improvement (95% Newcombe CI [+5.1, +31.5], p=0.0192). Additional results include +8.84 pp over baseline (p=1.79e-3) on the n=362 change-excluded set, +12.43 pp under oracle intent, directional agreement in sensitivities, a regression of gains versus baseline performance across models, and the finding that 56% of auto-generated references are defective per physician review. The benchmark, adjudication data, and EpiKG outputs are released publicly.
Significance. If the primary statistical result holds after addressing subset-selection concerns, the work demonstrates that assertion and temporality handling in retrieval can meaningfully improve clinical QA performance on real EHR notes, where standard dense retrieval often fails on negation, family history, and temporal distinctions. The multi-LLM ablation, leave-author-out design, external rater blinding, and public release of the frozen evaluator plus adjudication data are strengths that support reproducibility and extension. The 56% defective-reference rate is a useful methodological observation for the field. However, the narrow patient and category scope plus the small n=6 for the regression analysis limit broader claims about capturing main real-world failure modes.
major comments (2)
- [Primary endpoint description] Primary endpoint (abstract and Results): The +22.0 pp headline result with p=0.0192 is computed exclusively on the 50 unanimous-strict items from the 100 adjudicated pairs. The manuscript must explicitly document whether this unanimous-strict filter was pre-specified in the analysis plan or applied after inspecting rater disagreements and model outputs; if post-hoc, a sensitivity analysis on the full 100 items (or the entire 400-question set) is required to rule out selection bias that could inflate effect size and significance.
- [Regression-to-the-mean analysis] Regression analysis (Results): The claim that gains shrink with rising LLM-alone baseline (beta=-1.123, r=-0.921, p=0.009) rests on only six models. With n=6 this correlation is underpowered and sensitive to model selection; the interpretation as 'regression to the mean' should be presented with stronger caveats and perhaps supplemented by additional models or a different analysis.
minor comments (1)
- [Abstract and Methods] The abstract and text use abbreviations such as 'C2b to C4g_kw' and 'EpiKG' without immediate definition on first use; ensure all acronyms and condition labels are expanded at first mention for readability.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback emphasizing statistical transparency and appropriate interpretation of exploratory analyses. We address each major comment below with specific revisions to the manuscript.
Point-by-point responses
Referee: [Primary endpoint description] Primary endpoint (abstract and Results): The +22.0 pp headline result with p=0.0192 is computed exclusively on the 50 unanimous-strict items from the 100 adjudicated pairs. The manuscript must explicitly document whether this unanimous-strict filter was pre-specified in the analysis plan or applied after inspecting rater disagreements and model outputs; if post-hoc, a sensitivity analysis on the full 100 items (or the entire 400-question set) is required to rule out selection bias that could inflate effect size and significance.
Authors: We acknowledge that the current manuscript does not explicitly state the pre-specification status of the unanimous-strict filter. The filter was applied to restrict the primary endpoint to items with full rater agreement in order to reduce adjudication noise, but to fully address the concern we will revise the Methods section to document the adjudication protocol and analysis plan. We will also add a sensitivity analysis reporting the paired McNemar results on the full 100 adjudicated pairs and on the entire 400-question set. These changes will be incorporated in the revised version. revision: yes
Referee: [Regression-to-the-mean analysis] Regression analysis (Results): The claim that gains shrink with rising LLM-alone baseline (beta=-1.123, r=-0.921, p=0.009) rests on only six models. With n=6 this correlation is underpowered and sensitive to model selection; the interpretation as 'regression to the mean' should be presented with stronger caveats and perhaps supplemented by additional models or a different analysis.
Authors: We agree that n=6 renders the regression underpowered and sensitive to model choice. In the revision we will strengthen the caveats in the Results and Discussion sections, explicitly noting the small sample size, the exploratory nature of the analysis, and that the observed negative correlation should be viewed as suggestive rather than conclusive evidence of regression to the mean. We will also add a leave-one-out robustness check on the correlation coefficient. While expanding the model set would be desirable, the current six models already span a wide range of sizes and families; we therefore treat this as a partial revision focused on clearer qualification of the finding. revision: partial
Circularity Check
Minor circularity flagged by paper in one sensitivity analysis; primary endpoint uses external raters and is independent.
specific steps (0)
other
[Abstract]
"three-rater physician majority +24.0 percentage points (subject to single-author circularity)"
The paper explicitly flags this sensitivity result as subject to single-author circularity, indicating the three-rater adjudication process incorporates the author's own judgments and thereby reduces the independence of that particular evaluation.
full rationale
The paper's central claim (leave-author-out McNemar on 50 unanimous-strict items by two external physicians) relies on independent external adjudication and is not reduced to the author's inputs by construction. The 56% defective-reference observation is presented as a methodological finding rather than a derived prediction. No self-citations, fitted inputs renamed as predictions, ansatzes, or uniqueness theorems appear. The paper itself explicitly notes the three-rater majority sensitivity as subject to single-author circularity, which is minor and non-load-bearing for the headline result. The derivation is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: MIMIC-IV patient notes contain the full range of negation, temporality, and attribution phenomena that cause clinical QA errors in practice
invented entities (1)
- EpiKG: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · relevance unclear · matched text: "EpiKG carries an assertion label and a temporality tag with every fact... routes retrieval by question intent... 7-value assertion taxonomy α ∈ {Pres., Abs., Poss., Cond., Hypo., Fam.Hx., Hist.}"
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · relevance unclear · matched text: "leave-author-out paired exact McNemar on 50 unanimous-strict items... +22.0 percentage points"
Reference graph
Works this paper leans on
- [1] Karan Singhal, Shekoofeh Azizi, Tao Tu, S. Sara Mahdavi, Jason Wei, Hyung Won Chung, Nathan Scales, Ajay Tanwani, Heather Cole-Lewis, Stephen Pfohl, et al. Towards expert-level medical question answering with large language models. Nature, 620:399–404, 2023. doi: 10.1038/s41586-023-06291-2
- [2] Khaled Saab, Tao Tu, Xavier Amatriain, et al. Capabilities of Gemini models in medicine. arXiv preprint arXiv:2404.18416, 2024.
- [3] Tao Tu, Anil Palepu, Mike Schaekermann, Khaled Saab, Jan Freyberg, Ryutaro Tanno, Amy Wang, Brenna Li, Mohamed Amin, Nenad Tober, et al. Towards conversational diagnostic AI. Nature, 2024.
- [4] Özlem Uzuner, Brett R. South, Shuying Shen, and Scott L. DuVall. 2010 i2b2/VA challenge on concepts, assertions, and relations in clinical text. Journal of the American Medical Informatics Association, 18(5):552–556, 2011.
- [5] Veysel Kocaman, Yigit Gul, M. Aytug Kaya, Hasham Ul Haq, Mehmet Butgul, Cabir Celik, and David Talby. Beyond negation detection: Comprehensive assertion detection models for clinical NLP. In Text2Story Workshop at the European Conference on Information Retrieval (ECIR), 2025.
- [6] OHDSI Collaborative. OMOP common data model v5.4. Observational Health Data Sciences and Informatics, 2024.
- [7] Darren Edge et al. From local to global: A graph RAG approach to query-focused summarization. arXiv preprint arXiv:2404.16130, 2024.
- [8] Tianjun Luo et al. GFM-RAG: Graph foundation model for retrieval augmented generation. In Advances in Neural Information Processing Systems, 2025.
- [9] Peng Jiang et al. KARE: Knowledge graph augmented reasoning via LLMs for clinical decision support. In International Conference on Learning Representations, 2025.
- [10] Junde Wu et al. Medical-Graph-RAG: Towards safe medical large language model via graph retrieval-augmented generation. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics, 2025.
- [11] Preston Rasmussen, Pavlo Paliychuk, Travis Beauvais, Jack Ryan, and Daniel Chalef. Zep: A temporal knowledge graph architecture for agent memory. arXiv preprint arXiv:2501.13956, 2025.
- [12] Marco Postiglione, Daniel Bean, Zeljko Kraljevic, Richard Dobson, and Vincenzo Moscato. Predicting future disorders via temporal knowledge graphs and medical ontologies. IEEE Journal of Biomedical and Health Informatics, 28(7):4238–4248, 2024. doi: 10.1109/JBHI.2024.3390419
- [13] OpenAI. HealthBench Professional: Evaluating clinical reasoning in large language models. https://openai.com/research/healthbench, 2026. Accessed 26 April 2026.
- [14] Di Jin, Eileen Pan, Nassim Oufattole, Wei-Hung Weng, Hanyi Fang, and Peter Szolovits. What disease does this patient have? A large-scale open domain question answering dataset from medical exams. Applied Sciences, 11(14):6421, 2021. doi: 10.3390/app11146421
- [15] Guangzhi Xiong et al. MIRAGE: Medical information retrieval-augmented generation evaluation. In Findings of the Association for Computational Linguistics: ACL, 2024.
- [16] Anusri Pampari, Preethi Raghavan, Jennifer Liang, and Jian Peng. emrQA: A large corpus for question answering on electronic medical records. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 2357–2368, 2018.
- [17] Wei Chen et al. Multi-LLM KG-RAG: End-to-end clinical knowledge graph construction. arXiv preprint arXiv:2601.01844, 2026.
- [18] Lang Li et al. AutoRD: An automatic and end-to-end system for rare disease knowledge graph construction. JMIR Medical Informatics, 12, 2024.
- [19] Rakhilya Lee Mekhtieva, Brandon Forbes, Dalal Alrajeh, Brendan Delaney, and Alessandra Russo. RECAP-KG: Mining knowledge graphs from raw GP notes for remote COVID-19 assessment in primary care. arXiv preprint arXiv:2306.17175, 2023.
- [20] Wendy W. Chapman, Will Bridewell, Paul Hanbury, Gregory F. Cooper, and Bruce G. Buchanan. A simple algorithm for identifying negated findings and diseases in discharge summaries. Journal of Biomedical Informatics, 34(5):301–310, 2001.
- [21] Henk Harkema, John N. Dowling, Tyler Thornblade, and Wendy W. Chapman. ConText: An algorithm for determining negation, experiencer, and temporal status from clinical reports. Journal of Biomedical Informatics, 42(5):839–851, 2009.
- [22] Claude E. Shannon. A mathematical theory of communication. Bell System Technical Journal, 27(3):379–423, 1948.
- [23] Yixing Jiang, Kameron C. Black, Gloria Geng, Danny Park, James Zou, Andrew Y. Ng, and Jonathan H. Chen. MedAgentBench: A virtual EHR environment to benchmark medical LLM agents. NEJM AI, 2(9), 2025. doi: 10.1056/AIdbp2500144
- [24] Alistair E.W. Johnson, Lucas Bulgarelli, Lu Shen, et al. MIMIC-IV, a freely accessible electronic health record dataset. Scientific Data, 10:1, 2023.
- [25] Bradley Efron and Robert J. Tibshirani. An Introduction to the Bootstrap. Chapman and Hall/CRC, 1993.
- [26] A. Colin Cameron and Douglas L. Miller. A practitioner's guide to cluster-robust inference. Journal of Human Resources, 50(2):317–372, 2015.
- [27] Quinn McNemar. Note on the sampling error of the difference between correlated proportions or percentages. Psychometrika, 12(2):153–157, 1947.
- [28] J. Richard Landis and Gary G. Koch. The measurement of observer agreement for categorical data. Biometrics, 33(1):159–174, 1977.
- [29] Richard Berk, Lawrence Brown, Andreas Buja, Kai Zhang, and Linda Zhao. Valid post-selection inference. Annals of Statistics, 41(2):802–837, 2013.
- [30] Andre Bittar, Sumithra Velupillai, Johnny Downs, Rosemary Sedgwick, and Rina Dutta. Portability of natural language processing methods to detect suicidality from unstructured clinical text in US and UK electronic health records. Journal of the American Medical Informatics Association Open, 6(3):ooad078, 2023. doi: 10.1093/jamiaopen/ooad078
- [31] Nigam H. Shah, John D. Halamka, Suchi Saria, Michael Pencina, Troy Tazbaz, Micky Tripathi, Alison Callahan, Hailey Hildahl, and Brian Anderson. A nationwide network of health AI assurance laboratories. JAMA, 331(3):245–249, 2024. doi: 10.1001/jama.2023.26930
- [32] Peter J. Embi. Algorithmovigilance: advancing methods to analyze and monitor artificial intelligence-driven health care for effectiveness and equity. JAMA Network Open, 4(4):e214622, 2021. doi: 10.1001/jamanetworkopen.2021.4622
- [33] Alex Stinard. [dataset] ClinicalBench: Assertion-sensitive clinical question answering benchmark. https://huggingface.co/datasets/alexstinard/epikg-clinicalbench, 2026. Accessed 25 April 2026.
- [34] Werner Ceusters and Barry Smith. Aboutness: Towards foundations for the information artifact ontology. International Conference on Biomedical Ontology, 2015.
- [35] Richard T. Snodgrass. Developing Time-Oriented Database Applications in SQL. Morgan Kaufmann, 2000.
- [36] James F. Allen. Maintaining knowledge about temporal intervals. Communications of the ACM, 26(11):832–843, 1983.
- [37] Fei Li, Jianfu Hong, Cui Tao, et al. TEO: A time event ontology for clinical narratives. Journal of the American Medical Informatics Association, 27(10):1560–1568, 2020.
- [38] Yan Huang, Xiaojin Li, and Guo-Qiang Zhang. Temporal cohort logic. AMIA Annual Symposium Proceedings, 2022:1237–1246, 2023.
- [39] Yifan Lu, Tianyu Fu, et al. DoctorRAG: Medical RAG emulating doctor-like reasoning. In Advances in Neural Information Processing Systems, 2025.
- [40] Yusheng Wang et al. MedRAG: Enhancing medical diagnosis through retrieval-augmented generation with knowledge graph-elicited reasoning. Proceedings of The Web Conference, 2025.
- [42] Boci Peng, Yun Zhu, Yongchao Liu, Xiaohe Bo, Haizhou Shi, Chuntao Hong, Yan Zhang, and Siliang Tang. Graph retrieval-augmented generation: A survey. ACM Transactions on Information Systems. doi: 10.1145/3777378
- [44] Lang Cao, Qingyu Chen, and Yue Guo. EHR-RAG: Bridging long-horizon structured electronic health records and large language models via enhanced retrieval-augmented generation. arXiv preprint arXiv:2601.21340, 2026.
- [45] Yanjun Gao, Ruizhe Li, Emma Croxford, John Caskey, Brian W. Patterson, Matthew Churpek, Timothy Miller, Dmitriy Dligach, and Majid Afshar. Leveraging medical knowledge graphs into large language models for diagnosis prediction: Design and application study. JMIR AI, 4(1):e58670, 2025. doi: 10.2196/58670
- [46] Gautier Izacard, Mathilde Caron, Lucas Hosseini, Sebastian Riedel, Piotr Bojanowski, Armand Joulin, and Edouard Grave. Unsupervised dense information retrieval with contrastive learning. Transactions on Machine Learning Research, 2022.
- [47] Yifan Peng, Xiaosong Wang, Le Lu, Mohammadhadi Bagheri, Ronald Summers, and Zhiyong Lu. NegBio: A high-performance tool for negation and uncertainty detection in radiology reports. AMIA Summits on Translational Science Proceedings, 2018.

Notation summary (fragment extracted alongside [47]): α ∈ A, the assertion label (7 values, Eq. 1); τ_v, the valid time (event date, valid from/to); τ_t, the transaction time.
- [48] Reference answers systematically wrong for medication change questions (103 notes mentioning reference-answer issues): every reference answer in the change category conflates inpatient medication orders (heparin, IV antibiotics, CIWA protocol) with discharge medications.
- [49] C1 hallucinates from limited context (15 notes, exclusively C1): without retrieval, C1 fabricates admission IDs, medication names, and clinical scenarios. Zero C4g items received this complaint.
- [50] NLP assertion classifier propagates errors (8 notes): boilerplate discharge instructions ("call if fever > 101.5") tagged as clinical findings; "h/o recently diagnosed metastatic cancer" tagged as historical.
- [51] Safety-critical errors (10 items flagged as potentially harmful): code status errors (model "hallucinates DNR confirmation" when the chart says full code), active cancer missed from the medication list, anticoagulation misclassified.
- [52] Model praised when reference answers were wrong (63 notes): the reviewer noted the model gave clinically correct answers that the automated benchmark reference penalized. X.5 Reference Answer Version History: the ClinicalBench reference answers have undergone iterative refinement. v1 (auto-generated reference set): reference answers created by LLM from MIMIC-IV notes via ...
- [53] NLP assertion classifier error (28 questions, 36%): the dominant failure. Manifests as: "history of heart failure" → "heart failure is resolved" (the clinical idiom means an active chronic condition); "edema, likely due to noncompliance" → "edema is uncertain" (causal vs. existential uncertainty conflation); experiencer tag reversal (patient's atrial fibrillation labeled as fam...
- [54] Wrong answer / inverted truth (16 questions, 21%): the reference answer states the opposite of the chart. Example: the reference says pitting edema is absent when the PE documents "2+ pitting edema bilaterally."
- [55] Non-clinical entity extraction (11 questions, 14%): NLP extracted boilerplate ("call if fever > 101"), devices (Foley catheter as diagnosis), lab values (blood sugar as diagnosis), or section headers ("Allergies" as a medical condition).
- [56] Medication list conflation (10 questions, 13%): change questions compared the wrong lists: inpatient orders (heparin, IV antibiotics) vs. discharge medications, or admission med-rec vs. the discharge list. PRN-only medications (CIWA Valium) counted as prescribed.
- [57] Fabricated temporal relationship (8 questions, 10%): sequence questions claimed an ordering not supported by the chart, with both conditions in the same admission and no temporal anchoring, or based on NLP-extracted entities from negated text. X.7 Impact on Reported Numbers: because the detected defects are question-level rather than condition-specific, they are le...
discussion (0)