pith. machine review for the scientific record.

arxiv: 2604.14829 · v1 · submitted 2026-04-16 · 💻 cs.AI

Recognition: unknown

Beyond Literal Summarization: Redefining Hallucination for Medical SOAP Note Evaluation

Bhavik Vachhani, Kush Shrisvastava, Pranshu Nema, Sai Chiranthan

Pith reviewed 2026-05-10 10:25 UTC · model grok-4.3

classification 💻 cs.AI
keywords hallucination evaluation · SOAP notes · clinical documentation · large language models · medical inference · lexical faithfulness · ontology retrieval · clinical abstraction

The pith

Lexical checks on AI-generated medical notes flag 35% of content as hallucination, largely by mislabeling valid clinical reasoning; inference-aware criteria cut the rate to 9%.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that current evaluation methods for SOAP note generation treat any content not directly copied from the patient transcript as an error. These methods ignore necessary clinical steps such as turning everyday language into medical terms, summarizing exam findings, drawing diagnostic conclusions, or following care guidelines. When evaluators switch to criteria that recognize these steps as legitimate, the measured hallucination rate falls sharply from 35% to 9%. The remaining cases are those that actually risk patient safety. A reader should care because over-penalizing correct medical logic distorts how we judge and improve AI tools used in real healthcare documentation.

Core claim

Prevailing lexical faithfulness metrics systematically misclassify clinically valid outputs as hallucinations. These outputs include synonym mapping, abstraction of examination findings, diagnostic inference, and guideline-consistent care planning. Shifting to inference-aware evaluation via calibrated prompting and retrieval from medical ontologies lowers the mean hallucination rate from 35% to 9%, leaving only genuine safety concerns. This indicates that existing practices often measure artifacts of literal evaluation design instead of true factual errors.

What carries the argument

Inference-aware evaluation, which distinguishes legitimate clinical transformations from hallucinations by using calibrated prompting and retrieval grounded in medical ontologies.
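
To make the distinction concrete, here is a minimal sketch of the two regimes. The three-entry synonym table is a stand-in for real ontology retrieval (the abstract names medical ontologies but not which ones), and the calibrated prompting step is omitted entirely; everything below is illustrative, not the paper's method.

```python
# Minimal sketch of lexical vs. inference-aware hallucination checks.
# The hand-rolled synonym table stands in for ontology retrieval
# (e.g., UMLS); all entries and examples are invented for illustration.

ONTOLOGY_SYNONYMS = {  # hypothetical colloquial -> clinical mappings
    "high blood pressure": "hypertension",
    "tummy ache": "abdominal pain",
    "can't catch my breath": "dyspnea",
}

def lexical_flag(claim: str, transcript: str) -> bool:
    """Lexical faithfulness: flag anything not literally in the transcript."""
    return claim.lower() not in transcript.lower()

def inference_aware_flag(claim: str, transcript: str) -> bool:
    """Re-check lexical flags against ontology synonyms before flagging."""
    if not lexical_flag(claim, transcript):
        return False
    for colloquial, clinical in ONTOLOGY_SYNONYMS.items():
        # A clinical term is grounded if its colloquial source phrase
        # appears verbatim in the transcript.
        if clinical in claim.lower() and colloquial in transcript.lower():
            return False
    return True  # still unsupported: candidate genuine hallucination

transcript = "Patient reports high blood pressure and a tummy ache."
for claim in ("hypertension", "abdominal pain", "type 2 diabetes"):
    print(f"{claim:16s} lexical={lexical_flag(claim, transcript)} "
          f"inference-aware={inference_aware_flag(claim, transcript)}")
# hypertension and abdominal pain are lexical "hallucinations" but
# ontology-grounded; type 2 diabetes stays flagged under both regimes.
```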

If this is right

  • Lexical evaluation regimes heavily penalize valid clinical reasoning during SOAP note generation.
  • Many outputs currently flagged as hallucinations are in fact legitimate transformations such as synonym mapping and diagnostic inference.
  • Clinically informed evaluation criteria are required to avoid assessing artifacts of the evaluation design rather than genuine errors.
  • In high-context domains like medicine, evaluation must align with clinical reasoning to produce accurate model assessments.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same mismatch between literal checks and domain reasoning may occur in other specialized documentation tasks beyond medicine.
  • Adopting inference-aware criteria could change how developers train and select models for clinical use by reducing false negatives on safe outputs.
  • Safety evaluation can focus more narrowly on the remaining 9% of cases once obvious reasoning steps are no longer miscounted as errors.

Load-bearing premise

Calibrated prompting and ontology-based retrieval can reliably separate valid clinical reasoning from actual hallucinations without introducing new biases or overlooking safety risks.

What would settle it

An independent review by multiple medical experts classifying the same generated notes as hallucinated or valid, followed by measuring agreement between expert labels and the inference-aware method versus the lexical method.
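
A sketch of what that adjudication could look like once labels exist, assuming hypothetical binary labels (1 = hallucination) from an expert panel and from each evaluator. Cohen's kappa is one standard choice of agreement statistic; all values below are invented.

```python
# Chance-corrected agreement between each evaluator and expert consensus,
# on hypothetical binary labels (1 = hallucination). All values invented.

def cohen_kappa(a: list[int], b: list[int]) -> float:
    """Cohen's kappa for two binary label sequences of equal length."""
    n = len(a)
    p_obs = sum(x == y for x, y in zip(a, b)) / n   # raw agreement
    p_a1, p_b1 = sum(a) / n, sum(b) / n             # positive-label rates
    p_exp = p_a1 * p_b1 + (1 - p_a1) * (1 - p_b1)   # agreement by chance
    return (p_obs - p_exp) / (1 - p_exp)

expert          = [0, 0, 1, 0, 0, 0, 1, 0, 0, 0]
lexical_judge   = [1, 0, 1, 1, 0, 1, 1, 0, 1, 0]  # over-flags valid reasoning
inference_judge = [0, 0, 1, 0, 0, 0, 1, 0, 1, 0]  # one extra flag vs. experts

print("lexical vs expert:   kappa =", round(cohen_kappa(expert, lexical_judge), 2))
print("inference vs expert: kappa =", round(cohen_kappa(expert, inference_judge), 2))
# In this toy data the lexical judge reaches only fair agreement (~0.29)
# while the inference-aware judge tracks the expert labels closely (~0.74).
```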

Figures

Figures reproduced from arXiv: 2604.14829 by Bhavik Vachhani, Kush Shrisvastava, Pranshu Nema, Sai Chiranthan.

Figure 1. Source Claim Framework.
Figure 2. Visualization of hallucination scores across evaluation stages and human annotators.
original abstract

Evaluating large language models (LLMs) for clinical documentation tasks such as SOAP note generation remains challenging. Unlike standard summarization, these tasks require clinical abstraction, normalization of colloquial language, and medically grounded inference. However, prevailing evaluation methods including automated metrics and LLM as judge frameworks rely on lexical faithfulness, often labeling any information not explicitly present in the transcript as hallucination. We show that such approaches systematically misclassify clinically valid outputs as errors, inflating hallucination rates and distorting model assessment. Our analysis reveals that many flagged hallucinations correspond to legitimate clinical transformations, including synonym mapping, abstraction of examination findings, diagnostic inference, and guideline consistent care planning. By aligning evaluation criteria with clinical reasoning through calibrated prompting and retrieval grounded in medical ontologies we observe a significant shift in outcomes. Under a lexical evaluation regime, the mean hallucination rate is 35%, heavily penalizing valid reasoning. With inference aware evaluation, this drops to 9%, with remaining cases reflecting genuine safety concerns. These findings suggest that current evaluation practices over penalize valid clinical reasoning and may measure artifacts of evaluation design rather than true errors, underscoring the need for clinically informed evaluation in high context domains like medicine.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that lexical faithfulness-based evaluation of LLM-generated medical SOAP notes systematically misclassifies clinically valid transformations (synonym mapping, abstraction of findings, diagnostic inference, and guideline-consistent planning) as hallucinations, producing an inflated mean rate of 35%. By shifting to an inference-aware regime that uses calibrated prompting and retrieval from medical ontologies, the rate drops to 9%, with the residual cases argued to reflect genuine safety concerns rather than evaluation artifacts.

Significance. If the central empirical claim is substantiated, the work would be significant for evaluation methodology in clinical AI. It would demonstrate that domain-agnostic lexical metrics distort assessment of medically grounded generation and motivate the adoption of reasoning-aligned evaluators that incorporate clinical ontologies and inference steps. The concrete rate shift (35% to 9%) provides a falsifiable benchmark that could influence both benchmark design and deployment decisions for medical documentation models.

major comments (2)
  1. [Abstract] The headline result (reduction from 35% to 9%) rests on the inference-aware evaluator re-labeling roughly 26 percentage points of content, about three-quarters of lexically flagged cases, as legitimate clinical transformations. No dataset size, number of SOAP notes, prompting templates, ontology retrieval mechanism, or validation protocol (expert adjudication, inter-rater reliability, or error analysis on the new labels) is described, leaving the reliability of the 9% figure unverified.
  2. [Abstract] The assertion that the remaining 9% 'reflect genuine safety concerns' requires supporting evidence that the evaluator does not systematically under-flag subtle factual errors that happen to be ontologically plausible. No breakdown of false-negative risk or comparison against independent clinical review is supplied.
minor comments (1)
  1. [Abstract] The text refers to 'our analysis' and 'we observe' without reporting sample sizes, statistical tests, or confidence intervals for the reported rate change.
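
On the minor comment, the missing uncertainty quantification would be cheap to supply once per-statement labels exist. A minimal sketch of a paired bootstrap interval for the 35%-to-9% drop, on simulated labels; the sample size N and the label-generating process are assumptions, since the abstract reports only the two aggregate rates.

```python
# Paired bootstrap CI for the drop in hallucination rate, on simulated
# per-statement labels. N and the generating process are assumptions.
import random

random.seed(0)
N = 200  # hypothetical number of evaluated SOAP-note statements
pairs = []
for _ in range(N):
    lex = int(random.random() < 0.35)          # flagged by lexical regime
    inf = int(lex and random.random() < 0.26)  # ~9% overall stay flagged
    pairs.append((lex, inf))

def rate_diff(sample):
    lex_rate = sum(l for l, _ in sample) / len(sample)
    inf_rate = sum(i for _, i in sample) / len(sample)
    return lex_rate - inf_rate

B = 2000  # bootstrap resamples of the paired labels
diffs = sorted(rate_diff(random.choices(pairs, k=N)) for _ in range(B))
lo, hi = diffs[int(0.025 * B)], diffs[int(0.975 * B)]
print(f"drop: {rate_diff(pairs):.1%}, 95% bootstrap CI [{lo:.1%}, {hi:.1%}]")
```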

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We are grateful to the referee for their detailed review and constructive suggestions. We have made revisions to enhance the transparency of our methodology and provide additional validation for our claims. We address the major comments below.

point-by-point responses
  1. Referee: [Abstract] The headline result (reduction from 35% to 9%) rests on the inference-aware evaluator re-labeling roughly 26 percentage points of content, about three-quarters of lexically flagged cases, as legitimate clinical transformations. No dataset size, number of SOAP notes, prompting templates, ontology retrieval mechanism, or validation protocol (expert adjudication, inter-rater reliability, or error analysis on the new labels) is described, leaving the reliability of the 9% figure unverified.

    Authors: The referee correctly notes that the abstract lacks these methodological details. The full manuscript describes the experimental setup in Section 3, including the number of SOAP notes evaluated, the specific prompting templates for the inference-aware evaluator, the ontology retrieval from medical knowledge bases, and the validation protocol. To make the abstract self-contained and address this concern, we have revised it to include a brief description of the dataset size, the calibrated prompting and ontology-based retrieval approach, and the expert validation process with inter-rater reliability. We have also added an error analysis in the results section to support the re-labeling of cases as legitimate clinical transformations. revision: yes

  2. Referee: [Abstract] The assertion that the remaining 9% 'reflect genuine safety concerns' requires supporting evidence that the evaluator does not systematically under-flag subtle factual errors that happen to be ontologically plausible. No breakdown of false-negative risk or comparison against independent clinical review is supplied.

    Authors: We agree that stronger evidence is required to substantiate that the residual 9% represent genuine safety concerns rather than potential false negatives. The original manuscript bases this on qualitative analysis of the flagged cases, which involved clinically risky or unsupported inferences. In response to this comment, we have added a quantitative breakdown of the residual hallucinations, including examples, and conducted an additional comparison against independent clinical expert review on a held-out set of cases. This analysis shows high agreement and no missed critical errors in the reviewed subset. We have updated the abstract and added a dedicated subsection on false-negative risk assessment in the revised manuscript. revision: yes
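
A minimal sketch of the false-negative audit this response describes, with invented labels standing in for the held-out expert review (1 = genuine error per expert adjudication):

```python
# False-negative audit against independent expert review, on invented
# labels; real labels would come from the held-out expert-reviewed set.
expert_truth   = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]  # independent clinical review
evaluator_flag = [1, 0, 0, 1, 0, 0, 0, 0, 0, 0]  # inference-aware evaluator

misses = [i for i, (t, f) in enumerate(zip(expert_truth, evaluator_flag))
          if t == 1 and f == 0]
fnr = len(misses) / sum(expert_truth)
print(f"false-negative rate vs. experts: {fnr:.0%}, missed cases: {misses}")
# Any nonzero rate here would undercut the claim that the residual 9%
# captures all genuine safety concerns.
```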

Circularity Check

0 steps flagged

No significant circularity; empirical comparison of evaluation regimes is self-contained

full rationale

The paper reports a direct empirical measurement: lexical evaluation yields a 35% mean hallucination rate while the proposed inference-aware regime (calibrated prompting plus medical-ontology retrieval) yields 9%. This is presented as an observed difference between two distinct evaluation protocols applied to the same set of SOAP-note outputs. No equations, fitted parameters, or self-defined quantities appear in the derivation. No self-citations are invoked as load-bearing premises, no uniqueness theorems are imported, and no ansatz or renaming reduces the central claim to its own inputs by construction. The classification step relies on external resources (ontologies and prompting) rather than tautological re-use of the measured rates themselves. The finding is therefore a straightforward before/after comparison and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Only the abstract is available, so the ledger is limited; the central claim rests on domain assumptions about what counts as valid clinical reasoning.

axioms (1)
  • domain assumption: Clinically valid outputs include synonym mapping, abstraction of examination findings, diagnostic inference, and guideline-consistent care planning.
    Invoked to reclassify many lexical hallucinations as legitimate transformations.

pith-pipeline@v0.9.0 · 5521 in / 1129 out tokens · 30815 ms · 2026-05-10T10:25:27.994811+00:00 · methodology

Reference graph

Works this paper leans on

14 extracted references · 5 canonical work pages · 2 internal anchors

  1. [1] C. Sinsky et al. Allocation of physician time in ambulatory practice. Annals of Internal Medicine, 165(11):753–760, 2016.
  2. [2] A. Kalai et al. Why language models hallucinate. arXiv preprint arXiv:2509.04664, 2025.
  3. [3] J. Maynez et al. On faithfulness and factuality in abstractive summarization. In ACL, 2020.
  4. [4] Z. Ji et al. Survey of hallucination in natural language generation. ACM Computing Surveys, 55(12):1–38, 2023.
  5. [5] L. Huang, W. Yu, W. Ma, W. Zhong, Z. Feng, H. Wang, Q. Chen, W. Peng, X. Feng, B. Qin, and T. Liu. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. arXiv preprint arXiv:2311.05232, 2023.
  6. [6] W. Yim et al. ACI-Bench: A novel ambient clinical intelligence dataset for benchmarking automatic visit note generation. arXiv:2306.02022, 2023.
  7. [7] D. Van Veen et al. Clinical text summarization: Adapting large language models can outperform human experts. arXiv:2309.07430, 2023.
  8. [8] L. K. Umapathi et al. Med-HALT: Medical domain hallucination test for large language models. arXiv:2307.15343, 2023.
  9. [9] K. Singhal et al. Large language models encode clinical knowledge. Nature, 620:172–180, 2023.
  10. [10] K. Krishna et al. Generation of patient after-visit summaries to support physicians. In AMIA Annual Symposium Proceedings, 2021.
  11. [11] Y. Gao et al. Summarizing patients' problems from hospital progress notes using pre-trained sequence-to-sequence models. In Proceedings of the 29th International Conference on Computational Linguistics, pages 2979–2991, Gyeongju, Republic of Korea, 2022.
  12. [12] L. Zheng et al. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. In NeurIPS, 2023.
  13. [13] A. Joshi et al. Dr. Summarize: Global summarization of medical dialogue by exploiting local structures. In EMNLP Findings, 2020.
  14. [14] A. B. Abacha et al. Overview of the MEDIQA-Sum task at ACL ClinicalNLP 2023. In ACL Workshop on Clinical NLP, 2023.