Hallucination Detection-Guided Preference Optimization for Clinical Summarization

Andrew McCallum; Avijit Mitra; Dung Ngoc Thai; Rami Matar; Shamanth Kuthpadi Seethakantha; Simran Tiwari; Vara Prasad Gudi; Wael Salloum; Wenlong Zhao

arxiv: 2605.28910 · v1 · pith:3B27LLZYnew · submitted 2026-05-27 · 💻 cs.CL · cs.AI


Shamanth Kuthpadi Seethakantha, Dung Ngoc Thai, Vara Prasad Gudi, Simran Tiwari, Rami Matar, Avijit Mitra, Wenlong Zhao, Wael Salloum, Andrew McCallum

Pith reviewed 2026-06-29 12:26 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords hallucination reduction · clinical summarization · preference optimization · factual faithfulness · LLM alignment · MIMIC-IV


The pith

Detector-guided iterative revisions and preference pairs derived from them reduce hallucinations in clinical note summaries by up to 48 percent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models frequently insert unsupported statements when summarizing clinical notes, limiting their use in healthcare. The paper tests an inference-time process that applies hallucination detectors to iteratively revise summaries toward factual corrections. It then converts the resulting revision trajectories into preference pairs and uses them to fine-tune the model. Experiments on real-world notes from the MIMIC-IV database show that the inference-time method cuts hallucinations by 24 percent and the preference-learning method cuts them by 48 percent on Llama-3.1-8B-Instruct, with human experts and LLM juries reporting no drop in fluency, coherence, or relevance. The same pattern holds for Gemma models.

Core claim

Detector-guided iterative refinement produces factual corrections at inference time; converting those refinement trajectories into preference pairs for direct preference optimization produces models that generate summaries with substantially fewer unsupported statements on clinical notes.
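The inference-time half of this claim can be sketched as a loop: detect unsupported content, revise, and repeat until nothing is flagged or a round budget runs out. The code below is a minimal illustration under stated assumptions, not the paper's implementation. The names `refine`, `toy_detect`, and `toy_revise` are hypothetical, and the toy detector (a sentence counts as supported only if it appears verbatim in the note) stands in for a real hallucination detector and an LLM reviser.

```python
from typing import Callable, List


def split_sentences(text: str) -> List[str]:
    """Naive sentence splitter used only by the toy components below."""
    return [s.strip() for s in text.split(".") if s.strip()]


def toy_detect(note: str, summary: str) -> List[str]:
    """Hypothetical detector: flag summary sentences absent from the note."""
    supported = set(split_sentences(note))
    return [s for s in split_sentences(summary) if s not in supported]


def toy_revise(note: str, summary: str, flagged: List[str]) -> str:
    """Hypothetical reviser: drop one flagged sentence per round."""
    kept = [s for s in split_sentences(summary) if s != flagged[0]]
    return ". ".join(kept) + "."


def refine(
    note: str,
    draft: str,
    detect: Callable[[str, str], List[str]],
    revise: Callable[[str, str, List[str]], str],
    max_rounds: int = 3,
) -> List[str]:
    """Return the full revision trajectory, starting with the initial draft."""
    trajectory = [draft]
    for _ in range(max_rounds):
        flagged = detect(note, trajectory[-1])
        if not flagged:  # nothing left to correct, so stop early
            break
        trajectory.append(revise(note, trajectory[-1], flagged))
    return trajectory


if __name__ == "__main__":
    note = "Patient admitted with pneumonia. Treated with antibiotics."
    draft = (
        "Patient admitted with pneumonia. Underwent surgery. "
        "Started on insulin. Treated with antibiotics."
    )
    for step, summary in enumerate(refine(note, draft, toy_detect, toy_revise)):
        print(step, summary)
```

Returning the whole trajectory rather than only the final summary is a deliberate choice in this sketch: the intermediate drafts are what the preference-learning step would consume.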

What carries the argument

Hallucination-detector-guided iterative revision trajectories converted into preference pairs for model fine-tuning.
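One way to picture the conversion step is shown below. It is a sketch under assumptions: the pairing rule (prefer any later draft that has strictly fewer detector flags than an earlier one) is a guess for illustration, since the summary above does not state the paper's exact rule. The function name `trajectory_to_pairs` and the `count_flags` callback are hypothetical; the `prompt` / `chosen` / `rejected` layout is one commonly used by preference-optimization tooling.

```python
from typing import Callable, Dict, List


def trajectory_to_pairs(
    prompt: str,
    trajectory: List[str],
    count_flags: Callable[[str], int],
) -> List[Dict[str, str]]:
    """Build preference pairs from one revision trajectory.

    Assumed rule: a later draft is 'chosen' over an earlier draft only when
    the detector flags strictly fewer spans in it.
    """
    counts = [count_flags(summary) for summary in trajectory]
    pairs = []
    for i in range(len(trajectory)):
        for j in range(i + 1, len(trajectory)):
            if counts[j] < counts[i]:
                pairs.append(
                    {
                        "prompt": prompt,
                        "chosen": trajectory[j],
                        "rejected": trajectory[i],
                    }
                )
    return pairs


if __name__ == "__main__":
    # Toy flag counter (hypothetical): every "X" marks one unsupported span.
    trajectory = ["a X X", "a X", "a"]
    for pair in trajectory_to_pairs("Summarize the note.", trajectory, lambda s: s.count("X")):
        print(pair["rejected"], "->", pair["chosen"])
```

Requiring a strict decrease in flags keeps pairs where the revision changed nothing, or made things worse, out of the training set.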

If this is right

The inference-time revision method alone reduces hallucinations by 24 percent on Llama-3.1-8B-Instruct while preserving summary quality.
The preference-learning method reduces hallucinations by 48 percent on the same model with no measured loss in fluency, coherence, or relevance.
Both methods apply across Llama and Gemma model families when summarizing real clinical notes from MIMIC-IV.
Detection-informed refinement and preference learning together provide an automated route to higher factual faithfulness in clinical summarization.
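To make the percentages concrete, the sketch below assumes they are relative reductions in detector-flagged hallucinations against the base model. The counts are invented for illustration and do not come from the paper; `relative_reduction` is a hypothetical helper.

```python
def relative_reduction(baseline: float, treated: float) -> float:
    """Fraction by which `treated` is lower than `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - treated) / baseline


if __name__ == "__main__":
    # Hypothetical counts chosen to match the reported percentages.
    base, inference_time, fine_tuned = 200, 152, 104
    print(round(relative_reduction(base, inference_time), 2))
    print(round(relative_reduction(base, fine_tuned), 2))
    # Improvement of the fine-tuned model over the inference-time method.
    print(round(relative_reduction(inference_time, fine_tuned), 3))
```

Under these made-up counts, the step from the inference-time method to the fine-tuned model is itself a reduction of roughly a third, which is a different number from the 48 percent measured against the base model.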

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same detector-guided trajectory approach could be tested on other medical text-generation tasks such as discharge-instruction generation or radiology report expansion.
Further gains would require detectors whose error patterns do not overlap with the errors the model is already making.
The preference pairs created this way are synthetic and could be combined with human preference data to test whether mixed training yields additional improvements.

Load-bearing premise

The hallucination detectors used to guide revisions and create preference pairs are accurate enough that they do not systematically miss errors or introduce new ones that would invalidate the measured reductions.

What would settle it

The claim would be undermined if independent clinicians, counting unsupported statements in summaries produced by the fine-tuned models on a new set of MIMIC-IV notes, found no reduction relative to the base models.

Figures

Figures reproduced from arXiv: 2605.28910 by Andrew McCallum, Avijit Mitra, Dung Ngoc Thai, Rami Matar, Shamanth Kuthpadi Seethakantha, Simran Tiwari, Vara Prasad Gudi, Wael Salloum, Wenlong Zhao.

**Figure 1.** Overview of hallucination mitigation via detection-informed self-refinement. Given an input clinical note, a language model generates an initial summary that may contain unsupported or hallucinated medical content. A hallucination detector identifies unsupported content, which is used to guide iterative self-refinement toward removing factual errors rather than stylistic changes (top; HDSR). The intermedia…

Original abstract

Large language models (LLMs) have shown promise on summarization tasks, but they often produce hallucinations, which are unsupported or incorrect statements that limit their reliability in specialized healthcare applications. We introduce \itermodelfull (\itermodel), an inference-time method that leverages hallucination detectors to guide iterative summary revisions toward factual corrections. Building on this, we propose \itermodel for Preference Learning (\model), which converts detector-guided refinement trajectories into preference pairs for model finetuning. Extensive experiments show that our methods substantially reduce hallucinations for Llama and Gemma models in summarizing real-world clinical notes from MIMIC-IV. For example, \itermodel reduces 24% and \model reduces 48% hallucinations in Llama-3.1-8B-Instruct. Importantly, both methods preserve summary fluency, coherence, and relevance according to human expert and LLM-Jury evaluations. Together, these results demonstrate that detection-informed refinement and preference learning offer an automated solution for improving factual faithfulness in clinical summarization.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper chains hallucination detectors into iterative revision at inference time and then into preference pairs for fine-tuning on clinical notes, with reported reductions that rest on the detector's own outputs.


The paper's main contribution is an inference-time method that uses hallucination detectors to iteratively revise clinical summaries, followed by a preference learning version that turns those revision trajectories into training pairs. On Llama-3.1-8B-Instruct summarizing MIMIC-IV notes, it claims 24% reduction from the inference method and 48% from the fine-tuned one, while human experts and an LLM jury see no drop in fluency, coherence, or relevance.

What is new is the end-to-end use of detector outputs to both correct summaries on the fly and generate preference data for the clinical domain. It takes existing detection tools and alignment techniques and applies them in a loop tailored to medical note summarization.

This has value because it offers an automated path to better factual faithfulness without extra human labels. The choice of real-world clinical notes from MIMIC-IV grounds the experiments in a high-stakes setting where hallucinations matter.

The soft spot is the lack of independent verification for the hallucination reductions. Since the same detectors guide the revisions and score the outputs, any systematic error in the detector could make the improvements look larger than they are. The evaluations from humans and the jury focus only on style and relevance, not on whether the content is more accurate or less hallucinated. Without details on detector performance or separate factuality checks, the percentage claims are difficult to trust fully.

This work is for researchers focused on making LLMs safer for healthcare applications. A reader interested in practical ways to combine detection and preference tuning would find the pipeline worth considering.

The paper shows honest engagement with the problem and has enough structure to warrant review. I would send it to peer review so that the methods and any additional analyses can be examined closely.

Referee Report

3 major / 2 minor

Summary. The paper introduces an inference-time method that uses hallucination detectors to iteratively revise clinical summaries, and a preference-learning variant that converts detector-guided refinement trajectories into preference pairs for LLM fine-tuning. On real-world clinical notes from MIMIC-IV, the methods are reported to reduce hallucinations by 24% (inference-time method) and 48% (preference-learning variant) for Llama-3.1-8B-Instruct, with similar gains for Gemma models, while preserving fluency, coherence, and relevance per human expert and LLM-Jury evaluations.

Significance. If the hallucination reductions hold under independent validation, the work provides a practical, automated pipeline for improving factual faithfulness in clinical summarization without degrading other summary qualities. The detector-guided preference optimization approach could be impactful for safety-critical domains where hallucinations limit LLM deployment.

major comments (3)

[Abstract and §4] Abstract and §4 (Experiments): The central quantitative claims of 24% and 48% hallucination reduction are derived solely from the hallucination detector outputs; the manuscript provides no independent human annotation of hallucination presence/absence in the revised vs. baseline summaries to confirm these percentages reflect true factual improvements rather than detector bias or error.
[§3.1 and §3.2] §3.1 (inference-time refinement) and §3.2 (Preference Pair Construction): The methods depend on the detector being sufficiently accurate to both guide revisions and label preference pairs, yet no precision/recall figures, error analysis, or ablation on detector quality are reported for the MIMIC-IV summarization domain; this is load-bearing because systematic under-detection in revised outputs would artifactually inflate the reported gains.
[§4.3] §4.3 (Human and LLM-Jury Evaluation): Evaluations are restricted to fluency, coherence, and relevance; the absence of a direct hallucination-focused human study means there is no cross-check against the detector-based metric that underpins the primary results.
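The detector-quality figures requested in the second comment could be computed along the following lines. This is a sketch under assumptions: spans are `(start, end)` character offsets, a predicted span counts as correct only on an exact match with a human-annotated span, and `span_prf` is a hypothetical helper. Real evaluations often use partial-overlap or token-level matching instead.

```python
from typing import Iterable, Tuple

Span = Tuple[int, int]  # (start, end) character offsets


def span_prf(predicted: Iterable[Span], gold: Iterable[Span]) -> Tuple[float, float, float]:
    """Exact-match precision, recall, and F1 of detector spans against gold spans."""
    pred, ref = set(predicted), set(gold)
    true_pos = len(pred & ref)
    precision = true_pos / len(pred) if pred else 0.0
    recall = true_pos / len(ref) if ref else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    detector_spans = [(0, 5), (10, 20), (30, 35)]
    clinician_spans = [(0, 5), (10, 20), (40, 50), (60, 70)]
    print(span_prf(detector_spans, clinician_spans))
```

Reporting these numbers separately for baseline and revised summaries would speak directly to the under-detection concern, because it shows whether recall shifts after revision.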

minor comments (2)

[Abstract] The abstract introduces the method names without immediate expansion; ensure the first use in the introduction spells out the full names for clarity.
[§4] Table captions and experimental setup descriptions should explicitly list all baselines, detector variants, and statistical tests used for the percentage reductions.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We appreciate the referee's thorough review and the opportunity to address these important points regarding our evaluation of hallucination reductions. We provide detailed responses to each major comment and indicate the revisions we will make to strengthen the manuscript.


Referee: Abstract and §4: The central quantitative claims of 24% and 48% hallucination reduction are derived solely from the hallucination detector outputs; the manuscript provides no independent human annotation of hallucination presence/absence in the revised vs. baseline summaries to confirm these percentages reflect true factual improvements rather than detector bias or error.

Authors: We acknowledge that the 24% and 48% reductions are measured via the hallucination detector, which enables scalable, consistent quantification across the MIMIC-IV test set. The human expert and LLM-Jury evaluations in §4.3 were intentionally focused on fluency, coherence, and relevance to verify that detector-guided revisions do not degrade other summary properties. Because the iterative process explicitly targets and corrects detector-flagged spans, the reduction in detected hallucinations is a direct outcome of the method. We will add a limitations paragraph clarifying the reliance on the detector and noting that future work could include targeted human hallucination annotation; no such annotation is added in this revision due to resource constraints. revision: partial
Referee: §3.1 and §3.2: The methods depend on the detector being sufficiently accurate to both guide revisions and label preference pairs, yet no precision/recall figures, error analysis, or ablation on detector quality are reported for the MIMIC-IV summarization domain; this is load-bearing because systematic under-detection in revised outputs would artifactually inflate the reported gains.

Authors: The detector is drawn from prior published work whose general-domain performance is documented in its source paper. We did not report MIMIC-IV-specific precision/recall or conduct an ablation in the submitted manuscript. In the revision we will insert a short error-analysis subsection that (a) cites the detector’s original metrics, (b) discusses the risk of domain shift, and (c) notes that any systematic under-detection would affect both baseline and revised outputs equally, thereby preserving the relative reduction. If space permits, we will also report a small manual spot-check on a subset of samples. revision: partial
Referee: §4.3: Evaluations are restricted to fluency, coherence, and relevance; the absence of a direct hallucination-focused human study means there is no cross-check against the detector-based metric that underpins the primary results.

Authors: We agree that a dedicated human hallucination annotation study would constitute an independent cross-check. Our current protocol instead uses expert review of overall summary quality together with an LLM-Jury to confirm that revisions preserve (or improve) readability and clinical utility. We will revise §4.3 and the discussion section to explicitly state the absence of hallucination-specific human labels and to frame this as a natural direction for follow-up work. The combination of detector-driven quantitative gains and preserved qualitative scores remains the core evidence presented. revision: partial
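The disagreement in the second exchange turns on whether detector recall is the same before and after revision. The arithmetic below is a simplified model with invented numbers: it assumes the detector produces no false positives and finds a fixed fraction (`recall`) of the true hallucinations. `measured_reduction` is a hypothetical helper.

```python
def measured_reduction(
    true_base: float,
    true_revised: float,
    recall_base: float,
    recall_revised: float,
) -> float:
    """Relative reduction as seen through a detector with imperfect recall."""
    detected_base = true_base * recall_base
    detected_revised = true_revised * recall_revised
    return 1 - detected_revised / detected_base


if __name__ == "__main__":
    # Hypothetical ground truth: 100 hallucinations before, 76 after revision.
    print(round(measured_reduction(100, 76, 0.8, 0.8), 2))  # equal recall
    print(round(measured_reduction(100, 76, 0.8, 0.6), 2))  # lower recall on revised text
```

With equal recall the measured reduction matches the true one, which is the authors' point. If revision rewrites errors into forms the detector misses, recall drops on the revised text and the measured reduction overstates the true one, which is the referee's point.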

Circularity Check

0 steps flagged

No circularity in empirical methods or measurements


The paper introduces two empirical methods, one for inference-time revision and one for preference learning, that use hallucination detectors to guide revisions and create training pairs, with results reported as percentage reductions on MIMIC-IV data. No derivation chain, equations, or fitted parameters are described that reduce any claimed prediction or result to quantities defined by the paper's own inputs. The work contains no self-definitional steps, fitted-input predictions, or load-bearing self-citations that would trigger the enumerated circularity patterns. The central claims rest on experimental outcomes rather than any mathematical construction that collapses to its premises.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only abstract available; no free parameters, axioms, or invented entities are described.

pith-pipeline@v0.9.1-grok · 5741 in / 1151 out tokens · 30876 ms · 2026-06-29T12:26:30.123657+00:00 · methodology


Reference graph

Works this paper leans on

14 extracted references

[8]

Return only the revised summary text (no explanations or markup). --- [SOURCE FACTS / BRIEF HOSPITAL COURSE] {{ context }} [DRAFT SUMMARY WITH FEEDBACK ANNOTATIONS] {{ summary }} <|end|> Self-refinement revision with detectors Prompt <|system|> You are a clinical summarization assistant. <|end|> <|user|> You will be given: - A brief hospital course (SOU...
[9]

Treat the SOURCE FACTS as authoritative
[10]

</error>: - If the content is unsupported or contradicted by the SOURCE FACTS -> remove or correct it

For each segment wrapped in <error> ... </error>: - If the content is unsupported or contradicted by the SOURCE FACTS -> remove or correct it. - If the content is partially correct -> rewrite it using information from the SOURCE FACTS. - If the content is accurate -> keep it, but remove the <error> tags
[11]

Check for missing or incomplete information in the DRAFT SUMMARY compared to the SOURCE FACTS, especially: - Chief complaint / reason for visit - Presenting symptoms - Procedures performed - Medications (new, changed, or discontinued) - Vital signs - Key laboratory or imaging findings
[12]

Do **not** invent or infer new diagnoses, medications, procedures, or dates
[13]

Keep professional tone and structure suitable for a discharge or after-visit summary
[14]

Prefer terms and values exactly as stated in the SOURCE FACTS
[15]

a key procedure or discharge medication), ADD it

If the DRAFT SUMMARY is missing clinically important info that IS present in the SOURCE FACTS (e.g. a key procedure or discharge medication), ADD it
[16]

You were admitted

Always start the summary with "You were admitted" and refer to the patient as you / your
[17]

Return only the revised summary text (no explanations or markup). --- [SOURCE FACTS / BRIEF HOSPITAL COURSE] {{ context }} [DRAFT SUMMARY WITH FEEDBACK ANNOTATIONS] {{ summary_with_errors }} <|end|> A.3 Hallucination Detection Medalign zero-shot Prompt <|system|> You are a helpful assistant that helps patients understand their medical records. <|end|> <|u...
[19]

Please take your medications as prescribed

Incorrect fact And below is the detailed guideline, and we label error spans with the <error> tag (e.g. <error>incorrect fact</error>). ### Allowed General Medical Knowledge and Medical Advice We allow general medical knowledge and advice that is often part of the AVS. Usually, these are information that are not specific for the hospital course given in t...
[21]

We performed an <error>esophageal-gastro-duodenoscopy (EGD).</error>

Incorrect fact And below is the detailed guideline, and we label error spans with the <error> tag (e.g. <error>incorrect fact</error>). ### Determining Span of Errors We label the smallest possible consecutive span that specifies the error given the BHC as a context. Removing further parts from the span would remove important information. A useful heurist...
[22]

Unsupported facts, including condition/procedure/medication/time/location/number/name/word/other
[23]

error_type

Incorrect fact And below is the detailed guideline, and we label error spans with the <error> tag (e.g. <error class="error_type">incorrect fact</error>). ### Determining Span of Errors We label the smallest possible consecutive span that specifies the error given the BHC as a context. Removing further parts from the span would remove important informatio...
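The extracted prompt fragments above suggest that detector output is injected into the draft as `<error>` tags before the revision prompt is filled. The sketch below illustrates that idea under assumptions: `wrap_errors` and `fill_prompt` are hypothetical helpers, spans are assumed to be non-overlapping character offsets, and the template is only the short excerpt visible above, not the full prompt.

```python
from typing import List, Tuple

# Excerpt of the revision prompt as it appears in the extracted text above.
TEMPLATE = (
    "[SOURCE FACTS / BRIEF HOSPITAL COURSE] {{ context }} "
    "[DRAFT SUMMARY WITH FEEDBACK ANNOTATIONS] {{ summary_with_errors }}"
)


def wrap_errors(summary: str, spans: List[Tuple[int, int]]) -> str:
    """Wrap each flagged (start, end) span of the summary in <error> tags."""
    pieces, cursor = [], 0
    for start, end in sorted(spans):  # assumes spans do not overlap
        pieces.append(summary[cursor:start])
        pieces.append("<error>" + summary[start:end] + "</error>")
        cursor = end
    pieces.append(summary[cursor:])
    return "".join(pieces)


def fill_prompt(context: str, annotated_summary: str) -> str:
    """Substitute the two placeholders with plain string replacement."""
    return (
        TEMPLATE.replace("{{ context }}", context)
        .replace("{{ summary_with_errors }}", annotated_summary)
    )


if __name__ == "__main__":
    summary = "We performed an EGD. You improved."
    start = summary.index("EGD")
    annotated = wrap_errors(summary, [(start, start + 3)])
    print(fill_prompt("Admitted for chest pain. No procedures.", annotated))
```

Plain string replacement is used here only to keep the example dependency-free; a template engine would be the more usual choice for `{{ ... }}` placeholders.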
