pith. machine review for the scientific record.

arxiv: 2605.03916 · v1 · submitted 2026-05-05 · 💻 cs.CL · cs.AI

Recognition: unknown

Atomic Fact-Checking Increases Clinician Trust in Large Language Model Recommendations for Oncology Decision Support: A Randomized Controlled Trial

Denise Bernhardt, Erik Thiele Orberg, Florian Matthes, Jan C. Peeken, Keno Bressem, Linus Marx, Lisa C. Adams, Marcus R. Makowski, Markus Graf, Sebastian Ziegelmayer, Stephanie E. Combs

Pith reviewed 2026-05-07 04:13 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords atomic fact-checking · clinician trust · large language models · oncology decision support · randomized controlled trial · AI transparency · guideline linkage · clinical AI

The pith

Decomposing AI oncology recommendations into verifiable atomic claims linked to guidelines substantially increases clinician trust.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether breaking down large language model treatment recommendations in oncology into small, individually verifiable facts, each tied directly to source guideline documents, builds more trust among clinicians than standard ways of explaining the AI output. A randomized controlled trial with 356 clinicians, who supplied 7,476 separate trust ratings, found that this atomic fact-checking method produced a large increase in trust, lifting the share of clinicians who trusted the recommendations from 26.9 percent to 66.5 percent. Conventional transparency tools such as source citations or summary explanations improved trust only modestly over a no-explanation baseline and followed a clear dose-response pattern. If the finding holds, the approach could help clinicians feel confident enough to incorporate reliable AI suggestions into high-stakes cancer care while still verifying each piece against official guidelines.

Core claim

Atomic fact-checking decomposes large language model recommendations for oncology into individually verifiable claims, each linked to source guideline documents, and thereby produces substantially higher clinician trust than traditional explainability approaches: in a randomized controlled trial of 356 clinicians generating 7,476 trust ratings, it raised the proportion of trusting clinicians from 26.9% to 66.5% (Cohen's d = 0.94), while traditional mechanisms yielded smaller, dose-dependent gains (d = 0.25 to 0.50).
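
As a rough consistency check (ours, not the paper's), the reported proportions can be converted to a log odds ratio and then to an approximate Cohen's d via the standard logistic scaling d ≈ ln(OR)·√3/π. A minimal sketch:

```python
# Back-of-envelope check (editorial, not from the paper): convert the reported
# trust proportions to a log odds ratio, then to Cohen's d via d = ln(OR)*sqrt(3)/pi.
import math

p_control, p_atomic = 0.269, 0.665  # reported proportions expressing trust
log_or = math.log(p_atomic / (1 - p_atomic)) - math.log(p_control / (1 - p_control))
d = log_or * math.sqrt(3) / math.pi
print(f"log OR = {log_or:.2f}, implied d = {d:.2f}")  # log OR ≈ 1.69, d ≈ 0.93
```

The implied d of roughly 0.93 sits close to the reported 0.94, so the headline numbers are at least internally consistent under this approximation.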

What carries the argument

Atomic fact-checking: the decomposition of AI treatment recommendations into individually verifiable claims each linked to source guideline documents.
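
The paper's implementation is not reproduced here, but the data shape the definition implies is straightforward to sketch. A minimal illustration, with hypothetical claim texts and guideline anchors:

```python
# Illustrative sketch of the atomic fact-checking data shape (not the authors'
# implementation): each recommendation is split into atomic claims, each tied
# to a source guideline anchor and carrying a verification verdict.
from dataclasses import dataclass

@dataclass
class AtomicClaim:
    text: str           # one individually checkable statement
    guideline_ref: str  # hypothetical anchor into the source guideline
    verified: bool      # result of checking the claim against that anchor

claims = [
    AtomicClaim("The tumor was completely resected.", "guideline section X (assumed)", True),
    AtomicClaim("Adjuvant chemoradiotherapy is indicated at this stage.",
                "guideline section Y (assumed)", True),
]

# A recommendation is presented as fully verified only if every claim checks out.
print(f"{len(claims)} atomic claims, all verified: {all(c.verified for c in claims)}")
```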

If this is right

  • Clinicians express trust in AI oncology recommendations more than twice as often when each claim is individually verifiable against guidelines.
  • Traditional transparency features improve trust over baseline in a dose-dependent way but fall short of the gains from atomic fact-checking.
  • Higher trust levels could support safer and faster integration of large language models into clinical oncology decision support.
  • The large effect size indicates that atomic fact-checking addresses a major barrier to clinician acceptance of AI tools in high-stakes settings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • If higher trust leads to greater actual use, atomic fact-checking could speed adoption of AI assistance in oncology while preserving clinician oversight.
  • The same decomposition approach might be tested in other medical specialties or in non-clinical high-stakes domains such as legal or financial advice.
  • Scaling atomic fact-checking to routine practice would require reliable automated extraction and verification systems (a toy sketch of that pipeline follows this list).
  • Long-term studies linking trust ratings to patient outcomes would show whether the observed increase in trust improves clinical decisions.
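
A toy sketch of the shape such an automated pipeline might take, as an editorial extrapolation rather than the paper's system: claim extraction, passage retrieval, and a support score. A production pipeline would use an LLM or a natural language inference model for both steps; naive token overlap stands in here purely to show the interfaces.

```python
# Toy extraction-and-verification pipeline (editorial sketch, not the paper's
# system): split a recommendation into candidate claims, retrieve the best
# guideline passage for each, and score lexical support.
def extract_claims(recommendation: str) -> list[str]:
    # Placeholder extraction: sentence-level splitting.
    return [s.strip() for s in recommendation.split(".") if s.strip()]

def support_score(claim: str, passage: str) -> float:
    # Placeholder verification: fraction of claim tokens found in the passage.
    claim_tokens = set(claim.lower().split())
    return len(claim_tokens & set(passage.lower().split())) / max(len(claim_tokens), 1)

guideline_passages = [  # hypothetical guideline snippets
    "adjuvant chemoradiotherapy is recommended for resected stage iii disease",
    "surveillance imaging every three months is recommended after treatment",
]

recommendation = "Adjuvant chemoradiotherapy is recommended. Schedule surveillance imaging."
for claim in extract_claims(recommendation):
    best = max(guideline_passages, key=lambda p: support_score(claim, p))
    print(f"{claim!r} -> support {support_score(claim, best):.2f}")
```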

Load-bearing premise

The atomic fact-checking process extracted claims and linked them to guidelines accurately and without systematic bias, and the measured trust ratings reflect how clinicians would actually behave when caring for real patients.

What would settle it

A replication study that either seeds the fact-checking step with detectable errors or omissions, or measures trust through observed clinical actions rather than survey ratings, and that finds no trust advantage for atomic fact-checking over traditional transparency methods.

read the original abstract

Question: Does atomic fact-checking, which decomposes AI treatment recommendations into individually verifiable claims linked to source guideline documents, increase clinician trust compared to traditional explainability approaches? Findings: In this randomized trial of 356 clinicians generating 7,476 trust ratings, atomic fact-checking produced a large effect on trust (Cohen's d = 0.94), increasing the proportion of clinicians expressing trust from 26.9% to 66.5%. Traditional transparency mechanisms showed a dose-response gradient of improvement over baseline (d = 0.25 to 0.50). Meaning: Decomposing AI recommendations into individually verifiable claims linked to source guidelines produces substantially higher clinician trust than traditional explainability approaches in high-stakes clinical decisions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The paper reports a randomized controlled trial with 356 clinicians providing 7,476 trust ratings on LLM-generated oncology treatment recommendations. It claims that 'atomic fact-checking'—decomposing recommendations into individually verifiable claims linked to source guidelines—produces a large increase in clinician trust (Cohen's d = 0.94), raising the proportion expressing trust from 26.9% to 66.5%, and outperforms traditional transparency mechanisms (dose-response d = 0.25–0.50).

Significance. If the result holds under behavioral validation, the work offers concrete evidence that a structured fact-checking intervention can substantially improve trust in AI clinical decision support. The randomized design, large sample, and specific effect sizes are notable strengths for an empirical study in this area. The finding could inform design of explainability tools for high-stakes medical AI, though its practical value depends on whether survey trust translates to changed clinical decisions.

major comments (3)
  1. [Methods] Methods section (Trust Rating Protocol): The outcome is measured solely via repeated self-report ratings collected inside a survey interface. No behavioral endpoint (e.g., actual selection of the AI recommendation, treatment-plan change, or override of guidelines) is reported. Because the central claim concerns trust 'in oncology decision support,' the absence of validation that ratings predict real uptake undermines the practical interpretation of the d = 0.94 effect size.
  2. [Results] Results (Atomic Fact-Checking Condition): The manuscript provides no quantitative validation or inter-rater reliability data on the accuracy of claim extraction and guideline linking in the atomic fact-checking arm. If extraction errors or selective linking occurred, they could systematically inflate trust ratings, directly affecting the headline comparison to baseline and traditional conditions.
  3. [Statistical Analysis] Statistical Analysis subsection: With 7,476 ratings nested within 356 clinicians, the reported Cohen's d values and proportion shifts (26.9% → 66.5%) require explicit description of the mixed-effects model, handling of repeated measures, and any multiplicity correction. Without these details it is impossible to judge whether the large effect is robust to clustering or to the exact definition of 'expressing trust.'
minor comments (3)
  1. [Abstract] Abstract: Specify whether the trust outcome was binary or Likert-scale and give the exact wording of the item used to compute the 26.9% / 66.5% proportions.
  2. [Methods] Methods: Provide the pre-registration identifier (if any) and a CONSORT-style flow diagram to allow verification of randomization and attrition.
  3. [Discussion] Discussion: Add a short paragraph on generalizability beyond the surveyed clinician population and oncology scenarios tested.

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for their insightful comments on our randomized controlled trial. We agree that additional details on the statistical analysis and explicit discussion of limitations are warranted. We respond to each major comment below and will revise the manuscript accordingly.

read point-by-point responses
  1. Referee: [Methods] Methods section (Trust Rating Protocol): The outcome is measured solely via repeated self-report ratings collected inside a survey interface. No behavioral endpoint (e.g., actual selection of the AI recommendation, treatment-plan change, or override of guidelines) is reported. Because the central claim concerns trust 'in oncology decision support,' the absence of validation that ratings predict real uptake undermines the practical interpretation of the d = 0.94 effect size.

    Authors: We acknowledge that our primary outcome relies on self-reported trust ratings rather than direct behavioral measures of clinical decision-making. This design choice reflects the ethical and logistical challenges of conducting a study that alters real patient care in a high-stakes oncology setting. The trial was intended to isolate the effect of atomic fact-checking on trust as a foundational step. We will revise the manuscript to include a more explicit discussion of this limitation in the Discussion section, noting that while self-reported trust is a widely used proxy in AI evaluation studies, future research should examine whether these ratings correlate with actual changes in treatment recommendations or adherence to guidelines. We maintain that the observed large effect size provides important evidence for improving AI transparency mechanisms even if behavioral validation is pending. revision: partial

  2. Referee: [Results] Results (Atomic Fact-Checking Condition): The manuscript provides no quantitative validation or inter-rater reliability data on the accuracy of claim extraction and guideline linking in the atomic fact-checking arm. If extraction errors or selective linking occurred, they could systematically inflate trust ratings, directly affecting the headline comparison to baseline and traditional conditions.

    Authors: We agree that validation of the fact-checking process is important. The atomic fact-checking was implemented by the research team following a detailed protocol that involved decomposing recommendations into atomic claims and linking each to specific sections of the NCCN guidelines. However, we did not perform or report a formal inter-rater reliability assessment for this process. This represents a limitation of the current work. In the revised manuscript, we will expand the Methods section to provide a more thorough description of the fact-checking protocol and add a statement in the Limitations section acknowledging the absence of quantitative validation metrics. We do not have the data to retroactively compute inter-rater reliability, but we believe the standardized protocol minimizes the risk of systematic bias. revision: partial

  3. Referee: [Statistical Analysis] Statistical Analysis subsection: With 7,476 ratings nested within 356 clinicians, the reported Cohen's d values and proportion shifts (26.9% → 66.5%) require explicit description of the mixed-effects model, handling of repeated measures, and any multiplicity correction. Without these details it is impossible to judge whether the large effect is robust to clustering or to the exact definition of 'expressing trust.'

    Authors: We thank the referee for pointing out the need for greater statistical transparency. The analysis employed a mixed-effects logistic regression model with clinician as a random effect to account for the repeated measures structure (multiple ratings per clinician). The model was implemented in R using the lme4 package. 'Expressing trust' was defined as a binary outcome where ratings of 4 or 5 on the 5-point Likert scale were coded as 1. Cohen's d was calculated from the model coefficients. No multiplicity correction was applied because the primary hypothesis tests were pre-specified in the analysis plan. We will fully revise the Statistical Analysis subsection to include the model specification, software details, and any sensitivity analyses performed. These additions will allow readers to better assess the robustness of the reported effect sizes. revision: yes
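
The rebuttal describes the model in R's lme4; the closest Python analogue is statsmodels' Bayesian mixed GLM. A rough sketch under the same stated specification, with hypothetical column and file names:

```python
# Rough Python analogue (an assumption: the authors' analysis used R/lme4) of
# the described model: mixed-effects logistic regression with a random
# intercept per clinician and a binary trust outcome (Likert >= 4 coded as 1).
import pandas as pd
from statsmodels.genmod.bayes_mixed_glm import BinomialBayesMixedGLM

ratings = pd.read_csv("trust_ratings.csv")  # hypothetical file: one row per rating
ratings["trust"] = (ratings["likert"] >= 4).astype(int)  # binary coding per the rebuttal

model = BinomialBayesMixedGLM.from_formula(
    "trust ~ C(condition)",                            # fixed effect: explanation condition
    vc_formulas={"clinician": "0 + C(clinician_id)"},  # random intercept per clinician
    data=ratings,
)
result = model.fit_vb()  # variational Bayes estimation
print(result.summary())
```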

standing simulated objections not resolved
  • The study does not include behavioral endpoints, and we cannot provide data on whether trust ratings predict real-world clinical decisions as no such measures were collected in this survey-based trial.

Circularity Check

0 steps flagged

No circularity: direct empirical RCT results with no derivations or self-referential reductions

full rationale

The paper reports a randomized controlled trial collecting 7,476 trust ratings from 356 clinicians across explanation conditions. No equations, fitted parameters, predictions, or first-principles derivations are present that could reduce the reported effect sizes (Cohen's d = 0.94 for atomic fact-checking) to inputs by construction. Trust proportions (26.9% to 66.5%) are computed directly from the randomized survey responses. Self-citations, if any, are not load-bearing for the central empirical claim. The skeptic concern addresses measure validity (self-report vs. behavioral endpoint) rather than circularity in any derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The paper is an empirical clinical trial. It rests on standard RCT assumptions and the validity of the fact-checking intervention rather than new mathematical entities or fitted parameters.

axioms (2)
  • domain assumption Clinician trust ratings collected via survey are a valid proxy for acceptance and intended use of AI recommendations in clinical practice.
    The primary outcome is trust; the study assumes this measure predicts real-world behavior.
  • domain assumption The atomic fact-checking decomposition accurately represents the original LLM recommendation without introducing factual errors or changing clinical meaning.
    The intervention's benefit depends on correct extraction and linkage to guidelines.

pith-pipeline@v0.9.0 · 5472 in / 1552 out tokens · 92334 ms · 2026-05-07T04:13:49.216640+00:00 · methodology


Reference graph

Works this paper leans on

27 extracted references · 4 canonical work pages · 1 internal anchor

  1. [1]

    Singhal, K. et al. Large Language Models Encode Clinical Knowledge. Preprint at https://doi.org/10.48550/ARXIV.2212.13138 (2022)

  2. [2]

    Kung, T. H. et al. Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models. PLOS Digit Health 2, e0000198 (2023)

  3. [3]

    Lee, P., Bubeck, S. & Petro, J. Benefits, Limits, and Risks of GPT-4 as an AI Chatbot for Medicine. N Engl J Med 388, 1233–1239 (2023)

  4. [4]

    Thirunavukarasu, A. J. et al. Large language models in medicine. Nat Med 29, 1930–1940 (2023)

  5. [5]

    Rajpurkar, P., Chen, E., Banerjee, O. & Topol, E. J. AI in health and medicine. Nat Med 28, 31–38 (2022)

  6. [6]

    Kelly, C. J., Karthikesalingam, A., Suleyman, M., Corrado, G. & King, D. Key challenges for delivering clinical impact with artificial intelligence. BMC Med 17, 195 (2019)

  7. [7]

    Han, R. et al. Randomised controlled trials evaluating artificial intelligence in clinical practice: a scoping review. The Lancet Digital Health 6, e367–e373 (2024)

  8. [8]

    Jones, C., Thornton, J. & Wyatt, J. C. Artificial intelligence and clinical decision support: clinicians’ perspectives on trust, trustworthiness, and liability. Medical Law Review 31, 501–520 (2023)

  9. [9]

    Rojas, J. C., Teran, M. & Umscheid, C. A. Clinician Trust in Artificial Intelligence. Critical Care Clinics 39, 769–782 (2023)

  10. [10]

    Markus, A. F., Kors, J. A. & Rijnbeek, P. R. The role of explainability in creating trustworthy artificial intelligence for health care: A comprehensive survey of the terminology, design choices, and evaluation strategies. Journal of Biomedical Informatics 113, 103655 (2021)

  11. [11]

    The Precise4Q consortium et al. Explainability for artificial intelligence in healthcare: a multidisciplinary perspective. BMC Med Inform Decis Mak 20, 310 (2020)

  12. [12]

    Wei, J. et al. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. Preprint at https://doi.org/10.48550/ARXIV.2201.11903 (2022)

  13. [13]

    Gombolay, G. Y. et al. Effects of explainable artificial intelligence in neurology decision support. Ann Clin Transl Neurol 11, 1224–1235 (2024)

  14. [14]

    Ploug, T., Sundby, A., Moeslund, T. B. & Holm, S. Population Preferences for Performance and Explainability of Artificial Intelligence in Health Care: Choice-Based Conjoint Survey. J Med Internet Res 23, e26611 (2021)

  15. [15]

    Vladika, J. et al. Improving Reliability and Explainability of Medical Question Answering through Atomic Fact Checking in Retrieval-Augmented LLMs. Preprint at https://doi.org/10.48550/ARXIV.2505.24830 (2025)

  16. [16]

    Kahneman, D. Thinking, Fast and Slow. (Farrar, Straus and Giroux, New York, 2011)

  17. [17]

    Tonekaboni, S., Joshi, S., McCradden, M. D. & Goldenberg, A. What Clinicians Want: Contextualizing Explainable Machine Learning for Clinical End Use. in Proceedings of the 4th Machine Learning for Healthcare Conference (eds. Doshi-Velez, F. et al.) vol. 106 359–380 (PMLR, 2019)

  18. [18]

    Ghassemi, M., Oakden-Rayner, L. & Beam, A. L. The false hope of current approaches to explainable artificial intelligence in health care. The Lancet Digital Health 3, e745–e750 (2021)

  19. [19]

    Holzinger, A., Biemann, C., Pattichis, C. S. & Kell, D. B. What do we need to build explainable AI systems for the medical domain? Preprint at https://doi.org/10.48550/ARXIV.1712.09923 (2017)

  20. [20]

    Gaube, S. et al. Do as AI say: susceptibility in deployment of clinical decision-aids. npj Digit. Med. 4, 31 (2021)

  21. [21]

    Jacobs, M. et al. How machine-learning recommendations influence clinician treatment selections: the example of antidepressant selection. Transl Psychiatry 11, 108 (2021)

  22. [22]

    Topol, E. J. High-performance medicine: the convergence of human and artificial intelligence. Nat Med 25, 44–56 (2019)

  23. [23]

    Verghese, A., Shah, N. H. & Harrington, R. A. What This Computer Needs Is a Physician: Humanism and Artificial Intelligence. JAMA 319, 19 (2018)

  24. [24]

    McKinney, S. M. et al. International evaluation of an AI system for breast cancer screening. Nature 577, 89–94 (2020)

  25. [25]

    Bera, K., Schalper, K. A., Rimm, D. L., Velcheti, V. & Madabhushi, A. Artificial intelligence in digital pathology — new tools for diagnosis and precision oncology. Nat Rev Clin Oncol 16, 703–715 (2019)

  26. [26]

    Rudin, C. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nat Mach Intell 1, 206–215 (2019)

  27. [27]

    Huynh, E. et al. Artificial intelligence in radiation oncology. Nat Rev Clin Oncol 17, 771–781 (2020)