pith. machine review for the scientific record.

arxiv: 2605.11963 · v1 · submitted 2026-05-12 · 💻 cs.CV

Recognition: no theorem link

What Does It Mean for a Medical AI System to Be Right?

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 05:34 UTC · model grok-4.3

classification 💻 cs.CV
keywords medical AI · correctness · explainability · ground truth labels · automation bias · clinical metrics · accountability · multiple myeloma

The pith

Correctness in medical AI is a multi-dimensional concept that cannot be reduced to benchmark performance alone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper grounds its question in the concrete task of automatically classifying plasma cells in digitized bone marrow smears to diagnose multiple myeloma. It argues that an AI system counts as right only when expert-labeled datasets are available, model outputs can be explained and interpreted, evaluation metrics align with clinical needs, and accountability is properly shared between humans and the algorithm. A sympathetic reader would care because narrow focus on benchmark scores risks deploying systems that fail in real clinical settings despite high test accuracy. The argument develops through four themes that show how ground truth labels can be unstable, AI decisions can be opaque even when confident, standard metrics can miss clinical meaning, and time pressure can produce automation bias.
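The label-instability theme has a standard quantitative handle: chance-corrected agreement between annotators. A minimal sketch, with hypothetical per-cell labels from two annotators (values invented for illustration, not taken from the paper), scoring agreement with Cohen's kappa:

```python
from collections import Counter

def cohens_kappa(a, b):
    """Chance-corrected agreement between two annotators."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Expected agreement if each annotator labeled at random
    # according to their own marginal label frequencies.
    ca, cb = Counter(a), Counter(b)
    expected = sum(ca[k] * cb[k] for k in set(a) | set(b)) / n**2
    return (observed - expected) / (1 - expected)

# Hypothetical labels for ten cells ("P" = plasma cell, "O" = other).
expert_1 = ["P", "P", "O", "P", "O", "O", "P", "O", "O", "P"]
expert_2 = ["P", "O", "O", "P", "O", "P", "P", "O", "O", "O"]

kappa = cohens_kappa(expert_1, expert_2)
```

Here the two experts agree on 7 of 10 cells, yet kappa is only 0.4 once chance agreement is discounted — the kind of gap that makes a "ground truth" benchmark less solid than its accuracy numbers suggest.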

Core claim

Correctness in medical AI is not a singular property reducible to benchmark performance, but a multi-dimensional concept involving the availability of expertly labeled medical datasets, the explainability and interpretability of model outputs, the clinical meaningfulness of evaluation metrics, and the distribution of accountability in human-AI workflows. The claim is developed through four interrelated themes: the instability of ground truth labels, the opacity of overconfident AI, the inadequacy of standard clinical metrics, and the risk of automation bias in time-pressured clinical settings, all illustrated in the plasma cell classification task.
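The "overconfident AI" theme is also measurable: the calibration literature the paper cites compares a model's stated confidence with its observed accuracy. A hedged sketch of expected calibration error (ECE) over equal-width confidence bins, with invented predictions:

```python
def expected_calibration_error(confidences, correct, n_bins=5):
    """Weighted average gap between stated confidence and
    observed accuracy, computed over equal-width bins."""
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        idx = [i for i, c in enumerate(confidences) if lo < c <= hi]
        if not idx:
            continue
        avg_conf = sum(confidences[i] for i in idx) / len(idx)
        accuracy = sum(correct[i] for i in idx) / len(idx)
        ece += (len(idx) / n) * abs(avg_conf - accuracy)
    return ece

# Invented example: every prediction claims ~90% confidence,
# but only 3 of 5 are actually correct.
conf = [0.90, 0.92, 0.88, 0.91, 0.90]
hit = [1, 1, 0, 1, 0]
gap = expected_calibration_error(conf, hit)  # large gap: overconfident
```

A model can be opaque and confidently wrong at once; a large ECE is one concrete symptom of the opacity problem the claim names.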

What carries the argument

The four interrelated themes of label instability, AI output opacity, metric inadequacy, and automation bias that together establish a multi-dimensional account of correctness instead of a single benchmark score.

If this is right

  • Development of medical AI must treat creation of high-quality expert-labeled datasets as a core requirement rather than an afterthought.
  • Systems must include built-in mechanisms that allow clinicians to inspect and understand the reasoning behind each classification.
  • Evaluation protocols should replace or supplement standard accuracy measures with metrics that directly reflect diagnostic impact on patient care.
  • Deployment protocols must specify how responsibility is divided between the AI output and the human clinician to reduce automation bias.
  • In time-sensitive diagnostic workflows such as bone marrow analysis, correctness assessments need to include checks for over-reliance on automated suggestions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Regulatory bodies evaluating medical AI could require evidence across all four dimensions rather than benchmark scores alone before granting approval.
  • The same multi-dimensional lens could be applied to AI tools in other high-stakes fields such as radiology or pathology to identify similar hidden failure modes.
  • Training programs for clinicians using AI might need explicit modules on recognizing and countering automation bias in addition to technical literacy.
  • Future dataset curation efforts for medical imaging could prioritize stability and expert consensus metrics as first-class design goals.

Load-bearing premise

That the four themes of unstable labels, opaque outputs, mismatched metrics, and automation bias comprehensively define correctness, and that lessons from the plasma cell task generalize to medical AI as a whole.

What would settle it

A controlled clinical deployment in which a plasma cell classifier produced reliable diagnoses despite unstable expert labels, non-explainable outputs, metrics that do not track clinical outcomes, and documented automation bias among its users would show that the multi-dimensional account is not necessary.

read the original abstract

This paper examines what it means for a medical AI system to be right by grounding the question in a specific clinical context: the automatic classification of plasma cells in digitized bone marrow smears for the diagnosis of multiple myeloma. Drawing on philosophy of science and research ethics, the paper argues that correctness in medical AI is not a singular property reducible to benchmark performance, but a multi-dimensional concept involving the availability of expertly labeled medical datasets, the explainability and interpretability of model outputs, the clinical meaningfulness of evaluation metrics, and the distribution of accountability in human-AI workflows. As such, the paper develops this argument through four interrelated themes: the instability of ground truth labels, the opacity of overconfident AI, the inadequacy of standard clinical metrics, and the risk of automation bias in time-pressured clinical settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The manuscript argues that correctness in medical AI is not reducible to benchmark performance but is instead a multi-dimensional concept. Grounded in the specific task of automatic plasma cell classification in digitized bone marrow smears for multiple myeloma diagnosis, it develops four interrelated themes—the instability of ground truth labels, the opacity of overconfident AI, the inadequacy of standard clinical metrics, and the risk of automation bias—drawing on philosophy of science and research ethics to emphasize expert-labeled datasets, explainability, clinically meaningful metrics, and accountability in human-AI workflows.

Significance. If the argument holds, the paper offers a useful conceptual framework for moving beyond narrow performance metrics in medical AI evaluation. By linking philosophical considerations to a concrete clinical example, it could help guide more responsible design and deployment of AI tools in diagnostic settings, particularly where human oversight and ethical accountability are central.

major comments (1)
  1. The central claim that the four themes comprehensively characterize correctness in medical AI rests on extrapolation from a single plasma-cell classification case study. Without explicit discussion of how (or whether) these themes manifest in other medical imaging domains—such as radiology tumor detection or digital pathology slide analysis—the generalizability of the multi-dimensional framework remains under-supported.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive comments on our manuscript. We are pleased to provide the following point-by-point response and will incorporate revisions as outlined.

read point-by-point responses
  1. Referee: The central claim that the four themes comprehensively characterize correctness in medical AI rests on extrapolation from a single plasma-cell classification case study. Without explicit discussion of how (or whether) these themes manifest in other medical imaging domains—such as radiology tumor detection or digital pathology slide analysis—the generalizability of the multi-dimensional framework remains under-supported.

    Authors: We appreciate the referee's observation regarding the scope of our case study. The manuscript deliberately focuses on the plasma cell classification task in bone marrow smears as a detailed, real-world example to ground the philosophical and ethical analysis, allowing for a thorough examination of the four themes in a specific clinical workflow. However, we agree that to better support the broader applicability of the multi-dimensional framework, additional discussion is warranted. In the revised version, we will include a dedicated paragraph in the Discussion section that addresses how each of the four themes—ground truth instability, model opacity, metric inadequacy, and automation bias—can manifest in other domains such as radiology (e.g., tumor detection in CT scans) and digital pathology (e.g., slide analysis for cancer diagnosis). This will draw on existing literature to illustrate parallels without overgeneralizing from our single case. We believe this addition will strengthen the paper's contribution while maintaining its focus.

    revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper advances a conceptual, philosophical, and ethical argument that correctness in medical AI is multi-dimensional rather than reducible to benchmark performance. It draws on four themes illustrated by one plasma-cell classification case but contains no equations, derivations, fitted parameters, quantitative predictions, or formal deductive steps. No load-bearing premise reduces to a self-citation, self-definition, or renamed input; the discussion is self-contained interpretive analysis grounded in external philosophy of science and research ethics.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on a single domain assumption from philosophy and ethics that the listed dimensions are essential to correctness; beyond that assumption, no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption Correctness in medical AI requires consideration of multiple dimensions beyond performance metrics.
    This is the foundational premise grounding the four themes in the abstract.

pith-pipeline@v0.9.0 · 5425 in / 1268 out tokens · 138948 ms · 2026-05-13T05:34:20.621956+00:00 · methodology


Reference graph

Works this paper leans on

13 extracted references · 13 canonical work pages

  1. Curtis G. Northcutt, Lu Jiang, and Isaac L. Chuang. Confident learning: Estimating uncertainty in dataset labels. Journal of Artificial Intelligence Research, 70:1373–1411, 2021.

  2. Joann G. Elmore, Gary M. Longton, Patricia A. Carney, et al. Diagnostic concordance among pathologists interpreting breast biopsy specimens. JAMA, 313(11):1122–1132, 2015.

  3. Weijie Li. The 5th edition of the World Health Organization classification of hematolymphoid tumors. In Leukemia [Internet]. StatPearls Publishing / NCBI Bookshelf, August 2022.

  4. Ludwik Fleck. Genesis and Development of a Scientific Fact. University of Chicago Press, 1979. Originally published in 1935.

  5. European Parliament and Council of the European Union. Regulation (EU) 2024/1689 of the European Parliament and of the Council of 13 June 2024 laying down harmonised rules on artificial intelligence (Artificial Intelligence Act). https://eur-lex.europa.eu/eli/reg/2024/1689/oj, 2024. Official Journal of the European Union, L 2024/1689, 12 July 2024.

  6. Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine Learning (ICML), pages 1321–.

  7. Jeremy Nixon, Michael W. Dusenberry, Linchuan Zhang, Ghassen Jerfel, and Dustin Tran. Measuring calibration in deep learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2019.

  8. Christopher J. Kelly, Alan Karthikesalingam, Mustafa Suleyman, Greg Corrado, and Dominic King. Second opinion needed: communicating uncertainty in medical machine learning. Nature Medicine, 27:203–209, 2021.

  9. World Health Organization. Ethics and governance of artificial intelligence for health: Guidance on large multi-modal models. World Health Organization, Geneva, 2024.

  10. DECIDE-AI Expert Group. Reporting guideline for the early-stage clinical evaluation of decision support systems driven by artificial intelligence: DECIDE-AI. Nature Medicine, 28(5):924–933, 2022.

  11. Karl R. Popper. The Logic of Scientific Discovery. Hutchinson, London, 1959.

  12. Xiaoxuan Liu, Steven C. Rivera, David Moher, Melanie J. Calvert, Alastair K. Denniston, SPIRIT-AI and CONSORT-AI Working Group. Reporting guidelines for clinical trial reports for interventions involving artificial intelligence: the CONSORT-AI extension. Nature Medicine, 26(9):1364–1374, 2020.

  13. Steven C. Rivera, Xiaoxuan Liu, Adrian W. Chan, Alastair K. Denniston, Melanie J. Calvert, SPIRIT-AI and CONSORT-AI Working Group. Guidelines for clinical trial protocols for interventions involving artificial intelligence: the SPIRIT-AI extension. Nature Medicine, 26(9):1351–1363, 2020.