Auditing Multimodal LLM Raters: Central Tendency Bias in Clinical Ordinal Scoring

2026 · cs.CV · arXiv 2605.16386

1 Pith paper cites this work. Polarity classification is still indexing.


abstract

Multimodal large language models (LLMs) are increasingly explored as automated evaluators in clinical settings, yet their scoring behavior on ordinal clinical scales remains poorly understood. We benchmark three frontier LLM families against supervised deep learning models for scoring Clock Drawing Test (CDT) images on two public datasets using the Shulman rubric. While fully fine-tuned Vision Transformers achieve the best calibration (MAE 0.52, within-1 accuracy 91%), zero-shot LLMs remain competitive on tolerance-based agreement despite higher absolute error (GPT-5: MAE 0.67, within-1 accuracy 92%). However, per-score analysis reveals that all three LLM families exhibit a pronounced central tendency effect (endpoint compression): predictions are systematically pulled toward the middle of the scale, with over-prediction at the low end (true score 0 rated as 1) and under-prediction at the high end (true score 5 rated as 4). This effect disproportionately affects the clinically critical extremes, where accurate scoring most impacts screening decisions for cognitive impairment. Targeted ablations show that neither few-shot exemplars spanning the full score range nor removing clinical terminology from the prompt eliminates the effect. Our findings extend the LLM-as-a-judge bias literature from NLP evaluation to clinical assessment, and highlight the need for calibration-aware evaluation and post-hoc calibration before deploying LLM-based raters in high-stakes screening workflows.
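
The abstract's point that aggregate metrics can hide endpoint compression can be illustrated with a minimal Python sketch. It is not the authors' evaluation code: the function names (`mae`, `within_k_accuracy`, `per_score_bias`) are hypothetical, the 0 to 5 score range is taken from the abstract's description of the scale endpoints, and the scores are made-up toy values rather than results from the paper.

```python
from collections import defaultdict


def mae(y_true, y_pred):
    """Mean absolute error between true and predicted ordinal scores."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def within_k_accuracy(y_true, y_pred, k=1):
    """Fraction of predictions that land within k points of the true score."""
    return sum(abs(t - p) <= k for t, p in zip(y_true, y_pred)) / len(y_true)


def per_score_bias(y_true, y_pred):
    """Mean signed error (predicted minus true), grouped by true score.

    Positive values at the low end together with negative values at the
    high end are the signature of central tendency (endpoint compression).
    """
    errors = defaultdict(list)
    for t, p in zip(y_true, y_pred):
        errors[t].append(p - t)
    return {score: sum(e) / len(e) for score, e in sorted(errors.items())}


# Toy data (assumed for illustration only): a rater that is exact in the
# middle of the 0-5 scale but pulls both endpoints inward by one point.
y_true = [0, 0, 1, 2, 3, 4, 5, 5]
y_pred = [1, 1, 1, 2, 3, 4, 4, 4]

print("MAE:", mae(y_true, y_pred))
print("within-1 accuracy:", within_k_accuracy(y_true, y_pred, k=1))
print("per-score bias:", per_score_bias(y_true, y_pred))
```

In this toy case the within-1 accuracy is perfect even though every endpoint case is scored wrongly, which is why a per-score breakdown is needed to surface the effect alongside the aggregate numbers.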

representative citing papers

Risk Stratification for ICU Delirium using Pervasive Ambient Sensing Information

cs.LG · 2026-06-17 · unverdicted · novelty 4.0

Ambient sound and light data from ICU rooms predict delirium with AUC 0.80 using convolutional neural networks, with sound as the dominant predictor, on data from 309 patients across 9 ICUs.

