AI Rater Discrimination Depends on Scoring Protocol in Complex Clinical Decision-Making

Kyunga Kim; Kyu Yeon Hur; Sangwon Baek

arxiv: 2606.03198 · v1 · pith:QTMP5PRUnew · submitted 2026-06-02 · 💻 cs.CL · cs.AI

Pith reviewed 2026-06-28 10:45 UTC · model grok-4.3

keywords: AI raters · clinical decision support · rubric anchoring · scoring protocol · type 2 diabetes · pharmacotherapy · LLM evaluation · linear mixed effects models

The pith

Rubric-anchored scoring lets AI raters discriminate between clinical decision outputs while rubric-free scoring collapses them into uniformly high marks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests how AI raters score outputs from clinical decision support systems on seven questions about type 2 diabetes pharmacotherapy at 12-month follow-up. It compares a protocol that supplies patient-specific rubrics against one that supplies none, using linear mixed effects models to isolate the effect of the scoring protocol across multiple design factors. Rubric-anchored scoring produces lower average scores, wider score spreads, and stronger separation between document-referenced and baseline outputs. Rubric-free scoring yields high scores in a narrow band and masks differences both between outputs and between rater models. The authors conclude that rubric anchoring is required when evaluation criteria depend on patient or jurisdiction details that the rater models cannot recover from their training data alone.

Core claim

Across all seven questions, mean scores under the Gold Rubric protocol were 7.69 to 49.64 points lower than under the Non-Gold-Rubric protocol, and interquartile ranges were 1.68 to 3.67 times wider. Within each question, the anchored protocol increased discrimination between document-referenced generation and baseline outputs by factors of 1.76 to 5.10, and it exposed substantial variation across rater models that the unanchored protocol suppressed.
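To make the two summary quantities in the core claim concrete, here is a minimal Python sketch of how a mean shift and an IQR ratio between protocols could be computed for one question. The score lists are made-up illustrative values on an assumed 0-100 scale, not data from the paper, and the function names `iqr` and `protocol_contrast` are hypothetical.

```python
from statistics import mean, quantiles


def iqr(scores):
    """Interquartile range using the inclusive quartile method."""
    q1, _, q3 = quantiles(scores, n=4, method="inclusive")
    return q3 - q1


def protocol_contrast(non_gr, gr):
    """Return (mean drop under GR, IQR ratio GR / Non-GR) for one question."""
    mean_drop = mean(non_gr) - mean(gr)
    iqr_ratio = iqr(gr) / iqr(non_gr)
    return mean_drop, iqr_ratio


# Illustrative scores only (0-100 scale assumed); not taken from the paper.
non_gr_scores = [72, 74, 75, 76, 76, 77, 78, 80]
gr_scores = [35, 48, 55, 60, 66, 70, 74, 82]

drop, ratio = protocol_contrast(non_gr_scores, gr_scores)
print(f"mean drop under GR: {drop:.2f} points")
print(f"IQR ratio (GR / Non-GR): {ratio:.2f}")
```

A positive drop together with a ratio above 1 is the pattern the paper reports: lower and more spread-out scores under the anchored protocol.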

What carries the argument

The factorial comparison of Gold Rubric (patient-specific rubric supplied) versus Non-Gold-Rubric (no rubric supplied) scoring protocols, analyzed via linear mixed effects models that cross the protocol factor with CDSS model, CDSS prompt configuration, rater model, prompt character, and prompt type.
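One way to write down a model of this kind, as a sketch only. The fixed-effect structure follows the five factors named in the abstract, and τ matches the symbol used in the Figure 2 caption. The interaction symbol γ, the protocol indicator P, and the single random intercept u are assumptions made for illustration, not the authors' specification.

```latex
y_{i} = \mu + \alpha\,P_{i}
      + \sum_{f=1}^{5} \tau_{f}^{\top} x_{f,i}
      + \sum_{f=1}^{5} \gamma_{f}^{\top} \left( P_{i}\, x_{f,i} \right)
      + u_{c(i)} + \varepsilon_{i},
\qquad
u_{c} \sim \mathcal{N}(0, \sigma_{u}^{2}), \quad
\varepsilon_{i} \sim \mathcal{N}(0, \sigma^{2})
```

Here $P_{i}$ is 1 for a GR score and 0 for a Non-GR score, $x_{f,i}$ holds the dummy-coded non-reference levels of factor $f$, and $c(i)$ indexes a hypothetical grouping unit such as the patient case. Under this reading, $\tau_{f}$ is the factor's effect under Non-GR and $\gamma_{f}$ measures how much the protocol changes that effect.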

If this is right

Gold Rubric scoring produces consistently lower and more variable scores than Non-Gold-Rubric scoring across all tested questions.
Gold Rubric scoring amplifies separation between document-referenced and baseline CDSS outputs by 1.76 to 5.10 times.
Gold Rubric scoring reveals behavioral differences among rater models that Non-Gold-Rubric scoring suppresses.
Rubric-free scoring cannot serve as a substitute when criteria depend on patient- or jurisdiction-specific information.
The pattern holds when the same LLMs act simultaneously as both clinical decision support systems and raters.
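The amplification figures in the list above can be read as a ratio of gaps. A minimal sketch, assuming the amplification ratio is the DRG-minus-Baseline mean gap under GR divided by the same gap under Non-GR. The paper's exact definition sits in its Appendix G.5, which is not reproduced here, so this is an illustrative reading, and the cell means are invented.

```python
def amplification_ratio(means):
    """Ratio of the DRG-Baseline gap under GR to the gap under Non-GR.

    `means` maps (prompt_config, protocol) -> estimated mean score.
    Returns None when the Non-GR gap is zero, where the ratio is undefined.
    """
    gap_non_gr = means[("DRG", "Non-GR")] - means[("Baseline", "Non-GR")]
    gap_gr = means[("DRG", "GR")] - means[("Baseline", "GR")]
    if gap_non_gr == 0:
        return None
    return gap_gr / gap_non_gr


# Invented cell means for a single question (not from the paper).
example_means = {
    ("Baseline", "Non-GR"): 74.0,
    ("DRG", "Non-GR"): 78.0,
    ("Baseline", "GR"): 40.0,
    ("DRG", "GR"): 52.0,
}

ar = amplification_ratio(example_means)
print(f"amplification ratio: {ar:.2f}")  # gaps of 12 vs 4 give 3.00
```

A ratio above 1 means the anchored protocol widens the separation between DRG and Baseline outputs. A ratio near zero in the denominator makes the quantity unstable, which is why the sketch returns None for an exact zero gap.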

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Evaluation pipelines for other clinical domains that rely on jurisdiction-specific guidelines will likely need anchored rubrics rather than relying on rater model knowledge.
Creating reusable patient-specific rubric templates could reduce the cost of maintaining discriminative power when scaling AI rater use.
Future work could test whether hybrid protocols that supply partial rubrics recover some but not all of the discrimination lost in fully rubric-free scoring.

Load-bearing premise

The seven evaluation questions require patient-specific or jurisdiction-specific criteria that the rater models cannot recover from their parametric knowledge alone.

What would settle it

A new set of clinical questions whose criteria can be answered from general medical knowledge alone, scored under both protocols, would falsify the claim if the Non-Gold-Rubric protocol produced discrimination equal to or greater than the Gold Rubric protocol.

Figures

Figures reproduced from arXiv: 2606.03198 by Kyunga Kim, Kyu Yeon Hur, Sangwon Baek.

**Figure 1.** Score distribution under Non-GR and GR protocols by evaluation question. Paired violin plot per question (Q1–Q7): Non-GR (left, light blue) vs. GR (right, dark navy). White diamonds mark estimated mean scores, box overlays mark median and IQR, and dotted reference lines at 20 and 80 demarcate the floor and ceiling bands. Per-question Kolmogorov–Smirnov statistic, p-value, and change in ceiling-band (≥ 80) f…

**Figure 2.** Pooled main-effect estimates for the five non-protocol fixed-effect factors. Forest plot of τ coefficients showing each non-reference level relative to its reference (Baseline, Qwen-3.5, Lenient, Normal): diamonds mark point estimates; horizontal bars mark 95% Wald CIs; the vertical dashed line marks τ = 0. Bold rows at the top of each factor band report the factor-level Type 3 Wald joint test. CDSS prompt…

**Figure 3.** Protocol × factor interaction heatmap across the seven evaluation questions. Each cell shows the per-question Type 3 Wald χ²/df statistic (top) and the per-question Type 3 joint Wald p-value (bottom); cell color encodes χ²/df on a log scale. Row labels include each factor's degrees of freedom and its Fisher-combined p-value across Q1–Q7. The scoring protocol moderates every non-protocol factor's effect; …

**Figure 4.** Per-level decomposition of CDSS-side protocol × factor interactions. Four panels: (a) CDSS prompt configuration (DRG vs. Baseline); (b)–(d) CDSS model (GLM-5, Gemma-4, Nemotron each vs. Qwen-3.5). Each panel shows per-question simple effects (level vs. reference) under Non-GR (navy) and GR (maroon). The bracket above each bar pair reports the per-question amplification ratio (AR; Appendix G.5), color-coded…

**Figure 5.** Estimated mean score by evaluation question, CDSS prompt configuration stratum, and scoring protocol. Per-question bars from the stratified LMM (Appendix G.5), four bars per question: Baseline×Non-GR, Baseline×GR, DRG×Non-GR, DRG×GR. Under Non-GR the DRG–Baseline gap is narrow and roughly constant across questions; under GR the gap widens substantially in every question except Q5, where DRG cells were alrea…

**Figure 6.** Per-level decomposition of rater-side protocol × factor interactions. Five panels in two rows: top, (a)–(c) rater model (GLM-5, Gemma-4, Nemotron each vs. Qwen-3.5); bottom, (d)–(e) prompt character (Moderate, Strict each vs. Lenient). Each panel shows per-question simple effects (level vs. reference) under Non-GR (navy) and GR (maroon). The bracket above each bar pair reports the per-question amplification…

**Figure 7.** Per-level decomposition of the protocol × prompt type interaction. Two panels: (a) CoT vs. Normal; (b) Self-Consistency vs. Normal. Each panel shows per-question simple effects (level vs. reference) under Non-GR (navy) and GR (maroon); the bracket above each bar pair reports the per-question amplification ratio (AR; Appendix G.5), all of which fall in the unstable regime (gray dashed). The protocol × promp…

**Figure 8.** Per-rater behavioral profile under each scoring protocol. Five panels (a–e) compare the four LLMs in their AI rater role (Qwen-3.5, GLM-5, Gemma-4, Nemotron) along five behavioral dimensions: (a) mean score, (b) run-to-run reproducibility, (c) responsiveness to prompt character, (d) responsiveness to prompt type, (e) self-preference effect. Each rater is shown with two bars per dimension: Non-GR in light bl…

**Figure 9.** Per-rater mean score across prompt character levels under each scoring protocol. Line plot of mean score by AI rater (Qwen-3.5, GLM-5, Gemma-4, Nemotron) across the three prompt character levels (Lenient, Moderate, Strict) under (a) Non-GR and (b) GR. The Lenient–Strict score gap amplifies under GR for every rater (Non-GR 1.3–3.5 points to GR 4.2–8.2 points), indicating that prompt-level calibration direct…

**Figure 10.** Per-rater mean score across prompt type levels under each scoring protocol. Line plot of mean score by AI rater across the three prompt type levels (Normal, CoT, Self-Consistency) under (a) Non-GR and (b) GR; rater marker mapping matches …

**Figure 11.** Mean score by rater × CDSS model × scoring protocol. Two 4 × 4 heatmaps (Non-GR left, GR right): rows index the AI rater (judge); columns index the CDSS model (generator). Boxed cells mark self-pairs (rater scoring its own CDSS output). Each rater's row varies by less than 7 points across CDSS columns within each protocol, indicating that the per-rater protocol-induced shifts are properties of the rater i…

**Figure 12.** Position-bias diagnostics for prompt order. Two heatmaps of per-question τ estimates: (a) Non-GR main effect (Order B vs. A; Order C vs. A); (b) protocol × prompt-order interaction. Boxed cells exceed the pre-specified threshold |τ| ≥ 2.0. Prompt-section ordering is negligible under Non-GR (panel a, largest |τ| = 2.11); under GR a small but controllable order interaction emerges (panel b, largest |τ| =…
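The Figure 3 caption mentions Fisher-combined p-values across Q1–Q7. As a reminder of how Fisher's method works, here is a small standard-library sketch. The seven p-values are placeholders, not the paper's results, and `fisher_combine` is an illustrative name. Fisher's method assumes the combined tests are independent, which may hold only approximately when the same raters score every question.

```python
import math


def fisher_combine(p_values):
    """Combine independent p-values with Fisher's method.

    The statistic X = -2 * sum(ln p_i) follows a chi-square distribution
    with 2k degrees of freedom under the joint null. For even degrees of
    freedom the survival function has a closed form, so no SciPy is needed.
    """
    k = len(p_values)
    stat = -2.0 * sum(math.log(p) for p in p_values)
    half = stat / 2.0
    # P(chi2_{2k} > stat) = exp(-half) * sum_{i<k} half**i / i!
    tail = math.exp(-half) * sum(half**i / math.factorial(i) for i in range(k))
    return stat, tail


# Placeholder per-question p-values (Q1..Q7); not from the paper.
per_question_p = [0.04, 0.20, 0.01, 0.15, 0.30, 0.03, 0.08]
stat, combined_p = fisher_combine(per_question_p)
print(f"X = {stat:.2f} on {2 * len(per_question_p)} df, combined p = {combined_p:.4g}")
```

With a single input the combined p-value equals that input, which is a quick sanity check on the closed-form tail.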

Original abstract

Clinical AI evaluation increasingly delegates scoring to large language models (LLMs) acting as AI raters, yet their scoring behavior across evaluation conditions has not been quantitatively characterized. We address this gap through a factorial study of AI rater behavior in adult type 2 diabetes (T2D) pharmacotherapy at 12-month outpatient follow-up, a clinical task involving complex decision-making operationalized across seven evaluation questions. Four open-source LLMs served simultaneously as clinical decision support system (CDSS) models and AI raters. Each CDSS output was scored under two scoring protocols: a rubric-anchored Gold Rubric (GR) protocol incorporating a patient-specific rubric, and a rubric-free Non Gold Rubric (Non-GR) protocol. Linear mixed effects models crossed the scoring protocol factor with five design factors – CDSS model, CDSS prompt configuration (document-referenced generation [DRG] vs. Baseline), rater model, prompt character, and prompt type – and estimated main effects together with their protocol interactions. Across all questions, AI raters yielded consistently higher scores within a very narrow range (74–78 points on average) under Non-GR compared to those under GR (7.69 to 49.64 points lower mean scores; 1.68 to 3.67 times wider interquartile ranges). Within each question, GR amplified the AI rater's discrimination between DRG and Baseline CDSS outputs by factors of 1.76 to 5.10, while also revealing substantial behavioral variation across rater models that Non-GR suppressed. These findings support rubric anchoring as the scoring protocol that preserves discriminative power in clinical AI evaluation; rubric-free scoring cannot substitute when questions require patient-specific or jurisdiction-specific criteria that rater models cannot infer from parametric knowledge alone.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Rubric-anchored scoring boosts AI rater discrimination in this T2D task by 1.76-5.10x over rubric-free, but the paper gives no data or code to check the numbers and the causal claim about model knowledge is untested.

The core finding is that a patient-specific gold rubric makes the four LLMs separate DRG from baseline CDSS outputs more clearly, while rubric-free scoring squeezes scores into a tight 74-78 band and hides model differences. The factorial design crossed scoring protocol with CDSS model, prompt configuration, rater model, prompt character, and prompt type, then fit linear mixed effects models to estimate the interactions. That setup is a reasonable way to quantify protocol effects in one clinical area.
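To give a sense of the size of such a crossed design, the sketch below enumerates the cells implied by the factor levels named in the figure captions. Treating the design as fully crossed is an assumption, and the number of patient cases and repeated runs per cell is not stated in the material here, so it is left out. The dictionary keys and the `design_cells` helper are illustrative names.

```python
from itertools import product

# Factor levels as named in the figure captions; reference level listed first.
FACTORS = {
    "protocol": ["Non-GR", "GR"],
    "cdss_model": ["Qwen-3.5", "GLM-5", "Gemma-4", "Nemotron"],
    "cdss_prompt_config": ["Baseline", "DRG"],
    "rater_model": ["Qwen-3.5", "GLM-5", "Gemma-4", "Nemotron"],
    "prompt_character": ["Lenient", "Moderate", "Strict"],
    "prompt_type": ["Normal", "CoT", "Self-Consistency"],
}


def design_cells(factors):
    """Yield one dict per cell of the fully crossed design."""
    names = list(factors)
    for combo in product(*(factors[name] for name in names)):
        yield dict(zip(names, combo))


cells = list(design_cells(FACTORS))
# Cells where a rater scores output from the same underlying model.
self_pairs = [c for c in cells if c["cdss_model"] == c["rater_model"]]

print(len(cells))       # 2 * 4 * 2 * 4 * 3 * 3 = 576 cells per question
print(len(self_pairs))  # 2 * 4 * 2 * 3 * 3 = 144 self-pair cells
```

Listing the cells this way also makes the self-pair subset explicit, which is the subset relevant to the self-preference effect mentioned in the Figure 8 caption.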

What stands out is the consistent pattern across seven questions: GR widens the interquartile ranges and amplifies discrimination. The numbers on amplification (1.76 to 5.10) and score drops (7.69 to 49.64 points) are the concrete result.

The gaps are straightforward. The abstract supplies the effect sizes but no tables, model versions, exact rubric text, code, or data. Without those, the claimed factors cannot be reproduced. More importantly, the interpretation that Non-GR fails because raters cannot recover patient- or jurisdiction-specific criteria from parametric knowledge rests on the score gap itself. There is no separate probe asking the models to state those criteria independently, so prompt length, anchoring, or general leniency remain possible explanations. The LME models do not include a knowledge-probe factor.

This work is aimed at people building or auditing clinical AI evaluation pipelines. A reader already working on rubric design or rater reliability would get a usable data point on protocol sensitivity. It is not yet ready for citation because the supporting materials are missing.

I would send it to peer review once the authors release the full statistical output, the rubrics, and the generation code. The design is sound enough to merit referee time even if the knowledge claim needs tightening.

Referee Report

3 major / 2 minor

Summary. The paper reports a factorial empirical study of four open-source LLMs used simultaneously as CDSS generators and AI raters on a type-2-diabetes pharmacotherapy task. Each CDSS output is scored under a Gold-Rubric (GR) protocol that supplies an explicit patient-specific rubric versus a rubric-free Non-GR protocol. Linear mixed-effects models crossing scoring protocol with CDSS model, prompt configuration (DRG vs. Baseline), rater model, and other factors show that Non-GR produces higher mean scores (74–78) in a narrow range while GR lowers means by 7.69–49.64 points, widens interquartile ranges 1.68–3.67×, and amplifies discrimination between DRG and Baseline outputs by factors of 1.76–5.10. The authors conclude that rubric anchoring is required to preserve discriminative power when evaluation questions involve patient- or jurisdiction-specific criteria that rater models cannot recover from parametric knowledge alone.

Significance. If the reported protocol-by-prompt interactions replicate, the work supplies concrete quantitative evidence that rubric-free LLM scoring can compress variance and mask differences that rubric-anchored scoring reveals, with direct consequences for how clinical AI systems are benchmarked. The use of the same LLMs in both generation and rating roles and the crossing of five design factors are strengths that allow isolation of protocol effects.

major comments (3)

[Abstract] Abstract (final sentence) and study design: the claim that score compression under Non-GR arises because 'rater models cannot infer from parametric knowledge alone' the patient- or jurisdiction-specific criteria is inferred from the GR vs. Non-GR contrast rather than directly tested. No control condition (e.g., a knowledge-probe prompt asking the same four LLMs to generate or apply the exact rubric criteria independently of the scoring task) is described, leaving open alternative explanations such as prompt-length effects, scale-anchoring differences, or general leniency in open-ended scoring.
[Abstract] Abstract and Methods (implied): effect sizes, amplification factors (1.76–5.10), and interaction terms are stated without accompanying raw data, model versions, exact rubric text, full statistical output tables, or code, so the reported numbers cannot be verified or sensitivity-checked.
[Methods] Linear mixed-effects analysis: the models include protocol interactions with CDSS prompt configuration and rater model but, per the skeptic note, omit any knowledge-probe factor that would isolate the parametric-knowledge assumption; this omission makes the causal attribution to 'inability to infer patient-specific criteria' load-bearing yet untested.

minor comments (2)

The seven evaluation questions are referenced but never listed or characterized with respect to which criteria are patient- versus jurisdiction-specific; adding an explicit table would clarify the scope of the claim.
Notation for the two prompt configurations (DRG vs. Baseline) and the five crossed factors should be defined once in a table or equation for readability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major comment point-by-point below, acknowledging where the design leaves certain interpretations inferential rather than directly tested. We propose targeted revisions to clarify claims and improve reproducibility.

Referee: [Abstract] Abstract (final sentence) and study design: the claim that score compression under Non-GR arises because 'rater models cannot infer from parametric knowledge alone' the patient- or jurisdiction-specific criteria is inferred from the GR vs. Non-GR contrast rather than directly tested. No control condition (e.g., a knowledge-probe prompt asking the same four LLMs to generate or apply the exact rubric criteria independently of the scoring task) is described, leaving open alternative explanations such as prompt-length effects, scale-anchoring differences, or general leniency in open-ended scoring.

Authors: We agree that the interpretation of why Non-GR compresses scores is inferred from the protocol contrast rather than isolated by a dedicated knowledge-probe condition. The GR vs. Non-GR manipulation directly tests the effect of rubric anchoring, and the resulting differences in variance and discrimination between DRG and Baseline outputs support the role of explicit criteria. However, we acknowledge that alternatives such as prompt length or general leniency remain possible. We will revise the abstract's final sentence and the discussion to frame the parametric-knowledge explanation as a supported interpretation rather than a definitive causal claim, and we will add an explicit limitation noting the absence of a knowledge-probe control.
revision: partial

Referee: [Abstract] Abstract and Methods (implied): effect sizes, amplification factors (1.76–5.10), and interaction terms are stated without accompanying raw data, model versions, exact rubric text, full statistical output tables, or code, so the reported numbers cannot be verified or sensitivity-checked.

Authors: We accept this point. The submitted manuscript reports summary statistics and model-derived quantities without the underlying data, full model outputs, exact rubric wording, or analysis code. In the revision we will add these materials as supplementary files, specify all model versions and hyperparameters, include the complete rubric text, and provide a public repository link containing the raw scores, LME model code, and sensitivity analyses so that the reported effect sizes and interactions can be independently verified.
revision: yes

Referee: [Methods] Linear mixed-effects analysis: the models include protocol interactions with CDSS prompt configuration and rater model but, per the skeptic note, omit any knowledge-probe factor that would isolate the parametric-knowledge assumption; this omission makes the causal attribution to 'inability to infer patient-specific criteria' load-bearing yet untested.

Authors: The LME specification was chosen to estimate the protocol-by-prompt-configuration and protocol-by-rater interactions that were the primary scientific targets. We agree that a knowledge-probe factor would provide a more direct test of the parametric-knowledge hypothesis. The current design nevertheless isolates the effect of rubric provision itself. We will revise the Methods and Discussion sections to state the limitation explicitly and to qualify the causal language around parametric knowledge, while retaining the observed protocol effects as the core empirical result.
revision: partial

Circularity Check

0 steps flagged

No circularity: fully empirical comparison of scoring protocols

The paper conducts a factorial experiment scoring CDSS outputs from four LLMs under GR vs Non-GR protocols, then fits linear mixed-effects models to estimate protocol main effects and interactions with design factors. All reported quantities (mean score differences, IQR ratios, discrimination amplification factors 1.76-5.10) are direct statistical summaries of the observed data rather than any derivation that reduces to its own inputs by construction. No equations, ansatzes, uniqueness theorems, or self-citations are invoked as load-bearing premises; the interpretation that Non-GR cannot substitute for patient-specific criteria is presented as an inference from the empirical pattern, not a definitional or fitted tautology. The study is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on standard assumptions of linear mixed effects models for crossed factors and on the representativeness of the four open-source LLMs chosen for the task.

axioms (1)

domain assumption: Linear mixed effects model assumptions hold for the crossed design of scoring protocol with CDSS model, prompt configuration, rater model, prompt character, and prompt type.
Paper uses LME to estimate main effects and protocol interactions across all questions.

pith-pipeline@v0.9.1-grok · 5861 in / 1233 out tokens · 73620 ms · 2026-06-28T10:45:21.986882+00:00 · methodology


[1] [1]

(2023) Interobserver Variability Studies in Diagnostic Imaging: A Methodological Systematic Review.British Journal of Radiology, 96(1148), 20220972

Quinn, L., et al. (2023) Interobserver Variability Studies in Diagnostic Imaging: A Methodological Systematic Review.British Journal of Radiology, 96(1148), 20220972

2023

[2] [2]

L., et al

Di Forti, C. L., et al. (2025) Inter-Rater Reliability of Psychiatric Diagnosis: A Systematic Review and Meta-Analysis.European Psychiatry, 68(suppl. 1), S191–S192

2025

[3] [3]

S., et al

Tawfik, D. S., et al. (2018) Physician Burnout, Well-being, and Work Unit Safety Grades in Relationship to Reported Medical Errors.Mayo Clinic Proceedings, 93(11), 1571–1580

2018

[4] [4]

(2026) Holistic Evaluation of Large Language Models for Medical Tasks with MedHELM

Bedi, S., et al. (2026) Holistic Evaluation of Large Language Models for Medical Tasks with MedHELM. Nature Medicine, 32(3), 943–951

2026

[5] [5]

HealthBench: Evaluating Large Language Models Towards Improved Human Health

Arora, R. K., et al. (2025) HealthBench: Evaluating Large Language Models Towards Improved Human Health. arXiv:2505.08775

work page internal anchor Pith review Pith/arXiv arXiv 2025

[6] [6]

(2025) Medical Hallucination in Foundation Models and Their Impact on Healthcare

Kim, Y., et al. (2025) Medical Hallucination in Foundation Models and Their Impact on Healthcare. medRxiv:2025.02.28.25323115. 11

2025

[7] [7]

Omar, M., et al. (2025) Multi-Model Assurance Analysis Showing Large Language Models Are Highly Vulnerable to Adversarial Hallucination Attacks During Clinical Decision Support.Communications Medicine, 5, article 330

2025

[8] [8]

(2024) LLM Evaluators Recognize and Favor Their Own Generations.Advances in Neural Information Processing Systems, 37

Panickssery, A., et al. (2024) LLM Evaluators Recognize and Favor Their Own Generations.Advances in Neural Information Processing Systems, 37

2024

[9] [9]

(2024) Towards Understanding Sycophancy in Language Models.International Conference on Learning Representations

Sharma, M., et al. (2024) Towards Understanding Sycophancy in Language Models.International Conference on Learning Representations

2024

[10] [10]

Sclar, M., et al. (2024) Quantifying Language Models’ Sensitivity to Spurious Features in Prompt Design or: How I Learned to Start Worrying about Prompt Formatting.International Conference on Learning Representations

2024

[11]

Zheng, L., et al. (2023) Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. Advances in Neural Information Processing Systems, 36

[12]

Ye, J., et al. (2025) Justice or Prejudice? Quantifying Biases in LLM-as-a-Judge. International Conference on Learning Representations

[13]

Koo, R., et al. (2024) Benchmarking Cognitive Biases in Large Language Models as Evaluators. Findings of the Association for Computational Linguistics: ACL 2024, 517–545

[14]

Zhu, Z., et al. (2026) CyclicJudge: Mitigating Judge Bias Efficiently in LLM-based Evaluation. arXiv:2603.01865

[15]

Wang, P., et al. (2024) Large Language Models are not Fair Evaluators. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 9440–9450

[16]

Li, Z., et al. (2024) Split and Merge: Aligning Position Biases in LLM-based Evaluators. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 11084–11108

[17]

Nasser, W. (2026) Evaluative Fingerprints: Stable and Systematic Differences in LLM Evaluator Behavior. arXiv:2601.05114

[18]

Greenblatt, R., et al. (2024) Alignment Faking in Large Language Models. arXiv:2412.14093

[19]

Serapio-García, G., et al. (2025) A Psychometric Framework for Evaluating and Shaping Personality Traits in Large Language Models. Nature Machine Intelligence, 7, 1954–1968

[20]

Geathers, J., et al. (2025) Benchmarking Generative AI for Scoring Medical Student Interviews in Objective Structured Clinical Examinations (OSCEs). Artificial Intelligence in Education, Lecture Notes in Computer Science, 15879, Springer, 231–245

[21]

Saito, K., et al. (2023) Verbosity Bias in Preference Labeling by Large Language Models. Workshop on Instruction Tuning and Instruction Following at NeurIPS 2023

[22] American Diabetes Association Professional Practice Committee. (2026) 9. Pharmacologic Approaches to Glycemic Treatment: Standards of Care in Diabetes—2026. Diabetes Care, 49(suppl. 1), S183–S215. doi:10.2337/dc26-s009

[22]

Patterson, H. D., & Thompson, R. (1971) Recovery of Inter-Block Information When Block Sizes Are Unequal. Biometrika, 58(3), 545–554

[23]

Dunnett, C. W. (1955) A Multiple Comparison Procedure for Comparing Several Treatments with a Control. Journal of the American Statistical Association, 50(272), 1096–1121

[24]

Pinheiro, J. C., & Bates, D. M. (2000) Mixed-Effects Models in S and S-PLUS. Springer

[25]

Fisher, R. A. (1925) Statistical Methods for Research Workers. Oliver and Boyd

[26]

Card, D., & Krueger, A. B. (1994) Minimum Wages and Employment: A Case Study of the Fast-Food Industry in New Jersey and Pennsylvania. American Economic Review, 84(4), 772–793

[27]

Massey, F. J. Jr. (1951) The Kolmogorov-Smirnov Test for Goodness of Fit. Journal of the American Statistical Association, 46(253), 68–78

[28]

Ministry of Health and Welfare. (2026) Detailed Criteria and Methods for the Application of Health Insurance Benefits (Pharmaceutical): Partial Amendment. MOHW Notice No. 2026-24, Health Insurance Review & Assessment Service, 29 Jan. 2026, www.hira.or.kr.

A Glossary of Terms

Table 1: Glossary of study-specific terms.

Term Definition
AI rater LLM scoring ...