arxiv: 2606.28960 · v1 · pith:I2WXCCJOnew · submitted 2026-06-27 · 💻 cs.AI · q-bio.QM · stat.AP

Expert Evaluation of Clinical AI Tools on Real Point-of-Care Clinical Queries

Pith reviewed 2026-06-30 09:26 UTC · model grok-4.3

classification 💻 cs.AI · q-bio.QM · stat.AP
keywords clinical AI evaluation · point-of-care queries · expert physician judgment · specialized vs general models · Real-POCQi benchmark · blinded comparison · accuracy clinical utility

The pith

A specialized clinical AI tool outperforms three general-purpose models by 25 to 39 percentage points when physicians judge answers to real point-of-care questions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Most AI evaluations use hypothetical or exam-style questions, yet physicians pose millions of real clinical questions to AI tools each week. This paper tests answers to 620 actual queries submitted by physicians across 30 specialties, with 149 specialty-matched physicians making blinded head-to-head comparisons between a specialized tool and three frontier general-purpose models. The specialized tool scored highest on accuracy, clinical utility, source quality, verifiability, and completeness, with win differences (win rate minus loss rate) of 25 to 39 percentage points that held across multiple sensitivity analyses. Results were similar on an additional set of 187 HealthBench questions, and the real-world queries are released as a public benchmark. The work shows that real query distributions and expert judges can surface performance differences that matter for clinical decision support.

Core claim

On the Real-POCQi set of 620 real-world point-of-care queries, blinded specialty-matched physicians scored the specialized clinical tool highest across all five dimensions of clinical decision support, with win differences ranging from 25 to 39 percentage points over the general-purpose models; these margins remained consistent in sensitivity analyses and on the HealthBench questions.
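
The abstract defines a win difference as the margin between win and loss rates. A minimal Python sketch of that arithmetic follows. The function name `win_difference`, the outcome labels, and the choice to keep ties in the denominator are assumptions made for illustration, and the counts are invented rather than taken from the paper.

```python
from collections import Counter


def win_difference(outcomes):
    """Return win rate minus loss rate, in percentage points.

    `outcomes` is a list of "win", "loss", or "tie" labels recorded from
    the point of view of one tool in a head-to-head comparison.
    Assumption: ties stay in the denominator.
    """
    if not outcomes:
        raise ValueError("need at least one comparison")
    counts = Counter(outcomes)
    return 100.0 * (counts["win"] - counts["loss"]) / len(outcomes)


# Invented counts for illustration only (not from the paper).
example = ["win"] * 55 + ["loss"] * 25 + ["tie"] * 20
print(win_difference(example))  # 30.0
```

Under this definition a tool that wins 55% and loses 25% of comparisons has a win difference of 30 points, regardless of how the remaining 20% of ties are distributed.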

What carries the argument

Blinded head-to-head comparison of tool outputs on the Real-POCQi benchmark of real physician-submitted queries, scored by 149 specialty-matched practicing physicians.

If this is right

  • Evaluations of clinical AI should draw from real query distributions rather than hypothetical or exam-style questions.
  • Specialty-matched expert judges can detect larger performance gaps than general evaluators.
  • Targeted engineering and customization can produce measurable gains on dimensions such as source quality and verifiability.
  • LLM judges and expert judges reach similar top-model rankings even while differing systematically in their assessments.
  • The advantage of the specialized tool holds across checks for citation display, answer length, and query source.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Benchmarks built only on exam questions may miss the distribution shifts that matter most for actual clinical use.
  • Public release of Real-POCQi allows ongoing tracking of whether general models close the gap on verifiability and completeness over time.
  • If the observed margins persist in live deployment, hospitals may need to weigh specialization when selecting decision-support tools.

Load-bearing premise

The 620 queries and 149 physician graders form a representative and unbiased sample of real clinical needs and judgments, with blinding sufficient to prevent favoritism.

What would settle it

A larger replication using queries from additional clinical platforms or unblinded graders showing no significant win difference would undermine the reported margins.

Original abstract

Physicians now pose millions of clinical questions to AI tools each week, yet these tools are evaluated largely on hypothetical or exam-style questions, not those actually asked in practice. We report a blinded evaluation built on 620 Real-world Point-Of-Care Queries (Real-POCQi) submitted to the OpenEvidence (OE) platform by physicians spanning 30 specialties, as well as 187 questions from HealthBench. 149 practicing physicians across 36 states made head-to-head comparisons between answers from three frontier general-purpose models (Claude Opus 4.8, Gemini 3.1 Pro, and GPT-5.5) and a specialized clinical tool (OE), with graders matched to each question's specialty. When comparing answers along five dimensions relevant to clinical decision support -- accuracy, clinical utility, source quality, verifiability, & completeness -- physicians scored the specialized tool highest on all axes; in the primary analysis on Real-POCQi, win differences (margins between win and loss rates) ranged from 25 to 39 percentage points (p<0.001). Results remained consistent in sensitivity analyses stratifying by citation display, answer length, OE-user status, and Real-POCQi versus HealthBench. In parallel, LLM judges were found to systematically differ from expert judges, though both generally agreed on the best model. These findings underscore two conclusions: (i) AI tool evaluations should reflect real-world query distributions and use expert judges that mirror the specialization defining modern medicine and (ii) the consistent advantage of the specialized tool over general-purpose models does not necessarily mean that the latter cannot serve similar purposes, but that targeted engineering and customization can yield meaningful gains in performance for its users. We release Real-POCQi as a public benchmark, as well as the prespecified statistical analysis for reproducing results of this study.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript reports a blinded head-to-head expert evaluation of the specialized clinical AI tool OpenEvidence (OE) against three general-purpose frontier models (Claude Opus 4.8, Gemini 3.1 Pro, GPT-5.5) on 620 real point-of-care queries (Real-POCQi) plus 187 HealthBench questions. 149 specialty-matched physicians rated answers on accuracy, clinical utility, source quality, verifiability, and completeness; OE won by 25–39 percentage-point margins (p<0.001) on Real-POCQi, with results stable in sensitivity analyses. The authors conclude that real-world query distributions and expert judges are essential for clinical AI evaluation and that targeted engineering yields measurable gains. They release Real-POCQi and the prespecified analysis plan.

Significance. If the superiority claim holds under rigorous blinding, the work provides direct evidence that specialized clinical tools can outperform general-purpose models on authentic physician queries, supporting the broader argument that evaluation protocols must use real query distributions and domain-matched expert raters rather than exam-style items. The public release of Real-POCQi and the analysis plan is a concrete contribution that enables future replication and benchmarking.

major comments (2)
  1. [Abstract/Methods] Abstract and Methods (blinding protocol): The headline 25–39 pp win margins rest on the assumption that the 149 graders could not identify tool origin. The manuscript states only that the evaluation was “blinded,” with no description of answer reformatting, removal of citation-style signatures, length normalization, or post-hoc de-blinding checks. Because source quality and verifiability are two of the five scored axes—precisely the dimensions on which OE is engineered to differ—this omission leaves open the possibility that graders de-blinded and favored OE, directly threatening the causal interpretation of the reported differences.
  2. [Methods] Methods (query sampling and rater reliability): No details are provided on how the 620 Real-POCQi queries were sampled from the OE platform, what exclusion criteria were applied, or how inter-rater reliability was quantified among the 149 specialty-matched physicians. These omissions are load-bearing for the claim that the sample is representative of real clinical decision-support needs.
minor comments (2)
  1. [Results] The sensitivity analyses stratifying by citation display and answer length are helpful but would be strengthened by reporting the exact distribution of answer lengths and citation counts per model.
  2. [Results] The statement that “LLM judges were found to systematically differ from expert judges” would benefit from a quantitative comparison (e.g., agreement rates or rank correlations) rather than a qualitative summary.
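
The quantitative comparison requested in the second minor comment could be as simple as the sketch below. It computes a raw agreement rate between two judges' per-question verdicts and a Spearman rank correlation between two model rankings. The function names and the example data are hypothetical, and the Spearman formula used here assumes rankings without ties.

```python
def agreement_rate(labels_a, labels_b):
    """Fraction of items on which two judges gave the same verdict."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


def spearman_rho(ranks_a, ranks_b):
    """Spearman rank correlation for two rankings with no tied ranks."""
    n = len(ranks_a)
    if n != len(ranks_b) or n < 2:
        raise ValueError("need two rankings of equal length >= 2")
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


# Hypothetical per-question verdicts (which answer was preferred).
expert_verdicts = ["A", "A", "B", "tie", "A"]
llm_verdicts = ["A", "B", "B", "tie", "A"]
print(agreement_rate(expert_verdicts, llm_verdicts))

# Hypothetical rankings of four tools by each kind of judge.
expert_ranks = [1, 2, 3, 4]
llm_ranks = [1, 3, 2, 4]
print(spearman_rho(expert_ranks, llm_ranks))
```

In this invented example both judges rank the same tool first while disagreeing on the middle positions, which is the pattern the abstract describes qualitatively.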

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. We address each of the major comments below and will revise the manuscript accordingly to provide the requested details.

  1. Referee: [Abstract/Methods] Abstract and Methods (blinding protocol): The headline 25–39 pp win margins rest on the assumption that the 149 graders could not identify tool origin. The manuscript states only that the evaluation was “blinded,” with no description of answer reformatting, removal of citation-style signatures, length normalization, or post-hoc de-blinding checks. Because source quality and verifiability are two of the five scored axes—precisely the dimensions on which OE is engineered to differ—this omission leaves open the possibility that graders de-blinded and favored OE, directly threatening the causal interpretation of the reported differences.

    Authors: We agree with the referee that additional details on the blinding protocol are essential for interpreting the results, particularly given the importance of source quality and verifiability. In the revised manuscript, we will expand the Methods section to describe the specific procedures used to maintain blinding, including reformatting of answers, removal of citation-style signatures, length normalization, and any post-hoc assessments of de-blinding. These additions will address the concern and strengthen the causal claims. revision: yes

  2. Referee: [Methods] Methods (query sampling and rater reliability): No details are provided on how the 620 Real-POCQi queries were sampled from the OE platform, what exclusion criteria were applied, or how inter-rater reliability was quantified among the 149 specialty-matched physicians. These omissions are load-bearing for the claim that the sample is representative of real clinical decision-support needs.

    Authors: We acknowledge these omissions in the current Methods section. The revised manuscript will include a detailed description of the query sampling process from the OE platform, the exclusion criteria applied, and the quantification of inter-rater reliability (such as through statistical measures like Fleiss' kappa). This will provide transparency and support the representativeness of the Real-POCQi dataset. revision: yes
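
For reference, Fleiss' kappa (the measure named in the response above) can be computed as in the sketch below. This is an illustration under stated assumptions, not the authors' analysis: it assumes every item is rated by the same number of graders, that verdicts fall into a fixed set of categories (for example "A better", "tie", "B better"), and the function name `fleiss_kappa` and the sample counts are hypothetical.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table of rating counts.

    counts[i][j] is the number of raters who assigned item i to
    category j. Every row must sum to the same number of raters (>= 2).
    """
    n_items = len(counts)
    if n_items == 0:
        raise ValueError("need at least one item")
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("each item needs the same number of raters (>= 2)")
    n_categories = len(counts[0])

    # Overall share of ratings falling into each category.
    p_cat = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    # Observed agreement per item, then averaged over items.
    p_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(p_item) / n_items
    p_expected = sum(p * p for p in p_cat)
    if p_expected == 1.0:
        raise ValueError("kappa is undefined when only one category is used")
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical table: 2 items, 3 raters, 2 categories, perfect agreement.
print(fleiss_kappa([[3, 0], [0, 3]]))
```

If different questions were graded by different numbers of physicians, this fixed-rater form would not apply directly and another reliability measure would be needed.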

Circularity Check

0 steps flagged

No circularity: purely empirical comparison with independent expert ratings

The paper reports a head-to-head blinded evaluation of AI tool answers on 620 Real-POCQi queries using 149 specialty-matched physician graders. Primary results are win differences (25-39 pp, p<0.001) across five axes computed via standard statistical tests on collected ratings. No equations, derivations, fitted parameters, or self-citations appear in the provided text. The central claims rest on external expert judgments and prespecified analysis, not on any reduction of outputs to inputs by construction. This is the most common honest finding for empirical comparison studies.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

This empirical evaluation study introduces no free parameters, no invented entities, and relies only on standard statistical assumptions for win-rate comparisons and significance testing.

axioms (1)
  • [standard math] Standard assumptions underlying two-sample proportion tests and p-value calculations for win/loss rates
    Invoked when reporting p<0.001 for the primary win differences
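
The text above does not say which test produced the reported p-values. As one simple stand-in (a swapped-in technique, not the authors' prespecified analysis), the sketch below runs an exact two-sided sign test on win and loss counts with ties dropped. It treats comparisons as independent, so it ignores clustering by grader or by query. The function name and the counts are hypothetical.

```python
from math import comb


def sign_test_p_value(wins, losses):
    """Two-sided exact sign test of H0: a win is as likely as a loss.

    Ties are assumed to have been dropped before counting.
    """
    n = wins + losses
    if n == 0:
        raise ValueError("need at least one non-tied comparison")
    k = min(wins, losses)
    # P(X <= k) for X ~ Binomial(n, 0.5), computed with exact integers.
    lower_tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * lower_tail)


# Hypothetical counts: 10 wins and 0 losses.
print(sign_test_p_value(10, 0))
# Hypothetical counts: an even split gives no evidence of a difference.
print(sign_test_p_value(5, 5))
```

A design with repeated graders and repeated queries would usually call for a test that accounts for that dependence, which this sketch deliberately leaves out.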

pith-pipeline@v0.9.1-grok · 5905 in / 1320 out tokens · 29283 ms · 2026-06-30T09:26:29.151011+00:00 · methodology

Reference graph

Works this paper leans on

33 extracted references · 1 canonical work page · 1 internal anchor

  1. [1]

    2025 physicians AI report

    Offcall. 2025 physicians AI report. https://2025-physicians-ai-report.offcall.com/. Accessed: 2026-6-24

  2. [2]

    What disease does this patient have? a large-scale open domain question answering dataset from medical exams

    Di Jin, Eileen Pan, Nassim Oufattole, Wei-Hung Weng, Hanyi Fang, and Peter Szolovits. What disease does this patient have? a large-scale open domain question answering dataset from medical exams. Appl. Sci. (Basel), 11(14):6421, July 2021.

  3. [3]

    HealthBench: Evaluating large language models towards improved human health

    Rahul K Arora, Jason Wei, Rebecca Soskin Hicks, Preston Bowman, Joaquin Quiñonero-Candela, Foivos Tsimpourlas, Michael Sharman, Meghan Shah, Andrea Vallone, Alex Beutel, Johannes Heidecke, and Karan Singhal. HealthBench: Evaluating large language models towards improved human health. arXiv [cs.CL], May 2025

  4. [4]

    Holistic evaluation of large language models for medical tasks with MedHELM

    Suhana Bedi, Hejie Cui, Miguel Fuentes, Alyssa Unell, Michael Wornow, Juan M Banda, Nikesh Kotecha, Timothy Keyes, Yifan Mai, Mert Oez, Hao Qiu, Shrey Jain, Leonardo Schettini, Mehr Kashyap, Jason Alan Fries, Akshay Swaminathan, Philip Chung, Fateme Nateghi Haredasht, Ivan Lopez, Asad Aali, Gabriel Tse, Ashwin Nayak, Shivam Vedak, Sneha S Jain, Birju Pate...

  5. [5]

    General-purpose large language models outperform specialized clinical AI tools on medical benchmarks

    Krithik Vishwanath, Anton Alyakin, Mrigayu Ghosh, Ali Hage, Sean N Neifert, Cordelia Orillac, Nataniel J Mandelberg, Hammad A Khan, Jin Vivian Lee, Jie J Yao, William Robert Small, Aakaash Varma, D Brock Hewitt, Yindalon Aphinyanaphongs, Daniel Alexander Alber, and Eric Karl Oermann. General-purpose large language models outperform specialized clinical ...

  6. [6]

    Medical large language model benchmarks should prioritize construct validity

    Ahmed Alaa, Thomas Hartvigsen, Niloufar Golchini, Shiladitya Dutta, Frances Dean, Inioluwa Deborah Raji, and Travis Zack. Medical large language model benchmarks should prioritize construct validity. Proc. Int. Conf. Mach. Learn., March 2025

  7. [7]

    Large language models encode clinical knowledge

    Karan Singhal, Shekoofeh Azizi, Tao Tu, S Sara Mahdavi, Jason Wei, Hyung Won Chung, Nathan Scales, Ajay Tanwani, Heather Cole-Lewis, Stephen Pfohl, Perry Payne, Martin Seneviratne, Paul Gamble, Chris Kelly, Abubakr Babiker, Nathanael Schärli, Aakanksha Chowdhery, Philip Mansfield, Dina Demner-Fushman, Blaise Agüera y Arcas, Dale Webster, Greg S Corr...

  8. [8]

    JudgmentBench: Comparing rubric and preference evaluation for quality assessment

    Russell Yang, Ruishi Chen, Pierce Kelaita, Riya Ranjan, Sibo Ma, Charles Dickens, Matthew Guillod, Megan Ma, and Julian Nyarko. JudgmentBench: Comparing rubric and preference evaluation for quality assessment. arXiv [cs.CL], May 2026

  9. [9]

    Neither valid nor reliable? investigating the use of LLMs as judges

    Khaoula Chehbouni, Mohammed Haddou, Jackie Chi Kit Cheung, and Golnoosh Farnadi. Neither valid nor reliable? investigating the use of LLMs as judges. Adv. Neural Inf. Process. Syst., August 2025

  10. [10]

    LLMs judging LLMs: A simplex perspective

    Patrick Vossler, Fan Xia, Yifan Mai, Adarsh Subbaswamy, and Jean Feng. LLMs judging LLMs: A simplex perspective. International Conference on Artificial Intelligence and Statistics, 2026

  11. [11]

    Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference

    Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anastasios Nikolas Angelopoulos, Tianle Li, Dacheng Li, Hao Zhang, Banghua Zhu, Michael Jordan, Joseph E Gonzalez, and Ion Stoica. Chatbot arena: An open platform for evaluating LLMs by human preference. ICML, abs/2403.04132:8359–8388, March 2024

  12. [12]

    BERTopic: Neural topic modeling with a class-based TF-IDF procedure

    Maarten Grootendorst. BERTopic: Neural topic modeling with a class-based TF-IDF procedure. arXiv [cs.CL], March 2022

  13. [13]

    Human-AI co-design for clinical prediction models

    Jean Feng, Avni Kothari, Patrick Vossler, Andrew Bishara, Lucas Zier, Newton Addo, Aaron Kornblith, Yan Shuo Tan, and Chandan Singh. Human-AI co-design for clinical prediction models. NPJ Digit. Med., pages 1–11, June 2026

  14. [14]

    The win ratio: a new approach to the analysis of composite endpoints in clinical trials based on clinical priorities

    Stuart J Pocock, Cono A Ariti, Timothy J Collier, and Duolao Wang. The win ratio: a new approach to the analysis of composite endpoints in clinical trials based on clinical priorities. Eur. Heart J., 33(2):176–182, January 2012

  15. [15]

    Statistical inference with win statistics in cluster-randomized trials with composite outcomes

    Xi Fang, Guangyu Tong, Yuan Huang, F Perry Wilson, Patrick J Heagerty, and Fan Li. Statistical inference with win statistics in cluster-randomized trials with composite outcomes. arXiv [stat.ME], April 2026

  16. [16]

    AgentClinic: a multimodal benchmark for tool-using clinical AI agents

    Samuel Schmidgall, Rojin Ziaei, Carl Harris, Ji Woong Kim, Eduardo Pontes Reis, Jeffrey Jopling, and Michael Moor. AgentClinic: a multimodal benchmark for tool-using clinical AI agents. NPJ Digit. Med., April 2026

  17. [17]

    Autonomous medical evaluation for guideline adherence of large language models

    Dennis Fast, Lisa C Adams, Felix Busch, Conor Fallon, Marc Huppertz, Robert Siepmann, Philipp Prucker, Nadine Bayerl, Daniel Truhn, Marcus Makowski, Alexander Löser, and Keno K Bressem. Autonomous medical evaluation for guideline adherence of large language models. NPJ Digit. Med., 7(1):358, December 2024

  18. [18]

    Benchmarking cognitive biases in large language models as evaluators

    Ryan Koo, Minhwa Lee, Vipul Raheja, Jong Inn Park, Zae Myung Kim, and Dongyeop Kang. Benchmarking cognitive biases in large language models as evaluators. In Findings of the Association for Computational Linguistics ACL 2024, pages 517–545, Stroudsburg, PA, USA, 2024. Association for Computational Linguistics

  19. [19]

    Judging LLM-as-a-judge with MT-bench and chatbot arena

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, Hao Zhang, Joseph E Gonzalez, and Ion Stoica. Judging LLM-as-a-judge with MT-bench and chatbot arena. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, November 2023

  20. [20]

    Replacing judges with juries: Evaluating LLM generations with a panel of diverse models

    Pat Verga, Sebastian Hofstatter, Sophia Althammer, Yixuan Su, Aleksandra Piktus, Arkady Arkhangorodsky, Minjie Xu, Naomi White, and Patrick Lewis. Replacing judges with juries: Evaluating LLM generations with a panel of diverse models. arXiv [cs.CL], April 2024

  21. [21]

    ChatEval: Towards better LLM-based evaluators through multi-agent debate

    Chi-Min Chan, Weize Chen, Yusheng Su, Jianxuan Yu, Wei Xue, Shanghang Zhang, Jie Fu, and Zhiyuan Liu. ChatEval: Towards better LLM-based evaluators through multi-agent debate. In The Twelfth International Conference on Learning Representations, 2024

  22. [22]

    VERDICT: A library for compound LLM judge systems

    Nimit Kalra and Leonard Tang. VERDICT: A library for compound LLM judge systems

  23. [23]

    BRIDGE: benchmarking large language models for understanding real-world clinical practice texts

    Jiageng Wu, Bowen Gu, Ren Zhou, Kevin Xie, Doug Snyder, Yixing Jiang, Valentina Carducci, Richard Wyss, Rishi J Desai, Emily Alsentzer, Leo Anthony Celi, Adam Rodman, Sebastian Schneeweiss, Jonathan H Chen, Santiago Romero-Brufau, Kueiyu Joshua Lin, and Jie Yang. BRIDGE: benchmarking large language models for understanding real-world clinical practice tex...

  24. [24]

    Prediction-powered ranking of large language models

    Ivi Chatzi, Eleni Straitouri, Suhas Thejaswi, and Manuel Gomez Rodriguez. Prediction-powered ranking of large language models. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, November 2024

  25. [25]

    The clinician and dataset shift in artificial intelligence

    Samuel G Finlayson, Adarsh Subbaswamy, Karandeep Singh, John Bowers, Annabel Kupke, Jonathan Zittrain, Isaac S Kohane, and Suchi Saria. The clinician and dataset shift in artificial intelligence. N. Engl. J. Med., 385(3):283–286, July 2021

  26. [26]

    Clinical artificial intelligence quality improvement: towards continual monitoring and updating of AI algorithms in healthcare

    Jean Feng, Rachael V Phillips, Ivana Malenica, Andrew Bishara, Alan E Hubbard, Leo A Celi, and Romain Pirracchio. Clinical artificial intelligence quality improvement: towards continual monitoring and updating of AI algorithms in healthcare. npj Digital Medicine, 5(1):1–9, May 2022

  27. [27]

    Clinical trials for continuously monitored and updated AI systems

    Wouter A C van Amsterdam, Michael Oberst, Jean Feng, Jenna Wiens, Shengpu Tang, Shalmali Joshi, Rajesh Ranganath, Mark Sendak, Uri Shalit, Julia E Vogt, Brett Beaulieu-Jones, Muhammad Mamdani, David Kent, Patrick J Heagerty, Thomas R Fleming, and Anna Goldenberg. Clinical trials for continuously monitored and updated AI systems. Nat. Med., pages 1–3, April 2026

  28. [28]

    WildBench: Benchmarking LLMs with challenging tasks from real users in the wild

    Bill Yuchen Lin, Yuntian Deng, Khyathi Chandu, Abhilasha Ravichander, Valentina Pyatkin, Nouha Dziri, Ronan Le Bras, and Yejin Choi. WildBench: Benchmarking LLMs with challenging tasks from real users in the wild. International Conference on Learning Representations, 2025:47852–47870, May 2025

  29. [29]

    Favorability ratings

    Favorability ratings. In Paul Lavrakas, editor, Encyclopedia of survey research methods, pages 272–

  30. [30]

    Sage Publications, Inc., 2455 Teller Road, Thousand Oaks California 91320 United States of America, September 2008

  31. [31]

    Adding error bars to evals: A statistical approach to language model evaluations

    Evan Miller. Adding error bars to evals: A statistical approach to language model evaluations. arXiv [stat.AP], November 2024

  32. [32]

    On extending the bradley-terry model to accommodate ties in paired comparison experiments

    Roger R Davidson. On extending the bradley-terry model to accommodate ties in paired comparison experiments. J. Am. Stat. Assoc., 65(329):317, March 1970

  33. [33]

    Rank analysis of incomplete block designs: I. the method of paired comparisons

    Ralph Allan Bradley and Milton E Terry. Rank analysis of incomplete block designs: I. the method of paired comparisons. Biometrika, 39(3/4):324, December 1952.