Evaluating Clinical Competencies of Large Language Models with a General Practice Benchmark
Pith reviewed 2026-05-22 23:38 UTC · model grok-4.3
The pith
According to a new expert-annotated benchmark, large language models are not suitable for autonomous deployment in general practice.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper establishes that current LLMs are not suitable for autonomous deployment in clinical general practice. All realistic applications require continuous human oversight, and further optimization specifically tailored to the daily responsibilities of GPs remains essential. The conclusion rests on evaluations conducted with GPBench, a benchmark whose data are meticulously annotated by domain experts in accordance with routine clinical practice standards.
What carries the argument
GPBench, a general practice benchmark whose data are annotated by domain experts in accordance with routine clinical practice standards, serves as the evaluation framework for measuring LLM competencies against GP duties.
If this is right
- Current LLMs cannot reliably fulfill the duties of general practitioners.
- All realistic applications of LLMs in general practice require continuous human oversight.
- Further optimization specifically tailored to the daily responsibilities of GPs remains essential.
Where Pith is reading between the lines
- GPBench scores could serve as a baseline for measuring progress in future LLM versions aimed at primary care.
- Similar expert-annotated benchmarks may prove useful for evaluating LLMs in other medical specialties.
- Practical deployment strategies could combine LLMs with structured human review processes calibrated to GPBench performance levels.
Load-bearing premise
The GPBench framework, constructed from expert annotations aligned with routine clinical practice standards, accurately measures the competencies needed for autonomous general practitioner duties.
What would settle it
A demonstration that an LLM reaches human GP performance levels on GPBench tasks during real clinical deployment with no ongoing supervision would challenge the central claim.
Original abstract
Large Language Models (LLMs) have demonstrated considerable potential in general practice. However, existing benchmarks and evaluation frameworks primarily depend on exam-style or simplified question-answer formats, lacking a competency-based structure aligned with the real-world clinical responsibilities encountered in general practice. Consequently, the extent to which LLMs can reliably fulfill the duties of general practitioners (GPs) remains uncertain. In this work, we propose a novel evaluation framework to assess the capability of LLMs to function as GPs. Based on this framework, we introduce a general practice benchmark (GPBench), whose data are meticulously annotated by domain experts in accordance with routine clinical practice standards. We evaluate ten state-of-the-art LLMs and analyze their competencies. Our findings indicate that current LLMs are not suitable for autonomous deployment in clinical general practice and that all realistic applications require continuous human oversight; further optimization specifically tailored to the daily responsibilities of GPs remains essential.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces GPBench, a competency-based evaluation framework and benchmark for LLMs in general practice, constructed from expert-annotated cases aligned with routine clinical practice standards. It evaluates ten state-of-the-art LLMs on this benchmark and concludes that current LLMs are not suitable for autonomous deployment in clinical general practice, that all realistic applications require continuous human oversight, and that further optimization tailored to GP daily responsibilities remains essential.
Significance. The introduction of a competency-based framework that moves beyond exam-style or simplified QA formats to align with real-world GP responsibilities is a clear strength and addresses a documented gap in existing medical AI benchmarks. If the GPBench items are shown to be reliable and if performance on them correlates with clinical safety outcomes, the results would provide actionable evidence on the current limitations of LLMs for high-stakes autonomous use and reinforce the importance of human-in-the-loop designs in clinical AI.
major comments (4)
- [Abstract] The central claim that LLMs are unsuitable for autonomous deployment is asserted without any reported model scores, inter-annotator agreement, benchmark validation steps, or error analysis, preventing assessment of whether the evidence supports the conclusion.
- [GPBench construction] The expert annotation process aligned with routine clinical standards is described, but no inter-annotator reliability statistics are supplied, leaving the consistency and validity of the benchmark data unverified.
- [Evaluation results] No licensed-GP baseline scores on the identical GPBench items are reported, so it is impossible to distinguish whether low LLM performance reflects model inadequacy or benchmark difficulty.
- [Discussion] The inference that continuous human oversight is required for all realistic applications rests on benchmark scores alone, without any mapping from observed errors to documented adverse events or safety data from primary care.
minor comments (1)
- [Abstract] The abstract would benefit from inclusion of the key quantitative LLM performance figures to allow readers to gauge the magnitude of the reported gaps.
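The baseline concern raised in the major comments can be made concrete with a small sketch. Assuming per-item scores were available for both an LLM and licensed GPs on the identical GPBench items (the paper reports no such GP scores), a paired bootstrap gives an interval for the mean score gap. The function name `paired_bootstrap_gap`, the 0-100 score scale, and the numbers in the usage lines are hypothetical and for illustration only.

```python
import random


def paired_bootstrap_gap(llm_scores, gp_scores, n_resamples=2000, seed=0):
    """Return (mean_gap, ci_low, ci_high) for LLM minus GP scores.

    Both lists must be aligned item by item. The interval is a plain
    95% percentile bootstrap over per-item differences.
    """
    if len(llm_scores) != len(gp_scores) or not llm_scores:
        raise ValueError("score lists must be non-empty and equally long")

    # Per-item differences keep the pairing between LLM and GP scores.
    diffs = [l - g for l, g in zip(llm_scores, gp_scores)]
    n = len(diffs)
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible

    means = []
    for _ in range(n_resamples):
        # Resample items with replacement and record the mean gap.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()

    ci_low = means[int(0.025 * n_resamples)]
    ci_high = means[int(0.975 * n_resamples) - 1]
    return sum(diffs) / n, ci_low, ci_high


# Hypothetical scores, invented purely to show the call shape.
llm = [60, 55, 70, 65]
gp = [80, 75, 90, 85]
gap, low, high = paired_bootstrap_gap(llm, gp)
print(gap, low, high)
```

An interval that sits well below zero would point to model inadequacy rather than benchmark difficulty; an interval that spans zero would suggest the items are hard for clinicians as well.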
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive feedback, which highlights important areas for strengthening the manuscript's claims and transparency. We address each major comment point-by-point below, proposing revisions where the points identify verifiable gaps in the current version.
Point-by-point responses
- Referee: [Abstract] The central claim that LLMs are unsuitable for autonomous deployment is asserted without any reported model scores, inter-annotator agreement, benchmark validation steps, or error analysis, preventing assessment of whether the evidence supports the conclusion.
  Authors: We agree that the abstract would benefit from greater specificity to allow readers to evaluate the central claim. In the revised version, we will expand the abstract to include summary performance metrics across the ten LLMs (e.g., overall competency scores and key failure rates), a brief reference to the expert annotation process and inter-annotator agreement, and mention of the error analysis presented in the results section. This will make the evidence supporting the conclusion more immediately assessable while remaining within abstract length constraints.
  Revision: yes
- Referee: [GPBench construction] The expert annotation process aligned with routine clinical standards is described, but no inter-annotator reliability statistics are supplied, leaving the consistency and validity of the benchmark data unverified.
  Authors: This is a valid observation. While the annotation followed a multi-expert process with standardized clinical guidelines, inter-annotator agreement statistics were not reported in the original submission. We will add these metrics (e.g., Cohen's kappa or percentage agreement) to the methods section of the revised manuscript, calculated on a subset of overlapping annotations, to provide quantitative evidence of benchmark reliability.
  Revision: yes
- Referee: [Evaluation results] No licensed-GP baseline scores on the identical GPBench items are reported, so it is impossible to distinguish whether low LLM performance reflects model inadequacy or benchmark difficulty.
  Authors: We acknowledge the value of a direct human baseline for calibration. The GPBench items were constructed from routine clinical practice standards that licensed GPs are expected to meet, providing an implicit reference point. However, administering the full benchmark to a cohort of licensed GPs would require substantial additional resources, recruitment, and ethics approvals not feasible within the current study timeline. In the revision, we will explicitly discuss this limitation, provide qualitative context on expected GP performance based on the competency framework, and recommend human baseline collection as important future work.
  Revision: partial
- Referee: [Discussion] The inference that continuous human oversight is required for all realistic applications rests on benchmark scores alone, without any mapping from observed errors to documented adverse events or safety data from primary care.
  Authors: We agree that stronger linkage to real-world safety outcomes would reinforce the practical implications. The current inference draws from the benchmark's alignment with documented GP competencies, where failures in areas such as diagnosis, management, or communication carry inherent clinical risks. We will revise the discussion to more explicitly acknowledge the absence of direct adverse-event mapping, clarify that the recommendation for human oversight is based on competency gaps rather than proven harm, and highlight the need for future studies correlating GPBench performance with primary-care safety data.
  Revision: yes
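The inter-annotator statistic named in the rebuttal, Cohen's kappa, can be sketched as follows. This is a minimal illustration of the standard two-rater formula, kappa = (p_o - p_e) / (1 - p_e), and not the authors' procedure; the function name `cohens_kappa`, the label values, and the toy annotations are assumptions made for the example.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and equally long")

    n = len(labels_a)

    # Observed agreement: share of items where both annotators agree.
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: product of each annotator's marginal label rates,
    # summed over labels.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    p_expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)

    if p_expected == 1.0:
        # Both annotators used a single identical label throughout; kappa is
        # undefined here, and returning 1.0 is a convention of this sketch.
        return 1.0
    return (p_observed - p_expected) / (1 - p_expected)


# Toy annotations, invented for illustration (e.g. "is referral required?").
annotator_1 = ["yes", "yes", "no", "no"]
annotator_2 = ["yes", "no", "no", "no"]
print(cohens_kappa(annotator_1, annotator_2))  # kappa = 0.5 for this toy input
```

Percentage agreement alone would report 0.75 on the same toy input. Kappa discounts the agreement expected by chance, which is why it is often reported alongside the raw rate.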
Circularity Check
No circularity: new expert-annotated benchmark with no equations, fits, or self-referential derivations
full rationale
The paper constructs GPBench from fresh expert annotations aligned to routine clinical standards and evaluates LLMs on it. No equations, parameters, or predictions are fitted to subsets of the data; the central claim (LLMs unsuitable for autonomous use) follows directly from observed scores on the new benchmark rather than reducing to any self-definition, self-citation chain, or renamed known result. The derivation chain is therefore self-contained against external benchmarks and does not exhibit any of the enumerated circularity patterns.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: domain expert annotations accurately reflect routine clinical practice standards for general practitioners.