Evaluating Clinical Competencies of Large Language Models with a General Practice Benchmark

Dingqian Wang; Hongji Yu; Jiexian Qiu; Jiping Lang; Junrong Chen; Lin Yao; Shuang Chen; Shuang Li; Wenhao Jiang; Xiaofei Zeng

arxiv: 2503.17599 · v3 · pith:XLP6RMQVnew · submitted 2025-03-22 · 💻 cs.CL · cs.AI

Zheqing Li, Yiying Yang, Jiping Lang, Wenhao Jiang, Junrong Chen, Yuhang Zhao, Shuang Li, Dingqian Wang, Zhu Lin, Xuanna Li, Yuze Tang, Jiexian Qiu, Xiaolin Lu, Hongji Yu, Shuang Chen, Yuhua Bi, Xiaofei Zeng, Yixian Chen, Lin Yao

Pith reviewed 2026-05-22 23:38 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords large language models · general practice · clinical benchmark · LLM evaluation · medical AI · competency assessment · primary care

The pith

Large language models are not suitable for autonomous deployment in general practice based on a new expert-annotated benchmark.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates GPBench to test LLMs against the actual daily responsibilities of general practitioners rather than simplified exam questions. Ten state-of-the-art models were evaluated using data annotated by domain experts to match routine clinical standards. The results show consistent gaps in performance that prevent independent operation. This finding indicates that realistic clinical use of these models requires ongoing human oversight and that further model optimization focused on GP tasks is still needed.

Core claim

The paper establishes that current LLMs are not suitable for autonomous deployment in clinical general practice. All realistic applications require continuous human oversight, and further optimization specifically tailored to the daily responsibilities of GPs remains essential. The conclusion rests on evaluations conducted with GPBench, a benchmark whose data are meticulously annotated by domain experts in accordance with routine clinical practice standards.

What carries the argument

GPBench, a general practice benchmark whose data are meticulously annotated by domain experts in accordance with routine clinical practice standards, used as the evaluation framework to measure LLM competencies against GP duties.

If this is right

Current LLMs cannot reliably fulfill the duties of general practitioners.
All realistic applications of LLMs in general practice require continuous human oversight.
Further optimization specifically tailored to the daily responsibilities of GPs remains essential.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

GPBench scores could serve as a baseline for measuring progress in future LLM versions aimed at primary care.
Similar expert-annotated benchmarks may prove useful for evaluating LLMs in other medical specialties.
Practical deployment strategies could combine LLMs with structured human review processes calibrated to GPBench performance levels.

Load-bearing premise

The GPBench framework, constructed from expert annotations aligned with routine clinical practice standards, accurately measures the competencies needed for autonomous general practitioner duties.

What would settle it

A demonstration that an LLM reaches human GP performance levels on GPBench tasks during real clinical deployment with no ongoing supervision would challenge the central claim.

Original abstract

Large Language Models (LLMs) have demonstrated considerable potential in general practice. However, existing benchmarks and evaluation frameworks primarily depend on exam-style or simplified question-answer formats, lacking a competency-based structure aligned with the real-world clinical responsibilities encountered in general practice. Consequently, the extent to which LLMs can reliably fulfill the duties of general practitioners (GPs) remains uncertain. In this work, we propose a novel evaluation framework to assess the capability of LLMs to function as GPs. Based on this framework, we introduce a general practice benchmark (GPBench), whose data are meticulously annotated by domain experts in accordance with routine clinical practice standards. We evaluate ten state-of-the-art LLMs and analyze their competencies. Our findings indicate that current LLMs are not suitable for autonomous deployment in clinical general practice and that all realistic applications require continuous human oversight; further optimization specifically tailored to the daily responsibilities of GPs remains essential.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

GPBench tries to fix the exam-style gap in medical LLM tests but the abstract gives no scores, agreement stats, or human baselines to support the autonomy claim.

The paper's main move is to build GPBench from expert-annotated cases that track routine general practice duties instead of multiple-choice exams. That framing is the clearest difference from the earlier medical LLM benchmarks mentioned in the abstract. It directly targets the mismatch between test formats and what GPs actually do day to day, which is a documented shortcoming in the field. The authors then run ten current models and conclude that none are ready for unsupervised use and that human oversight will remain necessary. The framework itself is the part worth noting; it tries to align evaluation with real responsibilities.

The rest of the abstract is thin. It states the main finding without reporting any model scores, inter-annotator agreement, validation steps for the benchmark items, or error analysis. There are also no licensed-GP baseline numbers on the same cases and no mapping from benchmark mistakes to documented clinical risks. The stress-test concern therefore stands: without those anchors, the low performance could simply mean that the tasks are hard or the sample is narrow, rather than establishing a general rule against autonomous deployment.

The paper is aimed at groups working on medical AI evaluation who want benchmarks that feel closer to primary care. A reader already following that literature would pick up the competency angle and the call for tailored optimization. It is not ready to shift deployment decisions on its own. The work deserves a serious referee because the underlying gap is real and the benchmark idea is a reasonable response, even though the current version will need the missing validation data and baselines before the central claim can be assessed.

Referee Report

4 major / 1 minor

Summary. The paper introduces GPBench, a competency-based evaluation framework and benchmark for LLMs in general practice, constructed from expert-annotated cases aligned with routine clinical practice standards. It evaluates ten state-of-the-art LLMs on this benchmark and concludes that current LLMs are not suitable for autonomous deployment in clinical general practice, that all realistic applications require continuous human oversight, and that further optimization tailored to GP daily responsibilities remains essential.

Significance. The introduction of a competency-based framework that moves beyond exam-style or simplified QA formats to align with real-world GP responsibilities is a clear strength and addresses a documented gap in existing medical AI benchmarks. If the GPBench items are shown to be reliable and if performance on them correlates with clinical safety outcomes, the results would provide actionable evidence on the current limitations of LLMs for high-stakes autonomous use and reinforce the importance of human-in-the-loop designs in clinical AI.

major comments (4)

[Abstract] The central claim that LLMs are unsuitable for autonomous deployment is asserted without any reported model scores, inter-annotator agreement, benchmark validation steps, or error analysis, preventing assessment of whether the evidence supports the conclusion.
[GPBench construction] The expert annotation process aligned with routine clinical standards is described, but no inter-annotator reliability statistics are supplied, leaving the consistency and validity of the benchmark data unverified.
[Evaluation results] No licensed-GP baseline scores on the identical GPBench items are reported, so it is impossible to distinguish whether low LLM performance reflects model inadequacy or benchmark difficulty.
[Discussion] The inference that continuous human oversight is required for all realistic applications rests on benchmark scores alone, without any mapping from observed errors to documented adverse events or safety data from primary care.

minor comments (1)

[Abstract] The abstract would benefit from inclusion of the key quantitative LLM performance figures to allow readers to gauge the magnitude of the reported gaps.

Simulated Author's Rebuttal

4 responses · 0 unresolved

We thank the referee for the thoughtful and constructive feedback, which highlights important areas for strengthening the manuscript's claims and transparency. We address each major comment point-by-point below, proposing revisions where the points identify verifiable gaps in the current version.

Referee: [Abstract] The central claim that LLMs are unsuitable for autonomous deployment is asserted without any reported model scores, inter-annotator agreement, benchmark validation steps, or error analysis, preventing assessment of whether the evidence supports the conclusion.

Authors: We agree that the abstract would benefit from greater specificity to allow readers to evaluate the central claim. In the revised version, we will expand the abstract to include summary performance metrics across the ten LLMs (e.g., overall competency scores and key failure rates), a brief reference to the expert annotation process and inter-annotator agreement, and mention of the error analysis presented in the results section. This will make the evidence supporting the conclusion more immediately assessable while remaining within abstract length constraints.

revision: yes

Referee: [GPBench construction] The expert annotation process aligned with routine clinical standards is described, but no inter-annotator reliability statistics are supplied, leaving the consistency and validity of the benchmark data unverified.

Authors: This is a valid observation. While the annotation followed a multi-expert process with standardized clinical guidelines, inter-annotator agreement statistics were not reported in the original submission. We will add these metrics (e.g., Cohen's kappa or percentage agreement) to the methods section of the revised manuscript, calculated on a subset of overlapping annotations, to provide quantitative evidence of benchmark reliability.

revision: yes

Referee: [Evaluation results] No licensed-GP baseline scores on the identical GPBench items are reported, so it is impossible to distinguish whether low LLM performance reflects model inadequacy or benchmark difficulty.

Authors: We acknowledge the value of a direct human baseline for calibration. The GPBench items were constructed from routine clinical practice standards that licensed GPs are expected to meet, providing an implicit reference point. However, administering the full benchmark to a cohort of licensed GPs would require substantial additional resources, recruitment, and ethics approvals not feasible within the current study timeline. In the revision, we will explicitly discuss this limitation, provide qualitative context on expected GP performance based on the competency framework, and recommend human baseline collection as important future work.

revision: partial

Referee: [Discussion] The inference that continuous human oversight is required for all realistic applications rests on benchmark scores alone, without any mapping from observed errors to documented adverse events or safety data from primary care.

Authors: We agree that stronger linkage to real-world safety outcomes would reinforce the practical implications. The current inference draws from the benchmark's alignment with documented GP competencies, where failures in areas such as diagnosis, management, or communication carry inherent clinical risks. We will revise the discussion to more explicitly acknowledge the absence of direct adverse-event mapping, clarify that the recommendation for human oversight is based on competency gaps rather than proven harm, and highlight the need for future studies correlating GPBench performance with primary-care safety data.

revision: yes
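
The agreement statistics named in the response on GPBench construction (Cohen's kappa or percentage agreement) can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration: the function names, the binary yes/no labels, and the two toy label lists are hypothetical and are not taken from the paper, which does not state which statistic or annotation schema the authors will report.

```python
from collections import Counter


def percent_agreement(labels_a, labels_b):
    """Fraction of items on which two annotators gave the same label."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(labels_a)
    p_observed = percent_agreement(labels_a, labels_b)
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    # Chance agreement: probability that both annotators pick the same
    # label if each labels independently with their own label frequencies.
    p_expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if p_expected == 1.0:
        # Both annotators used one identical label throughout.
        return 1.0
    return (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical overlapping annotations on eight items (not real GPBench data).
annotator_1 = ["yes", "yes", "no", "yes", "no", "no", "yes", "no"]
annotator_2 = ["yes", "no", "no", "yes", "no", "yes", "yes", "no"]

print(percent_agreement(annotator_1, annotator_2))  # 0.75
print(cohens_kappa(annotator_1, annotator_2))       # 0.5
```

In this toy case the annotators agree on six of eight items (0.75), but because each uses "yes" and "no" equally often, half of that agreement is expected by chance, so kappa is 0.5. Raw percentage agreement can therefore overstate reliability, which is one reason a reader might want both figures reported.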

Circularity Check

0 steps flagged

No circularity: new expert-annotated benchmark with no equations, fits, or self-referential derivations

The paper constructs GPBench from fresh expert annotations aligned to routine clinical standards and evaluates LLMs on it. No equations, parameters, or predictions are fitted to subsets of the data; the central claim (LLMs unsuitable for autonomous use) follows directly from observed scores on the new benchmark rather than reducing to any self-definition, self-citation chain, or renamed known result. The derivation chain is therefore self-contained against external benchmarks and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The evaluation framework depends on the assumption that expert annotations faithfully represent real GP competencies; no free parameters or new entities are introduced.

axioms (1)

domain assumption: Domain expert annotations accurately reflect routine clinical practice standards for general practitioners.
Benchmark data are meticulously annotated by domain experts in accordance with routine clinical practice standards.


[1] [1]

Menezes, M. C. S. et al. The potential of generative pre-trained transformer 4 (gpt-4) to analyse medical notes in three different languages: a retrospective model-evaluation study. The Lancet Digital Health 7, e35–e43 (2025)

work page 2025

[2] [2]

& Bignami, E

Bellini, V. & Bignami, E. G. Generative pre-trained transformer 4 (gpt-4) in clinical settings. The Lancet Digital Health 7, e6–e7 (2025)

work page 2025

[3] [3]

Singhal, K. et al. Large language models encode clinical knowledge. Nature 620, 172–180 (2023)

work page 2023

[4] [4]

Singhal, K. et al. Toward expert-level medical question answering with large language models. Nature Medicine 1–8 (2025)

work page 2025

[5] [5]

Strong, E. et al. Chatbot vs medical student performance on free-response clinical reasoning examinations. JAMA internal medicine 183, 1028–1030 (2023)

work page 2023

[6] [6]

Gilson, A. et al. How does chatgpt perform on the united states medical licens- ing examination (usmle)? the implications of large language models for medical education and knowledge assessment. JMIR medical education 9, e45312 (2023)

work page 2023

[7] [7]

Jin, D. et al. What disease does this patient have? a large-scale open domain question answering dataset from medical exams. Applied Sciences 11, 6421 (2021)

work page 2021

[8] [8]

McDuff, D. et al. Towards accurate differential diagnosis with large language models. Nature 1–7 (2025)

work page 2025

[9] Hurst, A. et al. GPT-4o system card. arXiv preprint arXiv:2410.21276 (2024)

[10] Jaech, A. et al. OpenAI o1 system card. arXiv preprint arXiv:2412.16720 (2024)

[11] Team, G. et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530 (2024)

[12] Yang, A. et al. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115 (2024)

[13] Liu, A. et al. DeepSeek-V3 technical report. arXiv preprint arXiv:2412.19437 (2024)

[14] Guo, D. et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948 (2025)

[15] Chen, J. et al. HuatuoGPT-o1, towards medical complex reasoning with LLMs. arXiv preprint arXiv:2412.18925 (2024)

[16] Europe, W. The European definition of general practice/family medicine-2023 edition. Barcelona: WONCA Europe (2023). URL https://www.woncaeurope.org/page/definition-of-general-practice-family-medicine

[17] Scherger, J. E. Preparing the personal physician for practice (P4): essential skills for new family physicians and how residency programs may provide them. The Journal of the American Board of Family Medicine 20, 348–355 (2007)

[18] McClelland, D. C. Testing for competence rather than for “intelligence”. American Psychologist 28, 1 (1973)

[19] Boyatzis, R. E. The competent manager: A model for effective performance (John Wiley & Sons, 1991)

[20] Wang, X. et al. CMB: A comprehensive medical benchmark in Chinese. arXiv preprint arXiv:2308.08833 (2023)

[21] Liu, M. et al. MedBench: A comprehensive, standardized, and reliable benchmarking system for evaluating Chinese medical large language models. Big Data Mining and Analytics (2024). URL https://www.sciopen.com/article/10.26599/BDMA.2024.9020044

[22] Collaborators, G. et al. Global, regional, and national age–sex specific all-cause and cause-specific mortality for 240 causes of death, 1990–2013: a systematic analysis for the Global Burden of Disease Study 2013. The Lancet 385, 117–171 (2015)

[23] Zhou, M. et al. Cause-specific mortality for 240 causes in China during 1990–2013: a systematic subnational analysis for the Global Burden of Disease Study 2013. The Lancet 387, 251–272 (2016)

[24] Peng, W. et al. Trends in major non-communicable diseases and related risk factors in China 2002–2019: an analysis of nationally representative survey data. The Lancet Regional Health–Western Pacific 43 (2024)

[25] Chen, X. et al. Enhancing diagnostic capability with multi-agents conversational large language models. npj Digital Medicine 8, 159 (2025)

[26] Tu, T. et al. Towards conversational diagnostic artificial intelligence. Nature 1–9 (2025)

Appendix A The competency indicators and definitions used in our proposed evaluation framework

Table A1: The competency indicators and definitions used in our proposed evaluation framework.

Primary Indicator; Secondary Indicator; Definition

I1. Basic Medical Knowl...

... preventing diseases before they occur, preventing disease progression during illness, and preventing recurrence after illness

Secondary thrombocytopenia

Stage 2 hypertension, high-risk category

Treatment:

I. General Management

The patient should rest, eat easily digestible foods, maintain oral hygiene, and ensure water and electrolyte balance to reduce complications. For high fever, physical cooling or appropriate use of antipyretic agents may be employed, avoiding excessive sweating caused by certain ant...

Tetracyclines: Recommended dose of doxycycline for adults is 0.1 g twice a day; the first dose is doubled. Continue the anti-infective therapy for at least 3 days after body temperature returns to normal and clinical symptoms improve and stabilize, generally for 7–10 days. In severe cases, intravenous administration of doxycycline may be considered: on d...

Macrolides: Roxithromycin 150 mg twice a day for adults; after fever subsides, 150 mg once a day. Azithromycin 0.5 g once daily for adults; after fever subsides, 0.25 g once daily. Clarithromycin 0.5 g once every 12 hours for adults. All of these regimens are given for 7–10 days.

III. Symptomatic and Supportive Treatment

In cases of scrub typhus complica...

Splenomegaly (10 points)

Hepatic insufficiency (10 points)

Pleural effusion (10 points)

Hypoproteinemia (10 points)

Secondary thrombocytopenia (10 points)

Grade 2 hypertension; high-risk group (10 points)

Indicator: Referral Decision-making. Applicability: Yes. Scoring Criteria: Referral to the department of infectious diseases (100 points).

Indicator: Acute and Critical Condition Recognition. Applicability: Yes. Scoring Criteria: Hepatic insufficiency (25 points); Pleural effusion (25 points); Hypoproteinemia (25 points); Secondary thrombocytopenia (25 points)

Indicator: Complication Identification. Applicability: Yes...

Tetracyclines: Doxycycline. Recommended dosage for adults: 0.1 g twice a day, with the initial dose doubled. The anti-infective course should continue for at least 3 days after body temperature returns to normal and clinical symptoms have improved and stabilized, generally for 7–10 days. For severe cases, intravenous infusion may be selected. On the firs...

Macrolides: Roxithromycin 150 mg twice a day for adults; once the fever subsides, 150 mg once a day. Azithromycin 0.5 g once a day for adults; after defervescence, 0.25 g once a day. Clarithromycin 0.5 g per dose for adults, administered orally once every 12 hours. All of these medications are given for a total of 7–10 days.

II. Symptomatic and Supporti...

... “I’m not sure”

Start the consultation by saying: “Hello, what seems to be the problem?”

When you believe you have enough information to recommend physical examination items, provide them with the phrase: “The following physical examinations are necessary:”

When you believe you have enough information to recommend auxiliary tests, provide them with the phrase: “The following auxiliary tests are necessary:”

When you believe you have enough information to make a detailed diagnosis, provide it with the phrase: “The following is the diagnosis and treatment plan:”

After providing the diagnosis and treatment plan, conclude with: “Consultation ended.”

Appendix H An example of LLMs’ response and its evaluation result

Table H7: An example of a CAD medical case.

Case

Patient: male, 74 years old.

Chief Complaint: Intermittent chest pain for 2 years, worsened over the past week.

Present Illness: Two years ago, the pati...

Atherosclerotic coronary artery disease (effort angina, CCS class II, high risk)

Hypertension stage 2 (very high-risk group)

Treatment:

I. Further Examinations

Complete blood count, coagulation profile, blood glucose, lipid panel, liver and renal function, BNP, homocysteine, urinalysis; 24 h ambulatory blood pressure monitoring, 24 h Holter ECG, echocardiography, vascular ultrasound (carotid and lower ext...

Diet: low salt (< 6 g/day), low fat (< 25 g/day), light and easily digestible foods rich in vitamins. Avoid fried food, animal organs, and cholesterol-rich food. Prefer small, frequent meals; avoid overeating. No spicy food, alcohol, tobacco, strong tea, or coffee

Physical activity: mild walking before coronary angiography; after procedure, gradually resume moderate aerobic exercise (30–45 min, 5 times/week) depending on condition

Self-monitoring of BP and heart rate daily, record symptoms (chest pain, tightness, foamy urine), take medications regularly. Target BP < 140/90 mmHg, ideally < 130/80 mmHg if tolerated (avoid excessive diastolic drop)

Psychological support: help reduce stress, ensure adequate sleep, encourage adherence to treatment plan

Regular follow-up: every 1–3 months to monitor drug side effects

Risk assessment via coronary angiography: for patients with mild angina (CCS I–II) or asymptomatic, yet high-risk by non-invasive stratification, consider revascularization. Recommend this patient undergo coronary angiography in a higher-level hospital; stenting if necessary.

III. Pharmacological Treatment

Heart rate control: e.g., metoprolol tartrate 50 mg twice daily, target HR 55–60/min

Coronary vasodilation: e.g., isosorbide mononitrate capsule 40 mg once daily

Lipid regulation: e.g., atorvastatin 20 mg once every evening, LDL < 1.8 mmol/L

Antiplatelet therapy: aspirin 100 mg once every evening

Antihypertensive: e.g., irbesartan/hydrochlorothiazide 150/12.5 mg once daily; maintain BP < 130/80 mmHg if tolerated

Follow-up regularly; seek care if symptoms occur.

Table H8: o1-preview’s response to the above CAD medical case and its evaluation results.

Detailed Evaluation Results

Indicator: Diagnosis and Differential Diagnosis
Applicability: Yes
Scoring Criteria:

Coronary Atherosclerotic Heart Disease (30 points): Effort-induced angina, CCS Class II, high risk (40 points)

Grade 2 Hypertension (Very high-risk group) (30 points)

LLM Response: The patient’s clinical manifestations and auxiliary examination results suggest the possible presence of stable angina, primarily caused by coronary artery stenosis. Coronary CTA shows approximately 70% stenosis in the proximal segment of the left anterior descending artery, which is co...

Pharmacological Treatment (Half of the corresponding points will be deducted for incorrect usage or dosage): Heart rate control (Target heart rate: 55–60 bpm, e.g., Metoprolol tartrate 50 mg orally twice daily or Metoprolol succinate 23.75 mg orally once daily) (15 points); Coronary vasodilation (e.g., Isosorbide mononitrate capsules 40 mg orally once daily...

Regular Follow-up: Follow-up every 1–3 months with monitoring of drug side effects (10 points)

LLM Response: Optimize pharmacological treatment by using aspirin 100 mg once daily to prevent thrombosis, atorvastatin 20 mg once daily to control blood lipids, metoprolol 50 mg twice daily to reduce cardiac workload, and nitroglycerin sustained-release tablets... Additionally, coronary interventional therapy, such as stent implantation, is recommended to improve blood flow.

Healthy diet: A low-salt, low-fat diet is recommended (20 points), with daily salt intake < 6 g and cooking oil < 25 g. The diet should be light, easy to digest, and rich in vitamins. Avoid fried foods, animal organs, and other high-cholesterol foods. Eat small, frequent meals and avoid overeating. Spicy foods should be avoided, and smoking, alcohol, strong...

Appropriate physical activity: Before coronary angiography, light walking is recommended while avoiding strenuous exercise. After the procedure, moderate-intensity aerobic exercise (30–45 minutes per session, 5 times/week) should be resumed gradually based on the patient’s condition. (20 points)

Daily monitoring: Measure and record blood pressure and heart rate daily. Be alert to symptoms such as chest pain, chest tightness, and foamy urine. Seek timely medical consultation and adhere strictly to medication schedules. For patients with coronary artery disease and hypertension, a blood pressure target of < 140/90 mmHg is recommended; if tolerated...

Psychological support: Help patients prevent and relieve mental stress, maintain sufficient sleep, and actively cooperate with the prescribed treatment plan. (20 points)

LLM Response: Instruct the patient to follow a low-salt, low-fat diet; encourage regular and moderate physical activity; educate the patient to take medications on time and regularly moni...
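The rubric excerpts above attach a point value to each scoring criterion and, for pharmacological items, state that half of the corresponding points are deducted for incorrect usage or dosage. A minimal Python sketch of how one indicator's score could be tallied under that reading follows. The names `Match`, `Criterion` and `score_indicator` are hypothetical and not taken from the paper. How each criterion is judged against the LLM response is outside this sketch, and the judgements in the example are invented for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Match(Enum):
    """Hypothetical judgement levels for a single rubric criterion."""
    MISSED = 0.0   # criterion not addressed: no points
    PARTIAL = 0.5  # addressed, but usage or dosage is incorrect: half the points
    FULL = 1.0     # addressed correctly: all points


@dataclass(frozen=True)
class Criterion:
    name: str
    points: float


def score_indicator(judgements):
    """Sum the awarded points over (Criterion, Match) pairs for one indicator."""
    return sum(criterion.points * match.value for criterion, match in judgements)


# Point values are copied from the "Diagnosis and Differential Diagnosis"
# excerpt; the Match judgements are made up for illustration only.
diagnosis = [
    (Criterion("Coronary atherosclerotic heart disease", 30), Match.FULL),
    (Criterion("Effort-induced angina, CCS class II, high risk", 40), Match.MISSED),
    (Criterion("Grade 2 hypertension (very high-risk group)", 30), Match.FULL),
]
print(score_indicator(diagnosis))  # 60.0
```

Keeping the judgement separate from the point value means the same function covers both all-or-nothing criteria and the half-credit pharmacological items. Whether the paper's graders combine indicator scores in any further way is not stated in the excerpts, so the sketch stops at a single indicator.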