Large Language Models versus Classical Machine Learning: Performance in COVID-19 Mortality Prediction Using High-Dimensional Tabular Data

Ali Moradi; Alireza Soheilipour; Ali Soroush; Ameneh Salehi; Amirali Mohsenzadeh-Kermani; Amirali Soheili; Amirreza Allahgholipour; Ankit Sakhuja; Arian Salahi-Niri; Aydin Feyzi

arXiv: 2409.02136 · v2 · submitted 2024-09-02 · cs.LG · cs.AI · cs.CL

Mohammadreza Ghaffarzadeh-Esfahani, Mahdi Ghaffarzadeh-Esfahani, Arian Salahi-Niri, Hossein Toreyhi, Zahra Atf, Amirali Mohsenzadeh-Kermani, Mahshad Sarikhani, Zohreh Tajabadi, Fatemeh Shojaeian, Mohammad Hassan Bagheri, Aydin Feyzi, Mohammadamin Tarighatpayma, Narges Gazmeh, Fateme Heydari, Hossein Afshar, Amirreza Allahgholipour, Farid Alimardani, Ameneh Salehi, Naghmeh Asadimanesh, Mohammad Amin Khalafi, Hadis Shabanipour, Ali Moradi, Sajjad Hossein Zadeh, Omid Yazdani, Romina Esbati, Moozhan Maleki, Danial Samiei Nasr, Amirali Soheili, Hossein Majlesi, Saba Shahsavan, Alireza Soheilipour, Nooshin Goudarzi, Erfan Taherifard, Hamidreza Hatamabadi, Jamil S Samaan, Thomas Savage, Ankit Sakhuja, Ali Soroush, Girish Nadkarni, Ilad Alavi Darazam, Mohamad Amin Pourhoseingholi, Seyed Amir Ahmad Safavi-Naini

Pith reviewed 2026-05-23 21:12 UTC · model grok-4.3

classification cs.LG · cs.AI · cs.CL

keywords COVID-19 mortality prediction · large language models · classical machine learning · tabular data · fine-tuning · XGBoost · QLoRA · external validation

The pith

Classical machine learning models outperform large language models when predicting COVID-19 mortality from high-dimensional tabular patient records.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper evaluates seven classical machine learning models against eight large language models on mortality prediction using structured data from 9,134 patients across four hospitals. XGBoost and random forest reach F1 scores of 0.87 internally and 0.83 externally, while GPT-4 scores 0.43 in zero-shot mode and a fine-tuned Mistral-7b reaches 0.74 externally after QLoRA adaptation. A sympathetic reader would care because the work directly tests whether general-purpose language models can match specialized tabular methods on real medical records without native tabular handling.

Core claim

XGBoost and random forest demonstrated superior performance among CMLs, achieving F1 scores of 0.87 and 0.83 for internal and external validation, respectively. GPT-4 led the LLM category with an F1 score of 0.43, while fine-tuning Mistral-7b significantly improved its recall from 1% to 79%, yielding a stable F1 score of 0.74 during external validation. Although LLMs showed moderate performance in zero-shot classification, fine-tuning substantially enhanced their effectiveness, potentially bridging the gap with CML models. However, CMLs still outperformed LLMs in handling high-dimensional tabular data tasks.

What carries the argument

Comparison of F1, precision and recall between CML baselines (XGBoost, random forest) and LLMs on tabular records converted to text, including zero-shot prompting versus QLoRA fine-tuning of Mistral-7b.
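Because the comparison rests on precision, recall and F1, it may help to see how the three relate. The sketch below uses hypothetical confusion counts (not taken from the paper) to illustrate why a model with about 1% recall scores a near-zero F1 even at perfect precision, and why raising recall to 79% lifts F1 sharply.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Return (precision, recall, F1) for the positive class from confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts, NOT from the paper: a test set with 100 true deaths.
# Model A flags almost nobody: it catches 1 of 100 deaths with no false alarms.
low_recall = precision_recall_f1(tp=1, fp=0, fn=99)
# Model B catches 79 of 100 deaths at the cost of 30 false alarms.
high_recall = precision_recall_f1(tp=79, fp=30, fn=21)

print("low recall :", low_recall)
print("high recall:", high_recall)
```

F1 is the harmonic mean of precision and recall, so it is dominated by whichever of the two is smaller.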

If this is right

Fine-tuning via QLoRA raises Mistral-7b's recall from 1% to 79% and produces a stable external F1 of 0.74.
CMLs remain the stronger choice for accuracy on high-dimensional structured medical data.
Both CMLs and fine-tuned LLMs can support medical predictive modeling tasks.
Text conversion of tabular records works as an input method but does not close the full performance gap.
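For readers unfamiliar with QLoRA: it trains small low-rank adapter matrices on top of a frozen, quantized base model. The sketch below only counts trainable parameters for one weight matrix. The dimensions and rank are illustrative assumptions, since this page does not state the adapter configuration the authors used.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in a LoRA adapter for one d_out x d_in weight matrix.

    The weight update is factored as B @ A, with B of shape (d_out, rank)
    and A of shape (rank, d_in), so only rank * (d_in + d_out) values train.
    """
    return rank * (d_in + d_out)


# Illustrative values only (assumptions, not the paper's settings).
d_in = d_out = 4096
rank = 16

full = d_in * d_out                       # parameters if the full matrix were trained
adapter = lora_trainable_params(d_in, d_out, rank)
fraction = adapter / full

print(f"full matrix: {full:,}  adapter: {adapter:,}  fraction: {fraction:.4%}")
```

Under these assumed sizes the adapter holds under 1% of the parameters of the matrix it modifies, which is what makes fine-tuning a 7B-parameter model practical on modest hardware.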

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Native tabular input formats for LLMs could narrow or eliminate the remaining accuracy difference observed here.
The results indicate that even on structured data, LLMs gain substantially from task-specific fine-tuning rather than zero-shot use.
Repeating the comparison on non-COVID tabular medical datasets would test whether the observed CML advantage generalizes.

Load-bearing premise

Converting tabular patient records into natural language text provides a fair and representative test of LLM capability on structured high-dimensional data.
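To make the premise concrete, here is a minimal sketch of two ways a tabular record might be serialized for an LLM. The field names are borrowed from features mentioned on this page; the values, the sentence template, and both helper functions are hypothetical and are not the authors' encoding, which this page does not describe.

```python
import json


def to_sentence_text(record: dict) -> str:
    """Serialize a record as feature-value phrases in one sentence."""
    parts = [f"{name} is {value}" for name, value in record.items()]
    return "The patient's " + "; ".join(parts) + "."


def to_json_text(record: dict) -> str:
    """Serialize the same record as JSON with a fixed key order."""
    return json.dumps(record, sort_keys=True)


# Hypothetical patient record (values invented for illustration).
record = {"age": 67, "sex": "male", "O2 saturation": 88, "dyspnea": "yes"}

sentence = to_sentence_text(record)
as_json = to_json_text(record)
print(sentence)
print(as_json)
```

Both strings carry the same information, but feature order, phrasing and numeric formatting differ, which is why the choice of encoding can plausibly affect LLM results independently of model class.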

What would settle it

An experiment that feeds the same patient features directly into an LLM in native tabular format and produces F1 scores matching or exceeding XGBoost on the identical external validation set would falsify the superiority claim.

Figures

Figures reproduced from arXiv: 2409.02136 by Ali Moradi, Alireza Soheilipour, Ali Soroush, Ameneh Salehi, Amirali Mohsenzadeh-Kermani, Amirali Soheili, Amirreza Allahgholipour, Ankit Sakhuja, Arian Salahi-Niri, Aydin Feyzi, Danial Samiei Nasr, Erfan Taherifard, Farid Alimardani, Fateme Heydari, Fatemeh Shojaeian, Girish Nadkarni, Hadis Shabanipour, Hamidreza Hatamabadi, Hossein Afshar, Hossein Majlesi, Hossein Toreyhi, Ilad Alavi Darazam, Jamil S Samaan, Mahdi Ghaffarzadeh-Esfahani, Mahshad Sarikhani, Mohamad Amin Pourhoseingholi, Mohammad Amin Khalafi, Mohammadamin Tarighatpayma, Mohammad Hassan Bagheri, Mohammadreza Ghaffarzadeh-Esfahani, Moozhan Maleki, Naghmeh Asadimanesh, Narges Gazmeh, Nooshin Goudarzi, Omid Yazdani, Romina Esbati, Saba Shahsavan, Sajjad Hossein Zadeh, Seyed Amir Ahmad Safavi-Naini, Thomas Savage, Zahra Atf, Zohreh Tajabadi.

**Figure 2.** ROC curves and AUC scores for internal and external validation of COVID-19 mortality prediction models, including the fine-tuned LLM (Mistral-7b) and 7 classical machine learning algorithms. Abbreviations: LLM, Large Language Model; LR, Logistic Regression; SVM, Support Vector Machine; kNN, k-Nearest Neighbors; TPR, True Positive Rate; FPR, False Positive Rate; AUC, Area Under the Curve.

**Figure 3.** Performance comparison of classical machine learning models and fine-tuned large language models in COVID-19 mortality prediction using different sample sizes. This figure presents the performance of seven conventional machine learning (CML) models and a fine-tuned large language model (LLM) in predicting COVID-19 mortality. The performance metrics evaluated are the F1 score and accuracy across di…

**Figure 4.** In panel a, the most influential features are age (11.18%) and O2 saturation (9.89%), followed by LOC (4.83%), lymphocyte count (4.79%), dyspnea (3.76%), and sex (3.68%). Conversely, the influence of features in LLMs, particularly in lower-performing models such as Mistral-7b and GPT4o, appears less coherent, as illustrated in Supplementary Figure S6. This inconsistency contributes to noise in the average feature …

**Figure 4.** SHAP analysis of feature importance in COVID-19 mortality prediction models, including global feature impact in classical machine learning (CML) average (a), CML best performing XGBoost (b), large language models (LLM) average (c), LLM best performing GPT-4 (d), and the granular impact of XGBoost and GPT-4. Abbreviations: CML, classical machine learning; LLM, large language model; VS, vital sign; RR, respir…

**Figure 5.** Figure 5 illustrates how fine-tuning Mistral-7b altered the impact of features at both the global and granular levels. This refinement in prediction logic aligned the top 10 most important features more closely with those of CMLs, resulting in more equitable impact percentages among features and enhanced granularity.

Original abstract

This study compared the performance of classical feature-based machine learning models (CMLs) and large language models (LLMs) in predicting COVID-19 mortality using high-dimensional tabular data from 9,134 patients across four hospitals. Seven CML models, including XGBoost and random forest (RF), were evaluated alongside eight LLMs, such as GPT-4 and Mistral-7b, which performed zero-shot classification on text-converted structured data. Additionally, Mistral- 7b was fine-tuned using the QLoRA approach. XGBoost and RF demonstrated superior performance among CMLs, achieving F1 scores of 0.87 and 0.83 for internal and external validation, respectively. GPT-4 led the LLM category with an F1 score of 0.43, while fine-tuning Mistral-7b significantly improved its recall from 1% to 79%, yielding a stable F1 score of 0.74 during external validation. Although LLMs showed moderate performance in zero-shot classification, fine-tuning substantially enhanced their effectiveness, potentially bridging the gap with CML models. However, CMLs still outperformed LLMs in handling high-dimensional tabular data tasks. This study highlights the potential of both CMLs and fine-tuned LLMs in medical predictive modeling, while emphasizing the current superiority of CMLs for structured data analysis.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper empirically compares seven classical machine learning models (CMLs, e.g., XGBoost, random forest) against eight LLMs (e.g., GPT-4, Mistral-7b) on COVID-19 mortality prediction using high-dimensional tabular data from 9,134 patients across four hospitals. LLMs receive zero-shot prompts on text-converted records, and Mistral-7b is additionally fine-tuned via QLoRA. CMLs achieve higher F1 scores (XGBoost and RF: 0.87 internal, 0.83 external) than LLMs (GPT-4: 0.43 zero-shot; fine-tuned Mistral-7b: 0.74 external), leading to the conclusion that CMLs remain superior for structured tabular tasks while fine-tuning narrows the gap for LLMs.

Significance. If the performance gap is not an artifact of input encoding, the work supplies a sizable-cohort, multi-hospital validation benchmark that quantifies current LLM limitations on high-dimensional medical tabular data and demonstrates the practical value of QLoRA fine-tuning for recall improvement. The internal-plus-external validation design is a strength that supports generalizability claims.

major comments (3)

[Methods] Methods (data conversion and prompting): the manuscript provides no description of the exact natural-language encoding scheme, prompt templates, or handling of numerical precision and feature ordering when converting the 9,134-patient tabular records to text. Without these details or an ablation across alternative encodings (JSON, feature-value pairs, embeddings), the observed gap (XGBoost F1 0.87 vs. fine-tuned Mistral F1 0.74) cannot be confidently attributed to model class rather than representation.
[Results] Results (model comparison): no statistical tests, confidence intervals, or multiple-comparison corrections are reported for the F1 differences between CMLs and LLMs, nor for the recall jump from 1% to 79% after fine-tuning. This weakens the claim that CMLs “still outperformed LLMs.”
[Methods] Methods (class imbalance and preprocessing): the abstract and methods omit any description of class-imbalance handling, feature selection, or preprocessing steps applied to the high-dimensional tabular data before either CML training or LLM text conversion, both of which are load-bearing for the reported F1 scores.
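The second major comment asks for confidence intervals around F1. One common way to obtain them is a percentile bootstrap over test-set patients. The sketch below is a minimal illustration on synthetic labels, not the authors' procedure; the data, sample size and number of resamples are all assumptions.

```python
import random


def f1_score(y_true, y_pred):
    """F1 for the positive class (label 1) from paired label lists."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def bootstrap_f1_ci(y_true, y_pred, n_boot=500, alpha=0.05, seed=0):
    """Percentile bootstrap interval: resample patients with replacement."""
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(f1_score([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    scores.sort()
    cut = int(alpha / 2 * n_boot)          # number of resamples trimmed per tail
    return scores[cut], scores[n_boot - 1 - cut]


# Synthetic data (an assumption): ~30% positives, predictions right ~85% of the time.
data_rng = random.Random(42)
y_true = [1 if data_rng.random() < 0.3 else 0 for _ in range(300)]
y_pred = [t if data_rng.random() < 0.85 else 1 - t for t in y_true]

point = f1_score(y_true, y_pred)
lo, hi = bootstrap_f1_ci(y_true, y_pred)
print(f"F1 = {point:.3f}, 95% bootstrap CI = [{lo:.3f}, {hi:.3f}]")
```

Comparing two models on the same patients would typically resample the same indices for both and bootstrap the difference in F1, so that the pairing is preserved.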

minor comments (2)

[Abstract] Abstract contains a typographical error: “Mistral- 7b” should read “Mistral-7b.”
[Introduction] The paper should cite prior work on tabular-to-text prompting strategies for LLMs to situate the chosen encoding.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major point below and have revised the manuscript to incorporate additional methodological details and statistical reporting where appropriate.

Referee: [Methods] Methods (data conversion and prompting): the manuscript provides no description of the exact natural-language encoding scheme, prompt templates, or handling of numerical precision and feature ordering when converting the 9,134-patient tabular records to text. Without these details or an ablation across alternative encodings (JSON, feature-value pairs, embeddings), the observed gap (XGBoost F1 0.87 vs. fine-tuned Mistral F1 0.74) cannot be confidently attributed to model class rather than representation.

Authors: We agree that the original manuscript did not describe the text conversion process in sufficient detail. The revised version will include the exact prompt templates, a full description of the natural-language encoding (feature-value pairs with specified numerical formatting and ordering), and any relevant ablations or sensitivity checks performed during prompt design. This will allow readers to evaluate whether performance differences arise from representation choices.

revision: yes

Referee: [Results] Results (model comparison): no statistical tests, confidence intervals, or multiple-comparison corrections are reported for the F1 differences between CMLs and LLMs, nor for the recall jump from 1% to 79% after fine-tuning. This weakens the claim that CMLs “still outperformed LLMs.”

Authors: We accept that the absence of confidence intervals and formal tests limits the strength of the comparison. In revision we will add bootstrap confidence intervals around all reported F1 and recall values and will note the lack of multiple-comparison correction as a limitation. The large cohort size makes the observed gaps (0.87 vs. 0.43; recall 1% to 79%) unlikely to be sampling artifacts, but the added quantification will address the concern directly.

revision: yes

Referee: [Methods] Methods (class imbalance and preprocessing): the abstract and methods omit any description of class-imbalance handling, feature selection, or preprocessing steps applied to the high-dimensional tabular data before either CML training or LLM text conversion, both of which are load-bearing for the reported F1 scores.

Authors: We acknowledge the omission. The revised methods section will explicitly document class-imbalance handling (including any weighting or threshold adjustment used for CMLs), feature selection or scaling steps, and all preprocessing applied prior to both CML training and LLM text conversion. These details were used in the experiments and will now be reported in full.

revision: yes
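The last response mentions class weighting and threshold adjustment without saying which was used. As a hedged illustration of what those two terms usually mean, the sketch below computes inverse-frequency class weights and picks the decision threshold that maximizes F1. The labels, scores and candidate thresholds are invented, and nothing here is claimed to match the authors' pipeline.

```python
from collections import Counter


def balanced_class_weights(y):
    """Inverse-frequency weights: n_samples / (n_classes * count_of_class)."""
    counts = Counter(y)
    n, k = len(y), len(counts)
    return {label: n / (k * c) for label, c in counts.items()}


def f1_at_threshold(y_true, scores, threshold):
    """F1 for the positive class when predicting 1 iff score >= threshold."""
    tp = fp = fn = 0
    for t, s in zip(y_true, scores):
        pred = 1 if s >= threshold else 0
        if pred == 1 and t == 1:
            tp += 1
        elif pred == 1 and t == 0:
            fp += 1
        elif pred == 0 and t == 1:
            fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_threshold(y_true, scores, candidates):
    """Return the candidate threshold with the highest F1."""
    return max(candidates, key=lambda th: f1_at_threshold(y_true, scores, th))


# Invented validation data: 8 survivors (0), 2 deaths (1), with model scores.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
scores = [0.05, 0.10, 0.10, 0.20, 0.20, 0.30, 0.35, 0.45, 0.40, 0.60]

weights = balanced_class_weights(y_true)
chosen = best_threshold(y_true, scores, candidates=[0.3, 0.4, 0.5])
print("class weights:", weights)
print("chosen threshold:", chosen)
```

In practice the threshold would be tuned on a validation split and then frozen before scoring the external test set, otherwise the reported F1 is optimistically biased.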

Circularity Check

0 steps flagged

No circularity: purely empirical comparison with no derivations or self-referential predictions

The paper conducts an empirical benchmark of CMLs (XGBoost, RF, etc.) versus LLMs (GPT-4, Mistral-7b zero-shot and QLoRA fine-tuned) on a fixed 9,134-patient tabular COVID-19 dataset converted to text. All reported metrics (F1 scores, recall) are obtained directly from train/validation splits and external testing; no equations, fitted parameters renamed as predictions, uniqueness theorems, or self-citation chains appear in the derivation of the central claim. The study is therefore self-contained against external benchmarks and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard supervised learning evaluation protocols and the untested premise that text conversion is an appropriate interface for LLMs on tabular inputs.

axioms (2)

domain assumption: Mortality labels in the dataset are accurate and free of significant noise or misclassification. The study treats recorded mortality as ground truth without discussing potential label error sources common in hospital data.
domain assumption: Text conversion of tabular records preserves all predictive signal for LLM zero-shot and fine-tuned classification. This premise underpins the entire LLM arm of the comparison described in the abstract.

Reference graph

Works this paper leans on

31 extracted references · 31 canonical work pages

[1] Karabacak M, Margetis K. Embracing Large Language Models for Medical Applications: Opportunities and Challenges. Cureus. 2023.

[2] Dathathri S, Madotto A, Lan J, Hung J, Frank E, Molino P, et al. Plug and Play Language Models: A Simple Approach to Controlled Text Generation. In: 8th International Conference on Learning Representations, ICLR 2020. 2020.

[3] Han JM, Babuschkin I, Edwards H, Neelakantan A, Xu T, Polu S, et al. Unsupervised neural machine translation with generative language models only. arXiv preprint arXiv:2110.05448. 2021.

[4] Petroni F, Rocktäschel T, Lewis P, Bakhtin A, Wu Y, Miller AH, et al. Language models as knowledge bases? arXiv preprint arXiv:1909.01066. 2019.

[5] Vaid A, Lampert J, Lee J, Sawant A, Apakama D, Sakhuja A, et al. Generative Large Language Models are autonomous practitioners of evidence-based medicine. arXiv preprint arXiv:2401.02851. 2024.

[6] Zhang D, Yin C, Zeng J, Yuan X, Zhang P. Combining structured and unstructured data for predictive models: a deep learning approach. BMC Med Inform Decis Mak. 2020;20:1–11.

[7] Sedlakova J, Daniore P, Horn Wintsch A, Wolf M, Stanikic M, Haag C, et al. Challenges and best practices for digital unstructured data enrichment in health research: A systematic narrative review. PLOS Digital Health [Internet]. 2023 Oct 11;2(10):e0000347. Available from: https://doi.org/10.1371/journal.pdig.0000347

[8] Zhou H, Gu B, Zou X, Li Y, Chen SS, Zhou P, et al. A survey of large language models in medicine: Progress, application, and challenge. arXiv preprint arXiv:2311.05112. 2023.

[9] Wornow M, Xu Y, Thapa R, Patel B, Steinberg E, Fleming S, et al. The shaky foundations of large language models and foundation models for electronic health records. NPJ Digit Med [Internet]. 2023;6(1):135. Available from: https://doi.org/10.1038/s41746-023-00879-8

[10] Hegselmann S, Buendia A, Lang H, Agrawal M, Jiang X, Sontag D. TabLLM: Few-shot classification of tabular data with large language models. In: International Conference on Artificial Intelligence and Statistics. PMLR; 2023. p. 5549–81.

[11] Wang Z, Gao C, Xiao C, Sun J. MediTab: Scaling Medical Tabular Data Predictors via Data Consolidation, Enrichment, and Refinement. arXiv preprint arXiv:2305.12081. 2023.

[12] Cui H, Shen Z, Zhang J, Shao H, Qin L, Ho JC, et al. LLMs-based Few-Shot Disease Predictions using EHR: A Novel Approach Combining Predictive Agent Reasoning and Critical Agent Instruction. arXiv preprint arXiv:2403.15464. 2024.

[14] Patel D, Timsina P, Gorenstein L, Glicksberg BS, Raut G, Cheetirala SN, et al. Comparative Analysis of a Large Language Model and Machine Learning Method for Prediction of Hospitalization from Nurse Triage Notes: Implications for Machine Learning-based Resource Management. medRxiv [Internet]. 2023 Jan 1;2023.08.07.23293699. Available from: http://medrxiv...

[15] Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, et al. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research. 2011;12:2825–30.

[16] Simpson L, Combettes PL, Müller CL. c-lasso -- a Python package for constrained sparse and robust regression and classification. arXiv preprint arXiv:2011.00898. 2020.

[17] Liu XY, Wu J, Zhou ZH. Exploratory Undersampling for Class-Imbalance Learning. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics). 2009;39(2):539–50.

[18] Devlin J, Chang MW, Lee K, Toutanova K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In: North American Chapter of the Association for Computational Linguistics [Internet]. 2019. Available from: https://api.semanticscholar.org/CorpusID:52967399

[19] Wang G, Liu X, Ying Z, Yang G, Chen Z, Liu Z, et al. Optimized glycemic control of type 2 diabetes with reinforcement learning: a proof-of-concept trial. Nat Med [Internet]. 2023;29(10):2633–42. Available from: https://doi.org/10.1038/s41591-023-02552-9

[20] Dettmers T, Pagnoni A, Holtzman A, Zettlemoyer L. QLoRA: Efficient finetuning of quantized LLMs. Adv Neural Inf Process Syst. 2024;36.

[21] Nazary F, Deldjoo Y, Di Noia T, di Sciascio E. XAI4LLM. Let Machine Learning Models and LLMs Collaborate for Enhanced In-Context Learning in Healthcare. arXiv preprint arXiv:2405.06270. 2024.

[22] Kojima T, Gu SS, Reid M, Matsuo Y, Iwasawa Y. Large language models are zero-shot reasoners. Adv Neural Inf Process Syst. 2022;35:22199–213.

[23] Brown T, Mann B, Ryder N, Subbiah M, Kaplan JD, Dhariwal P, et al. Language models are few-shot learners. Adv Neural Inf Process Syst. 2020;33:1877–901.

[24] Han Z, Gao C, Liu J, Zhang SQ. Parameter-efficient fine-tuning for large models: A comprehensive survey. arXiv preprint arXiv:2403.14608. 2024.

[25] Kanzawa J, Yasaka K, Fujita N, Fujiwara S, Abe O. Automated classification of brain MRI reports using fine-tuned large language models. Neuroradiology [Internet]. 2024; Available from: https://doi.org/10.1007/s00234-024-03427-7

[26] Akbasli IT, Birbilen AZ, Teksam O. Human-Like Named Entity Recognition with Large Language Models in Unstructured Text-based Electronic Healthcare Records: An Evaluation Study. 2024.

[27] O’Neill M, Connor M. Amplifying Limitations, Harms and Risks of Large Language Models. arXiv preprint arXiv:2307.04821. 2023.

[28] Savage T, Wang J, Gallo R, Boukil A, Patel V, Ahmad Safavi-Naini SA, et al. Large Language Model Uncertainty Measurement and Calibration for Medical Diagnosis and Treatment. medRxiv [Internet]. 2024 Jan 1;2024.06.06.24308399. Available from: http://medrxiv.org/content/early/2024/06/10/2024.06.06.24308399.abstract

[29] Wirth FN, Meurers T, Johns M, Prasser F. Privacy-preserving data sharing infrastructures for medical research: systematization and comparison. BMC Med Inform Decis Mak [Internet]. 2021;21(1):242. Available from: https://doi.org/10.1186/s12911-021-01602-x

[30] Gangavarapu A. Introducing L2M3, A Multilingual Medical Large Language Model to Advance Health Equity in Low-Resource Regions. arXiv preprint arXiv:2404.08705. 2024.

[31] Sivarajkumar S, Gao F, Denny P, Aldhahwani B, Visweswaran S, Bove A, et al. Mining Clinical Notes for Physical Rehabilitation Exercise Information: Natural Language Processing Algorithm Development and Validation Study. JMIR Med Inform [Internet]. 2024;12:e52289. Available from: http://dx.doi.org/10.2196/52289

[32] Chen S, Li Y, Lu S, Van H, Aerts HJWL, Savova GK, et al. Evaluating the ChatGPT family of models for biomedical reasoning and classification. J Am Med Inform Assoc [Internet]. 2024;31(4):940–8. Available from: http://dx.doi.org/10.1093/jamia/ocad256

[1] [1]

Embracing Large Language Models for Medical Applications: Opportunities and Challenges

Karabacak M, Margetis K. Embracing Large Language Models for Medical Applications: Opportunities and Challenges. Cureus. 2023

work page 2023

[2] [2]

PLUG AND PLAY LANGUAGE MODELS: A SIMPLE APPROACH TO CONTROLLED TEXT GENERATION

Dathathri S, Madotto A, Lan J, Hung J, Frank E, Molino P, et al. PLUG AND PLAY LANGUAGE MODELS: A SIMPLE APPROACH TO CONTROLLED TEXT GENERATION. In: 8th International Conference on Learning Representations, ICLR 2020. 2020

work page 2020

[3] [3]

Unsupervised neural machine translation with generative language models only

Han JM, Babuschkin I, Edwards H, Neelakantan A, Xu T, Polu S, et al. Unsupervised neural machine translation with generative language models only. arXiv preprint arXiv:211005448. 2021

work page 2021

[4] [4]

Language models as knowledge bases? arXiv preprint arXiv:190901066

Petroni F, Rocktäschel T, Lewis P, Bakhtin A, Wu Y , Miller AH, et al. Language models as knowledge bases? arXiv preprint arXiv:190901066. 2019

work page 2019

[5] [5]

Generative Large Language Models are autonomous practitioners of evidence-based medicine

Vaid A, Lampert J, Lee J, Sawant A, Apakama D, Sakhuja A, et al. Generative Large Language Models are autonomous practitioners of evidence-based medicine. arXiv preprint arXiv:240102851. 2024; 36

work page 2024

[6] [6]

Combining structured and unstructured data for predictive models: a deep learning approach

Zhang D, Yin C, Zeng J, Yuan X, Zhang P. Combining structured and unstructured data for predictive models: a deep learning approach. BMC Med Inform Decis Mak. 2020;20:1–11

work page 2020

[7] [7]

Challenges and best practices for digital unstructured data enrichment in health research: A systematic narrative review

Sedlakova J, Daniore P, Horn Wintsch A, Wolf M, Stanikic M, Haag C, et al. Challenges and best practices for digital unstructured data enrichment in health research: A systematic narrative review. PLOS Digital Health [Internet]. 2023 Oct 11;2(10):e00003 47-. Available from: https://doi.org/10.1371/journal.pdig.0000347

work page doi:10.1371/journal.pdig.0000347 2023

[8] [8]

A survey of large language models in medicine: Progress, application, and challenge

Zhou H, Gu B, Zou X, Li Y , Chen SS, Zhou P, et al. A survey of large language models in medicine: Progress, application, and challenge. arXiv preprint arXiv:231105112. 2023

work page 2023

[9] [9]

The shaky foundations of large language models and foundation models for electronic health records

Wornow M, Xu Y , Thapa R, Patel B, Steinberg E, Fleming S, et al. The shaky foundations of large language models and foundation models for electronic health records. NPJ Digit Med [Internet]. 2023;6(1):135. Available from: https://doi.org/10.1038/s41746-023-00879- 8

work page doi:10.1038/s41746-023-00879- 2023

[10] [10]

Tabllm: Few -shot classification of tabular data with large language models

Hegselmann S, Buendia A, Lang H, Agrawal M, Jiang X, Sontag D. Tabllm: Few -shot classification of tabular data with large language models. In: International Conference on Artificial Intelligence and Statistics. PMLR; 2023. p. 5549–81

work page 2023

[11] [11]

MediTab: Scaling Medical Tabular Data Predictors via Data Consolidation, Enrichment, and Refinement

Wang Z, Gao C, Xiao C, Sun J. MediTab: Scaling Medical Tabular Data Predictors via Data Consolidation, Enrichment, and Refinement. arXiv preprint arXiv:230512081. 2023

work page 2023

[12] [12]

LLMs -based Few -Shot Disease Predictions using EHR: A Novel Approach Combining Predictive Agent Reasoning and Critical Agent Instruction

Cui H, Shen Z, Zhang J, Shao H, Qin L, Ho JC, et al. LLMs -based Few -Shot Disease Predictions using EHR: A Novel Approach Combining Predictive Agent Reasoning and Critical Agent Instruction. arXiv preprint arXiv:240315464. 2024; 37

work page 2024

[13] [14]

Patel D, Timsina P, Gorenstein L, Glicksberg BS, Raut G, Cheetirala SN, et al. Comparative Analysis of a Large Language Model and Machine Learning Method for Prediction of Hospitalization from Nurse Triage Notes: Implications for Machine Learning -based Resource Management. medRxiv [Internet]. 2023 Jan 1;2023.08.07.23293699. Available from: http://medrxiv...

work page 2023

[14] [15]

Scikit-learn: Machine learning in Python

Pedregosa F, Varoquaux G, Gramfort A, Michel V , Thirion B, Grisel O, et al. Scikit-learn: Machine learning in Python. the Journal of machine Learning research. 2011;12:2825–30

work page 2011

[15] [16]

c -lasso--a Python package for constrained sparse and robust regression and classification

Simpson L, Combettes PL, Müller CL. c -lasso--a Python package for constrained sparse and robust regression and classification. arXiv preprint arXiv:201100898. 2020

work page 2020

[16] [17]

Exploratory Undersampling for Class-Imbalance Learning

Liu XY , Wu J, Zhou ZH. Exploratory Undersampling for Class-Imbalance Learning. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics). 2009;39(2):539–50

work page 2009

[17] [18]

BERT: Pre -training of Deep Bidirectional Transformers for Language Understanding

Devlin J, Chang MW, Lee K, Toutanova K. BERT: Pre -training of Deep Bidirectional Transformers for Language Understanding. In: North American Chapter of the Association for Computational Linguistics [Internet]. 2019. Available from: https://api.semanticscholar.org/CorpusID:52967399

work page 2019

[18] [19]

Optimized glycemic control of type 2 diabetes with reinforcement learning: a proof -of-concept trial

Wang G, Liu X, Ying Z, Yang G, Chen Z, Liu Z, et al. Optimized glycemic control of type 2 diabetes with reinforcement learning: a proof -of-concept trial. Nat Med [Internet]. 2023;29(10):2633–42. Available from: https://doi.org/10.1038/s41591-023-02552-9 38

[19] [20]

QLoRA: Efficient finetuning of quantized LLMs

Dettmers T, Pagnoni A, Holtzman A, Zettlemoyer L. QLoRA: Efficient finetuning of quantized LLMs. Adv Neural Inf Process Syst. 2024;36

[20] [21]

Nazary F, Deldjoo Y, Di Noia T, di Sciascio E. XAI4LLM. Let Machine Learning Models and LLMs Collaborate for Enhanced In-Context Learning in Healthcare. arXiv preprint arXiv:2405.06270. 2024

[21] [22]

Large language models are zero-shot reasoners

Kojima T, Gu SS, Reid M, Matsuo Y, Iwasawa Y. Large language models are zero-shot reasoners. Adv Neural Inf Process Syst. 2022;35:22199–213

[22] [23]

Language models are few-shot learners

Brown T, Mann B, Ryder N, Subbiah M, Kaplan JD, Dhariwal P, et al. Language models are few-shot learners. Adv Neural Inf Process Syst. 2020;33:1877–901

[23] [24]

Parameter-efficient fine-tuning for large models: A comprehensive survey

Han Z, Gao C, Liu J, Zhang SQ. Parameter-efficient fine-tuning for large models: A comprehensive survey. arXiv preprint arXiv:2403.14608. 2024

[24] [25]

Automated classification of brain MRI reports using fine-tuned large language models

Kanzawa J, Yasaka K, Fujita N, Fujiwara S, Abe O. Automated classification of brain MRI reports using fine-tuned large language models. Neuroradiology [Internet]. 2024; Available from: https://doi.org/10.1007/s00234-024-03427-7

[25] [26]

Human-Like Named Entity Recognition with Large Language Models in Unstructured Text-based Electronic Healthcare Records: An Evaluation Study

Akbasli IT, Birbilen AZ, Teksam O. Human-Like Named Entity Recognition with Large Language Models in Unstructured Text-based Electronic Healthcare Records: An Evaluation Study. 2024

[26] [27]

Amplifying Limitations, Harms and Risks of Large Language Models

O’Neill M, Connor M. Amplifying Limitations, Harms and Risks of Large Language Models. arXiv preprint arXiv:2307.04821. 2023

[27] [28]

Large Language Model Uncertainty Measurement and Calibration for Medical Diagnosis and Treatment

Savage T, Wang J, Gallo R, Boukil A, Patel V, Ahmad Safavi-Naini SA, et al. Large Language Model Uncertainty Measurement and Calibration for Medical Diagnosis and Treatment. medRxiv [Internet]. 2024 Jan 1;2024.06.06.24308399. Available from: http://medrxiv.org/content/early/2024/06/10/2024.06.06.24308399.abstract

[28] [29]

Privacy-preserving data sharing infrastructures for medical research: systematization and comparison

Wirth FN, Meurers T, Johns M, Prasser F. Privacy-preserving data sharing infrastructures for medical research: systematization and comparison. BMC Med Inform Decis Mak [Internet]. 2021;21(1):242. Available from: https://doi.org/10.1186/s12911-021-01602-x

[29] [30]

Introducing L2M3, A Multilingual Medical Large Language Model to Advance Health Equity in Low-Resource Regions

Gangavarapu A. Introducing L2M3, A Multilingual Medical Large Language Model to Advance Health Equity in Low-Resource Regions. arXiv preprint arXiv:2404.08705. 2024

[30] [31]

Mining Clinical Notes for Physical Rehabilitation Exercise Information: Natural Language Processing Algorithm Development and Validation Study

Sivarajkumar S, Gao F, Denny P, Aldhahwani B, Visweswaran S, Bove A, et al. Mining Clinical Notes for Physical Rehabilitation Exercise Information: Natural Language Processing Algorithm Development and Validation Study. JMIR Med Inform [Internet]. 2024;12:e52289. Available from: http://dx.doi.org/10.2196/52289

[31] [32]

Evaluating the ChatGPT family of models for biomedical reasoning and classification

Chen S, Li Y, Lu S, Van H, Aerts HJWL, Savova GK, et al. Evaluating the ChatGPT family of models for biomedical reasoning and classification. J Am Med Inform Assoc [Internet]. 2024;31(4):940–8. Available from: http://dx.doi.org/10.1093/jamia/ocad256

This is a supplementary file to "Large Language Models versus Classical Machine Learning: Performance ...