Large Language Models versus Classical Machine Learning: Performance in COVID-19 Mortality Prediction Using High-Dimensional Tabular Data
Pith reviewed 2026-05-23 21:12 UTC · model grok-4.3
The pith
Classical machine learning models outperform large language models when predicting COVID-19 mortality from high-dimensional tabular patient records.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
XGBoost and random forest demonstrated superior performance among CMLs, achieving F1 scores of 0.87 and 0.83 for internal and external validation, respectively. GPT-4 led the LLM category with an F1 score of 0.43, while fine-tuning Mistral-7b significantly improved its recall from 1% to 79%, yielding a stable F1 score of 0.74 during external validation. Although LLMs showed moderate performance in zero-shot classification, fine-tuning substantially enhanced their effectiveness, potentially bridging the gap with CML models. However, CMLs still outperformed LLMs in handling high-dimensional tabular data tasks.
What carries the argument
Comparison of F1, precision and recall between CML baselines (XGBoost, random forest) and LLMs on tabular records converted to text, including zero-shot prompting versus QLoRA fine-tuning of Mistral-7b.
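Because the whole comparison rests on F1, precision and recall, a small sketch of how the three numbers relate may help. The function name and the confusion counts below are hypothetical, chosen only to show why a recall near 1% (the figure reported for Mistral-7b before fine-tuning) pins F1 near zero however high precision is.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Return (precision, recall, F1) for the positive class from confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: 100 true deaths, the model flags only 1 and is right.
p, r, f1 = precision_recall_f1(tp=1, fp=0, fn=99)
print(f"precision={p:.2f} recall={r:.2f} F1={f1:.3f}")
# prints: precision=1.00 recall=0.01 F1=0.020
```

The harmonic mean is dominated by the smaller of the two inputs, which is why a large recall gain translates so directly into a usable F1.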
If this is right
- Fine-tuning via QLoRA raises LLM recall from 1% to 79% and produces a stable external F1 of 0.74.
- CMLs remain the stronger choice for accuracy on high-dimensional structured medical data.
- Both CMLs and fine-tuned LLMs can support medical predictive modeling tasks.
- Text conversion of tabular records works as an input method but does not close the full performance gap.
Where Pith is reading between the lines
- Native tabular input formats for LLMs could narrow or eliminate the remaining accuracy difference observed here.
- The results indicate that even on structured data, LLMs gain substantially from task-specific fine-tuning rather than zero-shot use.
- Repeating the comparison on non-COVID tabular medical datasets would test whether the observed CML advantage generalizes.
Load-bearing premise
Converting tabular patient records into natural language text provides a fair and representative test of LLM capability on structured high-dimensional data.
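To make the premise concrete, here is a minimal sketch of two ways a single tabular row could be turned into text for an LLM. The paper's actual encoding and prompt template are not described in the material above, so the feature names, values, wording, and both helper functions are assumptions for illustration only.

```python
import json

# Hypothetical patient record; feature names and values are illustrative only.
record = {"age": 67, "sex": "male", "oxygen_saturation": 88.456, "diabetes": True}


def to_feature_value_text(row: dict, order: list[str], decimals: int = 1) -> str:
    """Serialize one row as 'feature is value' clauses in a fixed feature order."""
    parts = []
    for name in order:
        value = row[name]
        if isinstance(value, bool):
            value = "yes" if value else "no"
        elif isinstance(value, float):
            # Numerical precision is an explicit choice that can affect results.
            value = f"{value:.{decimals}f}"
        parts.append(f"{name.replace('_', ' ')} is {value}")
    return "; ".join(parts) + "."


def to_json_text(row: dict, order: list[str]) -> str:
    """Serialize the same row as JSON, preserving the chosen feature order."""
    return json.dumps({name: row[name] for name in order})


order = ["age", "sex", "oxygen_saturation", "diabetes"]
prompt = (
    "Patient record: " + to_feature_value_text(record, order)
    + " Will this patient die in hospital? Answer yes or no."
)
print(prompt)
print(to_json_text(record, order))
```

Feature order, rounding, and phrasing are all free choices in a scheme like this, which is why the encoding itself can plausibly move the measured LLM performance.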
What would settle it
An experiment that feeds the same patient features directly into an LLM in native tabular format and produces F1 scores matching or exceeding XGBoost on the identical external validation set would falsify the superiority claim.
Original abstract
This study compared the performance of classical feature-based machine learning models (CMLs) and large language models (LLMs) in predicting COVID-19 mortality using high-dimensional tabular data from 9,134 patients across four hospitals. Seven CML models, including XGBoost and random forest (RF), were evaluated alongside eight LLMs, such as GPT-4 and Mistral-7b, which performed zero-shot classification on text-converted structured data. Additionally, Mistral- 7b was fine-tuned using the QLoRA approach. XGBoost and RF demonstrated superior performance among CMLs, achieving F1 scores of 0.87 and 0.83 for internal and external validation, respectively. GPT-4 led the LLM category with an F1 score of 0.43, while fine-tuning Mistral-7b significantly improved its recall from 1% to 79%, yielding a stable F1 score of 0.74 during external validation. Although LLMs showed moderate performance in zero-shot classification, fine-tuning substantially enhanced their effectiveness, potentially bridging the gap with CML models. However, CMLs still outperformed LLMs in handling high-dimensional tabular data tasks. This study highlights the potential of both CMLs and fine-tuned LLMs in medical predictive modeling, while emphasizing the current superiority of CMLs for structured data analysis.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper empirically compares seven classical machine learning models (CMLs, e.g., XGBoost, random forest) against eight LLMs (e.g., GPT-4, Mistral-7b) on COVID-19 mortality prediction using high-dimensional tabular data from 9,134 patients across four hospitals. LLMs receive zero-shot prompts on text-converted records, and Mistral-7b is additionally fine-tuned via QLoRA. The best CMLs, XGBoost and RF, reach F1 scores of 0.87 (internal validation) and 0.83 (external validation), against 0.43 for zero-shot GPT-4 and 0.74 for fine-tuned Mistral-7b on external validation. The paper concludes that CMLs remain superior for structured tabular tasks, while fine-tuning narrows the gap for LLMs.
Significance. If the performance gap is not an artifact of input encoding, the work supplies a sizable-cohort, multi-hospital validation benchmark that quantifies current LLM limitations on high-dimensional medical tabular data and demonstrates the practical value of QLoRA fine-tuning for recall improvement. The internal-plus-external validation design is a strength that supports generalizability claims.
major comments (3)
- [Methods] Methods (data conversion and prompting): the manuscript provides no description of the exact natural-language encoding scheme, prompt templates, or handling of numerical precision and feature ordering when converting the 9,134-patient tabular records to text. Without these details or an ablation across alternative encodings (JSON, feature-value pairs, embeddings), the observed gap (XGBoost F1 0.87 vs. fine-tuned Mistral F1 0.74) cannot be confidently attributed to model class rather than representation.
- [Results] Results (model comparison): no statistical tests, confidence intervals, or multiple-comparison corrections are reported for the F1 differences between CMLs and LLMs, nor for the recall jump from 1% to 79% after fine-tuning. This weakens the claim that CMLs “still outperformed LLMs.”
- [Methods] Methods (class imbalance and preprocessing): the abstract and methods omit any description of class-imbalance handling, feature selection, or preprocessing steps applied to the high-dimensional tabular data before either CML training or LLM text conversion, both of which are load-bearing for the reported F1 scores.
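One common way to supply the confidence intervals requested in the Results comment is a percentile bootstrap over patients. The sketch below uses only the standard library and synthetic labels; the function names, resample count, and data are assumptions, not the authors' procedure.

```python
import random


def f1_score(y_true: list[int], y_pred: list[int]) -> float:
    """F1 for the positive class (label 1), computed as 2TP / (2TP + FP + FN)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def bootstrap_f1_ci(y_true, y_pred, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI: resample patients with replacement, recompute F1."""
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(f1_score([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    scores.sort()
    low = scores[int((alpha / 2) * n_resamples)]
    high = scores[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return low, high


# Synthetic labels for illustration only (not the study's data).
rng = random.Random(42)
y_true = [1 if rng.random() < 0.2 else 0 for _ in range(500)]
y_pred = [t if rng.random() < 0.9 else 1 - t for t in y_true]
point = f1_score(y_true, y_pred)
low, high = bootstrap_f1_ci(y_true, y_pred)
print(f"F1 = {point:.3f}, 95% CI [{low:.3f}, {high:.3f}]")
```

Resampling both models' predictions with the same patient indices would also give an interval for the F1 difference between two models, which is closer to what the comparison claim needs.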
minor comments (2)
- [Abstract] Abstract contains a typographical error: “Mistral- 7b” should read “Mistral-7b.”
- [Introduction] The paper should cite prior work on tabular-to-text prompting strategies for LLMs to situate the chosen encoding.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments. We address each major point below and have revised the manuscript to incorporate additional methodological details and statistical reporting where appropriate.
- Referee: [Methods] Methods (data conversion and prompting): the manuscript provides no description of the exact natural-language encoding scheme, prompt templates, or handling of numerical precision and feature ordering when converting the 9,134-patient tabular records to text. Without these details or an ablation across alternative encodings (JSON, feature-value pairs, embeddings), the observed gap (XGBoost F1 0.87 vs. fine-tuned Mistral F1 0.74) cannot be confidently attributed to model class rather than representation.
  Authors: We agree that the original manuscript omitted sufficient detail on the text conversion process. The revised version will include the exact prompt templates, a full description of the natural-language encoding (feature-value pairs with specified numerical formatting and ordering), and any relevant ablations or sensitivity checks performed during prompt design. This will allow readers to evaluate whether performance differences arise from representation choices.
  Revision: yes
- Referee: [Results] Results (model comparison): no statistical tests, confidence intervals, or multiple-comparison corrections are reported for the F1 differences between CMLs and LLMs, nor for the recall jump from 1% to 79% after fine-tuning. This weakens the claim that CMLs “still outperformed LLMs.”
  Authors: We accept that the absence of confidence intervals and formal tests limits the strength of the comparison. In revision we will add bootstrap confidence intervals around all reported F1 and recall values and will note the lack of multiple-comparison correction as a limitation. The large cohort size makes the observed gaps (0.87 vs. 0.43; recall 1% to 79%) unlikely to be sampling artifacts, but the added quantification will address the concern directly.
  Revision: yes
- Referee: [Methods] Methods (class imbalance and preprocessing): the abstract and methods omit any description of class-imbalance handling, feature selection, or preprocessing steps applied to the high-dimensional tabular data before either CML training or LLM text conversion, both of which are load-bearing for the reported F1 scores.
  Authors: We acknowledge the omission. The revised methods section will explicitly document class-imbalance handling (including any weighting or threshold adjustment used for CMLs), feature selection or scaling steps, and all preprocessing applied prior to both CML training and LLM text conversion. These details were used in the experiments and will now be reported in full.
  Revision: yes
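The weighting and threshold adjustment mentioned in the last response can take many forms; the rebuttal does not say which was used. As a hedged illustration, the sketch below shows an inverse-frequency weighting (the same heuristic scikit-learn applies for `class_weight="balanced"`) and a simple decision-threshold shift. Function names and data are hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels: list[int]) -> dict[int, float]:
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


def predict_with_threshold(probabilities: list[float], threshold: float = 0.5) -> list[int]:
    """Turn predicted mortality probabilities into labels at a chosen cut-off."""
    return [1 if p >= threshold else 0 for p in probabilities]


# Synthetic example: 90 survivors (0) and 10 deaths (1); not the study's data.
labels = [0] * 90 + [1] * 10
print(balanced_class_weights(labels))
print(predict_with_threshold([0.10, 0.35, 0.80], threshold=0.3))
# second line prints: [0, 1, 1]
```

Either choice shifts the precision/recall trade-off, so reporting which one was applied matters for interpreting the F1 scores.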
Circularity Check
No circularity: purely empirical comparison with no derivations or self-referential predictions
The paper conducts an empirical benchmark of CMLs (XGBoost, RF, etc.) versus LLMs (GPT-4, Mistral-7b zero-shot and QLoRA fine-tuned) on a fixed 9,134-patient tabular COVID-19 dataset converted to text. All reported metrics (F1 scores, recall) are obtained directly from train/validation splits and external testing; no equations, fitted parameters renamed as predictions, uniqueness theorems, or self-citation chains appear in the derivation of the central claim. The study is therefore self-contained against external benchmarks and receives the default non-circularity finding.
Axiom & Free-Parameter Ledger
axioms (2)
- Domain assumption: Mortality labels in the dataset are accurate and free of significant noise or misclassification.
- Domain assumption: Text conversion of tabular records preserves all predictive signal for LLM zero-shot and fine-tuned classification.