On the Performance of Imputation Techniques for Missing Values on Healthcare Datasets

Babu Sena Paul; Luke Oluwaseye Joel; Wesley Doorsamy

arxiv: 2403.14687 · v1 · pith:YNQW7LCHnew · submitted 2024-03-13 · 💻 cs.LG · cs.AI


Luke Oluwaseye Joel, Wesley Doorsamy, Babu Sena Paul


keywords: imputation, missing, values, data, datasets, healthcare, mean, perform


Missing values are a common characteristic of real-world datasets, especially healthcare data. This is a problem when applying machine learning algorithms to such datasets, because most machine learning models perform poorly in the presence of missing values. The aim of this study is to compare the performance of seven imputation techniques, namely Mean imputation, Median imputation, Last Observation Carried Forward (LOCF) imputation, K-Nearest Neighbor (KNN) imputation, Interpolation imputation, MissForest imputation, and Multiple Imputation by Chained Equations (MICE), on three healthcare datasets. Missing values were introduced into the datasets at rates of 10%, 15%, 20% and 25%, and the imputation techniques were employed to impute these missing values. Their performance was compared using root mean squared error (RMSE) and mean absolute error (MAE). The results show that MissForest imputation performs best, followed by MICE imputation. Additionally, we try to determine whether it is better to perform feature selection before imputation or vice versa, using recall, precision, F1-score and accuracy as metrics. Because there is little literature on this question and some debate among researchers, we hope that the results from this experiment will encourage data scientists and researchers to perform imputation before feature selection when dealing with data containing missing values.
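
The evaluation protocol described in the abstract (mask a fraction of known values, impute them, then score the imputed values against the originals with RMSE and MAE) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the data is a synthetic single column, values are masked completely at random, errors are computed on the masked positions only, and only mean and median imputation are shown. The function names `mask_values` and `impute` are hypothetical.

```python
import math
import random
import statistics


def mask_values(column, fraction, rng):
    """Copy `column`, replacing a random `fraction` of entries with None.

    Returns the masked copy and the sorted list of masked indices.
    """
    n_missing = round(len(column) * fraction)
    idx = sorted(rng.sample(range(len(column)), n_missing))
    masked = list(column)
    for i in idx:
        masked[i] = None
    return masked, idx


def impute(column, strategy):
    """Fill None entries with the mean or median of the observed entries."""
    observed = [v for v in column if v is not None]
    if strategy == "mean":
        fill = statistics.mean(observed)
    else:
        fill = statistics.median(observed)
    return [fill if v is None else v for v in column]


def rmse(truth, imputed, idx):
    """Root mean squared error over the masked positions only (assumption)."""
    return math.sqrt(sum((truth[i] - imputed[i]) ** 2 for i in idx) / len(idx))


def mae(truth, imputed, idx):
    """Mean absolute error over the masked positions only (assumption)."""
    return sum(abs(truth[i] - imputed[i]) for i in idx) / len(idx)


rng = random.Random(0)
# Synthetic stand-in for one numeric clinical feature (assumption).
truth = [rng.gauss(120, 15) for _ in range(200)]

results = {}
for fraction in (0.10, 0.15, 0.20, 0.25):
    masked, idx = mask_values(truth, fraction, rng)
    for strategy in ("mean", "median"):
        filled = impute(masked, strategy)
        results[(fraction, strategy)] = (
            rmse(truth, filled, idx),
            mae(truth, filled, idx),
        )
        print(fraction, strategy, results[(fraction, strategy)])
```

Lower RMSE and MAE indicate imputed values closer to the originals. Model-based techniques such as KNN, MissForest, or MICE would replace `impute` in the same loop.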


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Predicting Fetal Birthweight from High Dimensional Data using Advanced Machine Learning
cs.LG 2025-02 unverdicted novelty 3.0

Machine learning pipeline with MICE imputation, tree-based feature selection, and ensemble models predicts birth weight, claiming improved performance on constrained clinical datasets.