arxiv: 2605.14147 · v1 · pith:ZLZTKUJAnew · submitted 2026-05-13 · 💻 cs.LG

A Systematic Evaluation of Imbalance Handling Methods in Biomedical Binary Classification

Jiandong Chen, Lingjie Su, Le Peng, Yash Travadi, Rui Zhang, Ju Sun

Pith reviewed 2026-05-15 04:47 UTC · model grok-4.3

classification 💻 cs.LG

keywords imbalance handling methods · biomedical binary classification · model complexity · data modalities · random oversampling · SMOTE · re-weighting · F1-score optimization

The pith

Imbalance handling boosts complex models on unstructured biomedical data but harms simple ones.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper tests five standard techniques for dealing with imbalanced classes in biomedical classification problems. The tests cover three datasets in tabular, text, and image formats and models ranging from logistic regression to deep neural networks. The results show no gain from any technique for simple models on tabular data; for complex models on text and images, random oversampling and re-weighting help, while random undersampling and SMOTE degrade performance. Readers should care because the choice of technique directly affects how well medical prediction systems work in practice. The findings provide guidance on when these techniques are worth applying.

Core claim

The effectiveness of imbalance handling methods (IHMs) depends on both model complexity and data modality. For simpler models such as logistic regression on tabular data, IHMs yielded no significant advantage over the raw-training (RAW) baseline. For more complex models and unstructured data, three patterns emerged: random oversampling (ROS) and re-weighting (RW) consistently enhanced the performance of powerful models; direct F1-score optimization (DMO) was useful primarily for unstructured text and image data; and random undersampling (RUS) and SMOTE consistently degraded performance and are therefore not recommended.

What carries the argument

A head-to-head evaluation of five imbalance handling methods (random undersampling, random oversampling, SMOTE, re-weighting, and direct F1-score optimization) against a raw-training baseline, across model complexities and data modalities.
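
To make the resampling and re-weighting ideas concrete, here is a minimal sketch in plain Python. It is not the paper's code: the function names are hypothetical, the class-weight formula is the common "balanced" heuristic (n_samples / (n_classes * class_count)) and may differ from the paper's re-weighting scheme, and `smote_point` shows only the interpolation step of SMOTE, with the nearest-neighbor search left out. In practice a maintained library such as imbalanced-learn would likely be used instead.

```python
import random
from collections import Counter


def random_oversample(X, y, seed=0):
    """ROS: duplicate minority samples (with replacement) until classes match."""
    rng = random.Random(seed)
    counts = Counter(y)
    majority_n = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        pool = [x for x, lab in zip(X, y) if lab == label]
        extra = [rng.choice(pool) for _ in range(majority_n - n)]
        X_out.extend(extra)
        y_out.extend([label] * len(extra))
    return X_out, y_out


def random_undersample(X, y, seed=0):
    """RUS: keep a random subset of each class, sized to the smallest class."""
    rng = random.Random(seed)
    counts = Counter(y)
    minority_n = min(counts.values())
    X_out, y_out = [], []
    for label in counts:
        pool = [x for x, lab in zip(X, y) if lab == label]
        X_out.extend(rng.sample(pool, minority_n))
        y_out.extend([label] * minority_n)
    return X_out, y_out


def balanced_class_weights(y):
    """RW (assumed heuristic): weight = n_samples / (n_classes * class_count)."""
    counts = Counter(y)
    n_samples, n_classes = len(y), len(counts)
    return {label: n_samples / (n_classes * c) for label, c in counts.items()}


def smote_point(x, neighbor, rng):
    """SMOTE interpolation step: a random point on the segment x -> neighbor."""
    lam = rng.random()
    return [a + lam * (b - a) for a, b in zip(x, neighbor)]
```

Note that RUS discards majority samples while ROS and RW keep every training example; the sketch only illustrates that mechanical difference and says nothing about why the paper observes the performance gaps it reports.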

If this is right

ROS and RW enhance performance of powerful models on text and image data.
Direct F1-score optimization (DMO) is useful mainly for unstructured text and image data.
RUS and SMOTE degrade performance and should be avoided.
No IHM advantage for simple models on tabular data.
Performance gains are most pronounced with high-complexity models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The results suggest testing re-weighting early in model development for medical imaging tasks.
Similar patterns might appear in other domains with complex models like audio or video medical data.
Additional studies with more datasets could verify if these recommendations hold for rare disease detection.

Load-bearing premise

The three public biomedical datasets and the range of model architectures chosen are representative of typical biomedical binary classification problems.

What would settle it

A replication study on a different biomedical dataset, such as chest X-rays or electronic health records with a complex model, showing improved performance from SMOTE would falsify the claim that it degrades results.

Figures

Figures reproduced from arXiv: 2605.14147 by Jiandong Chen, Ju Sun, Le Peng, Lingjie Su, Rui Zhang, Yash Travadi.

Original abstract

Objective: The primary goal of this study was to systematically examine the impact of commonly used imbalance handling methods (IHMs) on predictive performance in biomedical binary classification, considering the interplay between model complexity and diverse data modalities.

Material and Methods: We evaluated five representative IHMs: random undersampling (RUS), random oversampling (ROS), SMOTE, re-weighting (RW), and direct F1-score optimization (DMO), against a raw training (RAW) baseline. The evaluation encompassed three public biomedical datasets: MIMIC-III (tabular), ADE-Corpus-V2 (text), and MURA (image), spanning three common biomedical data modalities. To assess varying model complexity, we employed a range of architectures, from classical logistic regression and random forest to deep neural networks, including multilayer perceptron (MLP), BiLSTM, BERT, DenseNet, and DINOv2.

Results: For simpler models such as logistic regression on tabular data, IHMs yielded no significant advantage over the RAW baseline, aligning with prior findings. However, clear benefits were observed for more complex models and unstructured data: (a) ROS and RW consistently enhanced the performance of powerful models; (b) direct F1-score optimization demonstrated utility primarily for unstructured text and image data; and (c) RUS and SMOTE consistently degraded performance and are therefore not recommended.

Conclusion: The effectiveness of IHMs depends on both model complexity and data modality. Performance gains are most pronounced when leveraging appropriate IHMs, such as ROS, RW, and DMO, on high-complexity models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

This benchmarking paper finds that ROS and reweighting help complex models on text and image biomedical data while RUS and SMOTE hurt, but simple models gain nothing and the patterns rest on just three datasets.

The main thing to know is that this study finds imbalance handling is not uniformly useful in biomedical binary classification. For simple models like logistic regression on tabular data, none of the five methods beat the raw baseline. For deeper models on text and image data, random oversampling and reweighting improve performance, direct F1 optimization helps on unstructured modalities, and random undersampling plus SMOTE degrade results across the board. The authors treat this as practical guidance rather than new theory.

Referee Report

2 major / 2 minor

Summary. The manuscript conducts a systematic empirical study evaluating five imbalance handling methods (RUS, ROS, SMOTE, RW, DMO) against a raw baseline on three public biomedical datasets spanning tabular (MIMIC-III), text (ADE-Corpus-V2), and image (MURA) modalities. Using models from logistic regression and random forests to deep architectures like MLP, BiLSTM, BERT, DenseNet, and DINOv2, it finds that simpler models show no benefit from IHMs, while complex models benefit from ROS and RW, DMO is useful for unstructured data, and RUS/SMOTE degrade performance.

Significance. If the observed patterns hold, this provides actionable insights for biomedical ML practitioners on selecting imbalance handling strategies based on model complexity and data type. The use of multiple public datasets and diverse models strengthens the empirical basis, offering reproducible benchmarks that could guide future work in handling class imbalance in medical applications.

major comments (2)

[Datasets and Experimental Setup] The central claims regarding consistent benefits of ROS/RW for complex models and degradation by RUS/SMOTE rest on only three datasets without sensitivity checks on imbalance ratios or additional biomedical domains. This limits the ability to generalize the modality- and complexity-dependent effects beyond the specific characteristics of MIMIC-III, ADE, and MURA (e.g., their particular imbalance ratios and noise structures).
[Results] The abstract states that 'ROS and RW consistently enhanced the performance of powerful models' and 'RUS and SMOTE consistently degraded performance'; the Results section should report per-model statistical significance tests and effect sizes across all dataset-model pairs to substantiate the 'consistent' qualifier, as the current directional findings alone do not rule out setup-specific variation.

minor comments (2)

[Methods] Clarify the exact definition and implementation details of direct F1-score optimization (DMO) in the Methods section, including the loss formulation and any hyperparameters, to allow full reproducibility.
[Figures] Figure captions and axis labels in the performance comparison plots could be expanded to explicitly note the imbalance ratio for each dataset, improving immediate interpretability.
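
On the first minor comment: this page does not reproduce the paper's DMO loss, so the following is only a generic illustration of what a "loss formulation" for F1 can look like. It is a common soft-F1 surrogate in which hard true-positive, false-positive, and false-negative counts are replaced by sums of predicted probabilities. It should not be read as the authors' method; the function name and the `eps` smoothing constant are assumptions made for this sketch.

```python
def soft_f1_loss(probs, labels, eps=1e-8):
    """Return 1 - soft F1 for binary labels (0/1) and predicted probabilities.

    Generic surrogate, NOT the paper's DMO formulation: hard confusion-matrix
    counts are replaced by probability-weighted sums so the value varies
    smoothly with the predictions.
    """
    tp = sum(p * y for p, y in zip(probs, labels))          # soft true positives
    fp = sum(p * (1 - y) for p, y in zip(probs, labels))    # soft false positives
    fn = sum((1 - p) * y for p, y in zip(probs, labels))    # soft false negatives
    soft_f1 = 2 * tp / (2 * tp + fp + fn + eps)
    return 1.0 - soft_f1
```

A surrogate like this has at least one hyperparameter (`eps`) and depends on how predictions are batched, which are the kinds of details the comment asks the authors to document.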

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and outline the revisions we will make to strengthen the empirical claims and clarify limitations.

Referee: [Datasets and Experimental Setup] The central claims regarding consistent benefits of ROS/RW for complex models and degradation by RUS/SMOTE rest on only three datasets without sensitivity checks on imbalance ratios or additional biomedical domains. This limits the ability to generalize the modality- and complexity-dependent effects beyond the specific characteristics of MIMIC-III, ADE, and MURA (e.g., their particular imbalance ratios and noise structures).

Authors: We acknowledge that the use of three datasets limits the strength of generalization claims, even though the datasets were selected to span tabular, text, and image modalities common in biomedical applications. We agree that sensitivity analyses varying imbalance ratios would provide additional support. In the revised manuscript we will add a dedicated limitations subsection discussing the specific imbalance ratios and noise characteristics of the chosen datasets, include a brief sensitivity check on subsampled imbalance ratios for the tabular dataset where computationally feasible, and expand the discussion of how results may vary across other biomedical domains. Full experiments on additional datasets remain outside the scope of this revision due to data access and computational constraints.
revision: partial

Referee: [Results] The abstract states that 'ROS and RW consistently enhanced the performance of powerful models' and 'RUS and SMOTE consistently degraded performance'; the Results section should report per-model statistical significance tests and effect sizes across all dataset-model pairs to substantiate the 'consistent' qualifier, as the current directional findings alone do not rule out setup-specific variation.

Authors: We agree that directional trends alone are insufficient to support the term 'consistently' and that statistical tests plus effect sizes are required. In the revised manuscript we will add per-model paired statistical significance tests (Wilcoxon signed-rank or t-tests as appropriate) together with effect sizes (Cohen's d) for every dataset-model pair in the Results section. We will also update the abstract to reflect the statistical findings rather than purely directional statements.
revision: yes
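
As a rough illustration of the analysis promised in this response, the sketch below computes a paired Cohen's d (mean difference divided by the standard deviation of the differences) and an exact two-sided Wilcoxon signed-rank p-value for one dataset-model pair. It is a simplified stand-in, not the authors' analysis code: the function names are hypothetical, and the exact test assumes no zero differences and no tied absolute differences (a library routine such as SciPy's `scipy.stats.wilcoxon` handles those cases).

```python
import statistics


def paired_cohens_d(treated, baseline):
    """Cohen's d for paired samples: mean(diff) / sample SD(diff)."""
    diffs = [t - b for t, b in zip(treated, baseline)]
    return statistics.mean(diffs) / statistics.stdev(diffs)


def wilcoxon_signed_rank_exact(treated, baseline):
    """Return (W+, exact two-sided p-value). Assumes no zeros and no ties."""
    diffs = [t - b for t, b in zip(treated, baseline)]
    abs_diffs = [abs(d) for d in diffs]
    if 0.0 in abs_diffs or len(set(abs_diffs)) != len(abs_diffs):
        raise ValueError("zeros or ties present; use a library routine instead")
    n = len(diffs)
    # Rank the absolute differences from 1 (smallest) to n (largest).
    order = sorted(range(n), key=lambda i: abs_diffs[i])
    ranks = [0] * n
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    # counts[s] = number of subsets of {1..n} whose ranks sum to s; under the
    # null hypothesis each rank is "positive" with probability 1/2.
    max_sum = n * (n + 1) // 2
    counts = [0] * (max_sum + 1)
    counts[0] = 1
    for r in range(1, n + 1):
        for s in range(max_sum, r - 1, -1):
            counts[s] += counts[s - r]
    total = 2 ** n
    p_low = sum(counts[: w_plus + 1]) / total
    p_high = sum(counts[w_plus:]) / total
    return w_plus, min(1.0, 2 * min(p_low, p_high))
```

With scores from repeated runs (for example, one F1 value per random seed under RAW and under ROS), the two functions give the per-pair effect size and p-value the referee requests. With only five paired runs the smallest attainable two-sided p-value is 2/32, so the number of repetitions limits what such a test can show.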

Circularity Check

0 steps flagged

No circularity: purely empirical benchmarking with direct performance measurements

The paper performs a systematic empirical evaluation of five imbalance handling methods across three public biomedical datasets and multiple model architectures, reporting direct performance metrics on held-out test sets. No derivations, first-principles predictions, fitted parameters renamed as predictions, or self-citation chains are present; all results are obtained by training and evaluating models under controlled conditions. The central claims follow immediately from the observed F1-scores and other metrics without any reduction to the inputs by construction. This is a standard benchmarking study whose validity rests on experimental design rather than any self-referential logic.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

No free parameters, invented entities, or non-standard axioms; the study rests on standard supervised learning assumptions and public benchmark datasets.

axioms (1)

[standard math] Standard i.i.d. train-test split assumption for supervised learning evaluation
Implicit in any benchmark comparison of classifiers.
