
arxiv: 2605.21507 · v1 · pith:3OVMVZ52new · submitted 2026-05-09 · ⚛️ physics.ao-ph · cs.AI · cs.CE · cs.LG

Visibility nowcasting in South Korea: a machine learning approach to class imbalance and distribution shift

Pith reviewed 2026-05-22 01:58 UTC · model grok-4.3

classification ⚛️ physics.ao-ph · cs.AI · cs.CE · cs.LG
keywords visibility nowcasting · machine learning · class imbalance · distributional shift · Wasserstein distance · SHAP analysis · South Korea · atmospheric visibility

The pith

Visibility nowcasts in South Korean cities lose accuracy on new data because of shifts in meteorological and pollutant distributions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper sets out a machine learning framework for nowcasting atmospheric visibility across six major South Korean cities, where low-visibility events are rare and weather-pollution patterns evolve from year to year. The authors balance the 2018-2020 training records with SMOTENC and CTGAN, then combine machine-learning and deep-learning models into an ensemble before testing on 2021 data. They find that cross-validation scores do not hold up on the later period and trace the drop to a change in the underlying data distribution, shown by the Wasserstein distance on the single most important input variable according to SHAP values. A sympathetic reader cares because visibility predictions directly affect road safety and air-quality alerts, yet models built on past conditions can silently degrade when the environment itself shifts.
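The oversampling step can be pictured with a simplified SMOTE-style interpolation. This is a sketch under stated assumptions, not the authors' pipeline: it handles continuous features only (SMOTENC also treats nominal columns, and CTGAN is a separate generative model), and the function name `smote_continuous`, the two features, and the sample values are all hypothetical.

```python
import math
import random


def smote_continuous(minority, n_new, k=3, seed=0):
    """Create n_new synthetic rows by interpolating between minority rows.

    Simplified SMOTE for continuous features only (an assumption of this
    sketch); nominal columns, as handled by SMOTENC, are not covered.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base row, excluding itself
        neighbours = sorted(
            (row for row in minority if row is not base),
            key=lambda row: math.dist(base, row),
        )[:k]
        mate = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> mate
        synthetic.append([b + gap * (m - b) for b, m in zip(base, mate)])
    return synthetic


# Hypothetical low-visibility rows: [relative humidity %, PM2.5 ug/m3]
low_vis = [[92.0, 80.0], [95.0, 60.0], [88.0, 110.0], [97.0, 75.0]]
new_rows = smote_continuous(low_vis, n_new=6)
```

Because every synthetic row lies on a segment between two observed minority rows, this kind of oversampling can only fill in the region the 2018-2020 data already covers; it cannot anticipate conditions that first appear in 2021.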

Core claim

The central claim is that an ensemble of machine learning and deep learning models, after SMOTENC and CTGAN are used to correct class imbalance in the scarce low-visibility cases, achieves strong results during cross-validation on 2018-2020 data yet shows a clear drop in predictive performance when applied to the 2021 test set. The authors attribute this degradation to a distributional shift between the training and test periods and support the attribution by computing the Wasserstein distance on the feature that SHAP analysis ranks as most influential.
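One common way to combine heterogeneous classifiers is soft voting, where per-class probabilities are averaged before taking the most likely class. The text above does not say which combination rule the authors use, so the sketch below is an assumption for illustration; `soft_vote`, the three model names, and the probabilities are hypothetical.

```python
def soft_vote(prob_by_model, weights=None):
    """Average per-class probabilities across models and return (label, averages).

    Soft voting is assumed here for illustration; the combination rule
    actually used in the paper is not stated in this summary.
    """
    n_models = len(prob_by_model)
    weights = weights or [1.0] * n_models
    total = sum(weights)
    n_classes = len(prob_by_model[0])
    avg = [
        sum(w * probs[c] for w, probs in zip(weights, prob_by_model)) / total
        for c in range(n_classes)
    ]
    return avg.index(max(avg)), avg


# Hypothetical probabilities for one sample; classes = [normal, low visibility]
tree_model = [0.70, 0.30]
boosted_model = [0.40, 0.60]
neural_model = [0.35, 0.65]
label, avg = soft_vote([tree_model, boosted_model, neural_model])
```

Averaging probabilities lets a confident minority-class prediction from one model outweigh lukewarm majority-class predictions from the others, which matters when the positive class is rare.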

What carries the argument

The Wasserstein distance computed on the single highest-SHAP-importance feature, used to quantify and confirm the distributional shift between the 2018-2020 training window and the 2021 test window.
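For reference, the one-dimensional Wasserstein distance has a closed form, which is what makes a single-feature comparison cheap. The expression below assumes the order-1 distance (this summary does not say which order the authors use). Here $P$ and $Q$ stand for the distributions of the top-SHAP feature in the 2018-2020 and 2021 windows, with CDFs $F_P$ and $F_Q$; the notation is illustrative, not the paper's.

```latex
W_1(P, Q) = \int_{-\infty}^{\infty} \left| F_P(x) - F_Q(x) \right| \, dx
          = \int_{0}^{1} \left| F_P^{-1}(u) - F_Q^{-1}(u) \right| \, du

% For two empirical samples of equal size n, with order statistics
% x_{(1)} \le \dots \le x_{(n)} and y_{(1)} \le \dots \le y_{(n)}:
W_1 = \frac{1}{n} \sum_{i=1}^{n} \left| x_{(i)} - y_{(i)} \right|
```

The distance carries the units of the feature, so its magnitude is only interpretable relative to the feature's own spread.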

If this is right

  • Nowcasting systems for visibility must detect and adapt to year-to-year changes in the joint distribution of meteorological and air-pollutant variables.
  • Cross-validation scores on historical data cannot be taken as reliable indicators of future operational performance.
  • Operational visibility models require ongoing monitoring of input-feature distributions to maintain usefulness over time.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Periodic retraining or online adaptation of the ensemble may be needed to keep pace with evolving environmental conditions.
  • The same imbalance-plus-shift problem is likely to appear in other time-series environmental forecasts such as air-quality or precipitation nowcasting.
  • Testing the framework on additional future years would reveal whether the performance decline accelerates or stabilizes.

Load-bearing premise

The premise is twofold: that the performance drop on the 2021 test set stems mainly from a change in data distribution rather than from model overfitting, alterations in measurement methods, or other unaccounted variables; and that the Wasserstein distance on one SHAP-selected feature is enough to establish this cause.

What would settle it

A finding that the Wasserstein distance on the top SHAP feature is small yet predictive skill on 2021 data remains low, or that retraining on data that includes periods closer to 2021 restores skill without any change to the shift measure, would undermine the claim that distributional shift is the primary driver.

Original abstract

Atmospheric visibility is a critical variable for transportation safety and air quality management; however, accurate prediction remains challenging due to the complex interactions between meteorological conditions and air pollutants, as well as the rarity of low-visibility events. This study introduces a machine learning framework to nowcast visibility in six major South Korean cities. To handle the imbalance in the 2018-2020 training data, we applied the Synthetic Minority Over-sampling Technique with Nominal and Continuous (SMOTENC) and Conditional Tabular Generative Adversarial Network (CTGAN). An ensemble approach combining machine learning and deep learning models was then used and evaluated on a 2021 test dataset. The results revealed a marked decline in predictive performance in the test set compared to the cross-validation phase. This degradation was attributed to a distributional shift between training and testing periods, which was quantitatively confirmed by measuring the Wasserstein distance of the most influential feature identified by SHAP analysis. In general, this study presents a methodology that aims to simultaneously address the dual challenges of data imbalance and temporal distributional shifts, and emphasizes the necessity of accounting for evolving external environmental factors when implementing nowcasting models on time-series data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces a machine learning framework for nowcasting visibility in six major South Korean cities using 2018-2020 training data. It applies SMOTENC and CTGAN to address class imbalance for rare low-visibility events, combines machine learning and deep learning models in an ensemble, and evaluates on a 2021 test set. The marked decline in performance relative to cross-validation is attributed to temporal distributional shift, with quantitative support from the Wasserstein distance computed on the single most influential feature identified by post-hoc SHAP analysis.

Significance. If the attribution to distributional shift is substantiated, the work provides a concrete example of handling both class imbalance and temporal shifts in environmental nowcasting, which is relevant for transportation safety and air quality applications. The choice of an independent metric (Wasserstein distance) on a data-driven feature offers a step toward falsifiable explanations in applied ML for atmospheric time series, though the current evidence is limited in scope.

major comments (2)
  1. Abstract: The central claim that the observed performance decline on the 2021 test set is primarily caused by distributional shift rests on Wasserstein distance computed only for the single most influential feature from SHAP analysis. This does not establish the shift as the dominant cause without reporting divergence metrics across the full feature set or joint distributions, nor controlled comparisons isolating the shift from alternatives such as overfitting to 2018-2020 patterns or unmodeled changes in pollutant measurement protocols.
  2. Abstract: No specific performance metrics (e.g., precision, recall, F1, or AUC with error bars), exact ensemble architectures, hyperparameter details, or the procedure for post-hoc SHAP feature selection are reported, which prevents assessment of whether the decline magnitude is consistent with the claimed shift or with other factors.
minor comments (1)
  1. Abstract: The sentence beginning 'This degradation was attributed...' would benefit from a brief parenthetical note on the exact feature used for the Wasserstein calculation to improve immediate readability.
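The first major comment asks for divergence metrics across the full feature set. A minimal sketch of that check follows, assuming per-feature (marginal) order-1 Wasserstein distances scaled by the training standard deviation so that features in different units can be ranked together. The helper names, feature names, and values are hypothetical, and marginal distances do not capture shifts in the joint distribution.

```python
import statistics


def wasserstein_1d(xs, ys):
    """Empirical 1-D Wasserstein-1 distance: area between the two step CDFs."""
    xs, ys = sorted(xs), sorted(ys)
    points = sorted(set(xs) | set(ys))
    i = j = 0
    area = 0.0
    for left, right in zip(points, points[1:]):
        # Advance counters so i/len(xs) and j/len(ys) are the CDF values at `left`
        while i < len(xs) and xs[i] <= left:
            i += 1
        while j < len(ys) and ys[j] <= left:
            j += 1
        area += abs(i / len(xs) - j / len(ys)) * (right - left)
    return area


def rank_feature_shift(train, test):
    """Return (feature, scaled distance) pairs, largest shift first."""
    ranked = []
    for name in train:
        # Scale by the training spread (assumption) to compare across units
        scale = statistics.pstdev(train[name]) or 1.0
        ranked.append((name, wasserstein_1d(train[name], test[name]) / scale))
    return sorted(ranked, key=lambda item: item[1], reverse=True)


# Hypothetical feature samples for the training and test windows
train = {"relative_humidity": [50, 60, 70, 80], "pm25": [10, 20, 30, 40]}
test = {"relative_humidity": [50, 60, 70, 80], "pm25": [30, 40, 50, 60]}
ranking = rank_feature_shift(train, test)
```

If the top-SHAP feature does not also sit near the top of such a ranking, or if several low-importance features shift more strongly, the single-feature evidence becomes harder to read as the dominant cause.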

Simulated Authors' Rebuttal

2 responses · 1 unresolved

We are grateful to the referee for providing a thorough review of our manuscript. The comments have prompted us to clarify several aspects of our methodology and results. Below, we respond to each major comment in turn.

Point-by-point responses
  1. Referee: Abstract: The central claim that the observed performance decline on the 2021 test set is primarily caused by distributional shift rests on Wasserstein distance computed only for the single most influential feature from SHAP analysis. This does not establish the shift as the dominant cause without reporting divergence metrics across the full feature set or joint distributions, nor controlled comparisons isolating the shift from alternatives such as overfitting to 2018-2020 patterns or unmodeled changes in pollutant measurement protocols.

    Authors: We acknowledge the validity of this concern. Our attribution to distributional shift is based on the most influential feature per SHAP, which we chose as a focused, interpretable approach. To strengthen this, we will add Wasserstein distance calculations for additional top features from the SHAP analysis in the revised manuscript. We will also discuss potential confounding factors like overfitting and measurement changes. However, a comprehensive set of controlled comparisons to definitively isolate the shift is not feasible within the current study scope and will be listed as a limitation. revision: partial

  2. Referee: Abstract: No specific performance metrics (e.g., precision, recall, F1, or AUC with error bars), exact ensemble architectures, hyperparameter details, or the procedure for post-hoc SHAP feature selection are reported, which prevents assessment of whether the decline magnitude is consistent with the claimed shift or with other factors.

    Authors: We agree that the abstract should provide more quantitative context. In the revision, we will incorporate specific performance metrics including precision, recall, F1, and AUC with error bars for both cross-validation and the 2021 test set. We will also briefly describe the ensemble architecture (an ensemble of tree-based models and deep learning models), key hyperparameters, and the post-hoc SHAP feature selection procedure. These details are elaborated in the methods and results sections, but summarizing them in the abstract will improve accessibility. revision: yes
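The metrics named in the second exchange can be computed directly from predicted and true labels. The sketch below is illustrative only: `binary_scores` and the label vectors are hypothetical and are not results from the paper.

```python
def binary_scores(y_true, y_pred, positive=1):
    """Precision, recall, and F1 for the positive (rare) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical labels: 1 = low visibility (rare), 0 = normal
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 0, 1, 0, 0, 0, 1, 0, 1, 0]
scores = binary_scores(y_true, y_pred)
```

Reporting these per class for both the cross-validation folds and the 2021 test set, rather than overall accuracy, is what exposes a decline on rare events: a model that always predicts the majority class would still reach 70% accuracy on the labels above while having zero recall.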

standing simulated objections not resolved
  • Conducting controlled comparisons to fully isolate distributional shift from alternatives like overfitting or changes in measurement protocols.

Circularity Check

0 steps flagged

No significant circularity: empirical attribution relies on independent data metric

Full rationale

The paper performs a standard temporal train-test split on real meteorological and pollutant data (2018-2020 training, 2021 testing), directly measures predictive performance drop on the held-out set, and attributes it to distributional shift via Wasserstein distance computed on the single highest-SHAP-importance feature. This metric is an external statistical comparison of observed data distributions and does not reduce to any fitted model parameter, self-definition, or self-citation chain by construction. The ensemble, SMOTENC, and CTGAN steps address imbalance separately from the shift diagnosis. No load-bearing step equates a claimed result to its own inputs; the derivation remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axioms · 0 invented entities

The central claim rests on standard ML assumptions plus domain choices for handling imbalance and attributing shift; no new physical entities are postulated.

free parameters (2)
  • Ensemble model selection and hyperparameters
    Specific models combined in the ensemble and their tuning parameters are chosen to optimize performance on the training data.
  • Definition of low-visibility class threshold
    The boundary separating rare low-visibility events from the majority class is implicitly set to create the imbalance problem addressed by oversampling.
axioms (1)
  • domain assumption: The 2018-2020 training data distribution is sufficiently stationary within the period to allow effective model training despite known temporal variability in environmental data.
    The paper trains on this fixed window and evaluates generalization to 2021 without adaptive mechanisms.



