
arxiv: 2605.21528 · v1 · pith:M7NF5ZRKnew · submitted 2026-05-19 · 💻 cs.LG · cs.AI

A Reproducible Log-Driven AutoML Framework for Interpretable Pipeline Optimization in Healthcare Risk Prediction

Pith reviewed 2026-05-22 01:28 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords AutoML · Healthcare Risk Prediction · Pipeline Optimization · Reproducibility · Class Imbalance · Log-driven Analysis · Component Redundancy · Interpretability

The pith

A log-driven AutoML framework reveals that healthcare risk prediction pipelines depend on a small set of interacting components.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a deterministic framework that encodes every pipeline as a traceable log entity to support full reproducibility and component-level analysis. Experiments running more than 18,000 configurations on the Pima Indians Diabetes and Stroke datasets show the search space is structured and partially redundant, with performance driven mainly by augmentation, model choice, and imbalance handling. A sympathetic reader would care because this points to a practical way to simplify AutoML searches in medical settings where data are scarce and imbalanced. If correct, the work implies that exhaustive pipeline trials can be replaced by targeted focus on a few high-impact choices without large losses in performance.

Core claim

The yvsoucom-iterkit framework treats each pipeline as a traceable log entity. Analysis of more than 18,000 configurations on the Pima Indians Diabetes and Stroke datasets shows a structured and partially redundant search space in which performance is governed by a small subset of interacting components. Random Forest importance ranks augmentation at 0.454 and model choice at 0.198 on Pima, while imbalance handling reaches 0.406 on Stroke. Similarity metrics quantify redundancies such as biMax versus biMean feature selection (RMS distance 0.0252) and mixup versus no augmentation (0.0279).
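The page does not say how the paper defines its RMS distance, so the following is only a minimal sketch under one plausible reading: two values of a component are compared by matching configurations that agree on everything else and taking the root mean square of the score differences. The log rows, the scores, and the function name `rms_distance` are hypothetical.

```python
import math

# Hypothetical log rows: (feature_selection, rest_of_config, weighted_f1).
# The scores are invented for illustration and are not taken from the paper.
LOGS = [
    ("biMax", "rf", 0.88), ("biMean", "rf", 0.87),
    ("biMax", "svm", 0.80), ("biMean", "svm", 0.82),
]


def rms_distance(logs, value_a, value_b):
    """RMS of score differences between two component values,
    matched on the rest of the configuration (an assumed definition)."""
    by_context = {}
    for value, context, score in logs:
        by_context.setdefault(context, {})[value] = score
    diffs = [
        scores[value_a] - scores[value_b]
        for scores in by_context.values()
        if value_a in scores and value_b in scores
    ]
    if not diffs:
        raise ValueError("no matched configurations for this pair")
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))


distance = rms_distance(LOGS, "biMax", "biMean")
print(round(distance, 4))
```

Under this reading, a small distance means that swapping one value for the other barely changes the score in any context, which is the sense in which the page calls two variants redundant.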

What carries the argument

The traceable log entity that records every pipeline configuration, enabling attribution of performance to individual components, measurement of their interactions and similarities, and assessment of cross-seed robustness.
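As a rough illustration of what a traceable log entity could look like, the sketch below enumerates a toy search space in a fixed order and derives a stable identifier from each configuration. The component names, the values, and the helpers `enumerate_pipelines` and `log_entry` are assumptions made for this example and are not the schema or API of yvsoucom-iterkit.

```python
import hashlib
import itertools
import json

# Hypothetical search space; names and values are illustrative only.
SEARCH_SPACE = {
    "augmentation": ["none", "mixup", "gaussian_noise"],
    "imbalance": ["none", "smote", "tomek_links"],
    "model": ["random_forest", "svm"],
}


def enumerate_pipelines(space):
    """Yield every configuration in a fixed, reproducible order."""
    keys = sorted(space)  # sorted keys keep the order independent of dict layout
    for values in itertools.product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))


def log_entry(config, seed, metrics):
    """Build one log record whose ID depends only on the configuration content."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    pipeline_id = hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]
    return {"pipeline_id": pipeline_id, "config": config, "seed": seed, "metrics": metrics}


configs = list(enumerate_pipelines(SEARCH_SPACE))
# Metrics would be filled in after the pipeline actually runs.
entry = log_entry(configs[0], seed=0, metrics={"weighted_f1": None})
print(len(configs), json.dumps(entry, sort_keys=True))
```

Because the identifier is a hash of the canonical JSON form, the same configuration maps to the same ID across runs and machines, which is one simple way to make per-component attribution and cross-seed comparison traceable.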

If this is right

  • Effective AutoML optimization can concentrate on a reduced set of high-impact components instead of exploring the full space.
  • Ensemble models deliver stable high Weighted-F1 scores (0.89 on Pima, 0.94 on Stroke) with lower cross-seed variability than alternatives such as SVM.
  • Many component variants exhibit strong redundancy, including specific augmentation methods that perform nearly identically to no augmentation.
  • Macro-F1 remains limited on severely imbalanced data like Stroke even when Weighted-F1 is high.
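The gap between Weighted-F1 and Macro-F1 noted in the last point follows from how the two averages are formed. The sketch below uses synthetic labels (95 negatives, 5 positives, invented here and unrelated to the Stroke data) to show the effect with a plain-Python implementation of the standard definitions.

```python
def f1_per_class(y_true, y_pred, label):
    """F1 for one class, treating that class as the positive label."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_and_weighted_f1(y_true, y_pred):
    """Macro-F1 averages classes equally; Weighted-F1 weights them by support."""
    labels = sorted(set(y_true))
    scores = {c: f1_per_class(y_true, y_pred, c) for c in labels}
    support = {c: y_true.count(c) for c in labels}
    macro = sum(scores.values()) / len(labels)
    weighted = sum(scores[c] * support[c] for c in labels) / len(y_true)
    return macro, weighted


# Synthetic case: the classifier recovers only 1 of the 5 minority samples.
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 95 + [1] + [0] * 4
macro, weighted = macro_and_weighted_f1(y_true, y_pred)
print(f"macro={macro:.3f} weighted={weighted:.3f}")
```

The majority class dominates the weighted average, so a model that misses most minority cases can still post a high Weighted-F1 while Macro-F1 stays low.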

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar log-based redundancy analysis could be applied to other clinical prediction tasks to discover domain-specific shortcuts.
  • The framework's built-in traceability may help meet regulatory demands for documented model construction in healthcare.
  • The observed performance-robustness trade-off suggests ensembles as a default choice when reproducibility across random seeds matters.
  • Future experiments could deliberately prune the redundant components identified here and measure any drop in final accuracy.

Load-bearing premise

The two chosen datasets, together with the more than 18,000 evaluated configurations, represent the broader space of healthcare risk prediction tasks well enough for the observed component importance rankings to generalize.

What would settle it

Re-running the identical log-driven framework on a new healthcare dataset such as heart-disease prediction and finding that the top-ranked components shift away from augmentation and imbalance handling.

Figures

Figures reproduced from arXiv: 2605.21528 by Lican Huang, Rui Huang.

Figure 1. System architecture of the proposed AutoML framework. The framework follows a centralized configuration and deterministic pipeline enumeration strategy, enabling parallel execution and reproducible large-scale experimentation.
… verification of pipeline behavior. By maintaining access to both intermediate and final outputs, the framework enables fine-grained inspection of model performance beyond aggregate …
Figure 2. Random Forest-based component importance for the Pima dataset. Exact importance values are provided in the experimental logs. Random Forest importance reflects sensitivity of performance to component perturbations rather than direct performance improvement or degradation.
Component-Specific Analysis. The unified heatmaps (Figure Supplementary C.3 and the Pima and Stroke Supplementary Materials) provide …
Figure 3. Detailed Random Forest importance analysis for the Pima dataset.
Figure 4. Pima unified metric mega heat map for branches.
4.4.4. Value-Level Component Similarity Analysis. Tables Supplementary D.7 summarize the value-level similarity analysis for the Pima dataset. Ranking value-level RMS similarities reveals structured redundancy patterns across pipeline components. Similarity-Based Insights. Feature Dimensionality: Configurations with 4–8 selected features form a compact similar…
Figure 5. Pima cluster10 part vs part heat map.
… around 0.81, indicating relatively strong but more variable interactions compared to the Pima dataset. These analyses help identify groups of preprocessing, augmentation, normalization, and modeling components that tend to co-vary in their effects. By focusing on subsets of high-impact or highly variable components, we reduce complexity while highlighting the most influ…
Figure 6. Pima cross-seed mean performance and standard deviation across random seeds.
Performance and Stability Analysis Under Stochastic Variation. Figure Supplementary F.16 illustrates the joint distribution of mean Macro-F1 and standard deviation across models, highlighting the trade-off between predictive performance and stability. SVM achieves the highest Macro-F1 (0.689) but also exhibits the highest variabi…
Original abstract

Accurate and reproducible disease risk prediction remains challenging due to heterogeneous features, limited samples, and severe class imbalance. This study introduces yvsoucom-iterkit, a deterministic and log-driven automated machine learning framework that formulates pipeline optimization as a fully reproducible, configuration-level system. Each pipeline is encoded as a traceable log entity, enabling analysis of component attribution, interactions, similarity, and cross-seed robustness. Experiments on the Pima Indians Diabetes and Stroke datasets across more than 18,000 pipeline configurations reveal a structured and partially redundant search space, where performance is governed by a small subset of interacting components. Random Forest importance analysis identifies augmentation (0.454), model choice (0.198), and imbalance handling (0.101) as key drivers on Pima, while imbalance handling dominates Stroke (0.406). Component similarity analysis shows strong redundancy, with feature selection variants (biMax-biMean) exhibiting low RMS distance (0.0252), mixup closely matching no augmentation (0.0279), and TomekLinks aligning with no imbalance handling (0.0325), whereas Gaussian noise shows greater divergence from no augmentation (0.10). The framework achieves strong and stable performance using ensemble models (Weighted-F1 0.89, Macro-F1 0.88 on Pima; Weighted-F1 0.94 on Stroke), while Macro-F1 remains lower on Stroke (0.67) due to class imbalance. Cross-seed analysis reveals a performance-robustness trade-off, with ensembles showing lower variability (0.023-0.026) than SVM. These results indicate that effective AutoML optimization can focus on a reduced set of high-impact components.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces yvsoucom-iterkit, a deterministic log-driven AutoML framework that encodes each pipeline as a traceable log entity to support reproducible optimization and post-hoc interpretability analysis (component attribution, interactions, similarity, and cross-seed robustness). Experiments across more than 18,000 configurations on the Pima Indians Diabetes and Stroke datasets are used to argue that the pipeline search space is structured and partially redundant, with performance governed by a small subset of interacting components; Random Forest importance identifies augmentation (0.454) and model choice (0.198) as top drivers on Pima while imbalance handling (0.406) dominates on Stroke, and RMS-distance similarity analysis quantifies redundancies such as biMax-biMean (0.0252) and mixup vs. none (0.0279). Ensemble models achieve stable Weighted-F1 scores of 0.89–0.94.

Significance. If the central empirical claims hold, the work provides a concrete, log-centric methodology for making AutoML pipelines both reproducible and interpretable in healthcare settings, where the ability to trace and reduce the effective search space to a small set of high-impact components could be practically useful. The explicit quantification of component redundancy via RMS distances and the reporting of cross-seed variability constitute strengths that go beyond typical black-box AutoML results.

major comments (2)
  1. [Abstract] Abstract: the central claim that 'performance is governed by a small subset of interacting components' rests on experiments performed on only two binary, imbalanced classification datasets whose top importance rankings already diverge (augmentation 0.454 on Pima vs. imbalance handling 0.406 on Stroke); this divergence indicates that the identified high-impact subset may be an artifact of the chosen data distributions rather than a general property of healthcare risk pipelines.
  2. [Abstract] Abstract: the manuscript provides no description of the sampling strategy used to generate the 18,000 pipeline configurations or of any post-hoc filtering; without this information the Random Forest importance scores and the RMS-distance redundancy results cannot be reliably interpreted as evidence of an intrinsically structured search space.
minor comments (2)
  1. [Abstract] Abstract: the framework name 'yvsoucom-iterkit' is introduced without explanation of its construction or intended meaning.
  2. [Abstract] Abstract: the reported cross-seed variability (0.023–0.026) should explicitly state the underlying performance metric (e.g., standard deviation of Weighted-F1 across seeds).
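The second minor comment is easiest to satisfy when the variability figure is reported together with the metric and the number of seeds. A minimal sketch of such a report is shown below; the per-seed scores, the model names, and the helper `cross_seed_summary` are hypothetical, and the choice of the sample standard deviation is an assumption, since the page does not state which statistic the paper uses.

```python
import statistics

# Hypothetical Weighted-F1 scores per model across five seeds (illustrative only).
SCORES_BY_SEED = {
    "ensemble": [0.89, 0.88, 0.90, 0.89, 0.87],
    "svm": [0.91, 0.80, 0.93, 0.78, 0.88],
}


def cross_seed_summary(scores_by_seed):
    """Return {model: (mean, sample standard deviation)} over seeds."""
    return {
        model: (statistics.mean(scores), statistics.stdev(scores))
        for model, scores in scores_by_seed.items()
    }


summary = cross_seed_summary(SCORES_BY_SEED)
for model, (mean, std) in summary.items():
    n_seeds = len(SCORES_BY_SEED[model])
    print(f"{model}: Weighted-F1 mean={mean:.3f} std={std:.3f} (n={n_seeds} seeds)")
```

Stating the metric, the statistic, and the seed count on one line removes the ambiguity the referee points out.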

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on the abstract. We address each major comment below, indicating where revisions will be made to improve clarity and completeness without altering the core claims or experimental results.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim that 'performance is governed by a small subset of interacting components' rests on experiments performed on only two binary, imbalanced classification datasets whose top importance rankings already diverge (augmentation 0.454 on Pima vs. imbalance handling 0.406 on Stroke); this divergence indicates that the identified high-impact subset may be an artifact of the chosen data distributions rather than a general property of healthcare risk pipelines.

    Authors: The manuscript already reports the dataset-specific rankings (augmentation highest on Pima, imbalance handling highest on Stroke) to illustrate that the governing components can vary while still forming a small subset. This supports the claim of a structured search space rather than contradicting it. We acknowledge the limitation of using only two datasets and will revise the abstract and add a paragraph in the discussion to explicitly state that the findings apply to the evaluated healthcare risk prediction tasks on these datasets. The framework itself is presented as a general tool for identifying such subsets per task. No new experiments are feasible at this stage, but the scope will be clarified. revision: partial

  2. Referee: [Abstract] Abstract: the manuscript provides no description of the sampling strategy used to generate the 18,000 pipeline configurations or of any post-hoc filtering; without this information the Random Forest importance scores and the RMS-distance redundancy results cannot be reliably interpreted as evidence of an intrinsically structured search space.

    Authors: We agree that a description of how the configurations were generated is necessary for interpreting the importance and redundancy analyses. We will add a dedicated subsection to the Methods section detailing the sampling strategy: the 18,000+ configurations were produced by systematically enumerating all valid combinations of the pipeline components (augmentation variants, imbalance handlers, feature selectors, and models) within the framework's defined search space, with no post-hoc filtering applied. This revision will allow readers to assess the results as evidence of structure in the explored space. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical results from independent pipeline runs

Full rationale

The paper introduces a logging framework for AutoML pipelines and then executes >18,000 configurations on two fixed datasets, recording performance metrics. Component importances and redundancy are computed post-hoc via Random Forest fits and RMS distances on those observed outcomes. No step reduces a claimed result to a fitted parameter by construction, nor does any self-citation or definitional loop carry the central claim about search-space structure. The findings rest on externally generated experimental data rather than on the framework's own definitions being re-used as both input and output. This is a standard reproducible empirical study whose derivation chain remains self-contained against the reported runs.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on standard supervised learning assumptions plus the modeling choice that exhaustive enumeration of more than 18,000 configurations adequately probes the space; no new physical entities are postulated.

free parameters (1)
  • Pipeline configuration choices
    Hyperparameters and component selections across augmentation, model, and imbalance modules are optimized over the enumerated space.
axioms (1)
  • domain assumption: Random Forest feature importance reliably ranks the contribution of pipeline components to final performance
    Invoked when reporting augmentation (0.454) and other importance values on the Pima dataset.
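One way to probe this axiom is to compare the Random Forest ranking against a simpler, model-free attribution computed from the same logs. The sketch below is not Random Forest importance and is not the paper's method. It is a swapped-in stand-in: the share of score variance explained by the per-value means of each component (a main-effect decomposition in the style of eta squared). The log rows and the function name `main_effect_share` are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pvariance

# Hypothetical log rows: component choices plus a score (invented values).
LOGS = [
    ({"augmentation": "none", "model": "rf"}, 0.80),
    ({"augmentation": "none", "model": "svm"}, 0.78),
    ({"augmentation": "noise", "model": "rf"}, 0.90),
    ({"augmentation": "noise", "model": "svm"}, 0.88),
]


def main_effect_share(logs, component):
    """Fraction of total score variance explained by one component's group means."""
    scores = [score for _, score in logs]
    total = pvariance(scores)
    if total == 0:
        return 0.0
    groups = defaultdict(list)
    for config, score in logs:
        groups[config[component]].append(score)
    grand = mean(scores)
    between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups.values()) / len(scores)
    return between / total


shares = {c: main_effect_share(LOGS, c) for c in ("augmentation", "model")}
print(shares)
```

If a ranking of this kind disagreed sharply with the Random Forest ranking on the real logs, that would be a reason to treat the axiom with caution. Note that main-effect shares ignore interactions, which the paper treats as important, so the two measures are not expected to match exactly.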

pith-pipeline@v0.9.0 · 5843 in / 1409 out tokens · 52481 ms · 2026-05-22T01:28:02.358065+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

    Relation between the paper passage and the cited Recognition theorem.

    Experiments on the Pima Indians Diabetes and Stroke datasets across more than 18,000 pipeline configurations reveal a structured and partially redundant search space, where performance is governed by a small subset of interacting components. Random Forest importance analysis identifies augmentation (0.454), model choice (0.198), and imbalance handling (0.101) as key drivers on Pima.

  • IndisputableMonolith/Foundation/ArithmeticFromLogic.lean embed_injective unclear

    Relation between the paper passage and the cited Recognition theorem.

    Component similarity analysis shows strong redundancy, with feature selection variants (biMax-biMean) exhibiting low RMS distance (0.0252), mixup closely matching no augmentation (0.0279).

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
