pith. machine review for the scientific record.

arxiv: 2604.05225 · v1 · submitted 2026-04-06 · 📊 stat.CO · cs.LG · stat.AP · stat.ML

Recognition: no theorem link

fastml: Guarded Resampling Workflows for Safer Automated Machine Learning in R


Pith reviewed 2026-05-10 18:37 UTC · model grok-4.3

classification 📊 stat.CO · cs.LG · stat.AP · stat.ML
keywords preprocessing leakage · guarded resampling · automated machine learning · R package · data leakage prevention · resampling workflows · survival analysis

The pith

Guarded resampling re-estimates preprocessing inside each fold to stop leakage from inflating machine learning performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents fastml, an R package that supplies a single-call workflow for automated machine learning while enforcing guarded resampling. In guarded resampling, data-dependent steps such as scaling or imputation are estimated only on the training portion of each resample and then applied to the held-out portion. A Monte Carlo simulation shows that performing these steps once on the full dataset before any splitting produces markedly higher apparent performance than the guarded approach. The package also supplies grouped and time-ordered resampling, blocks risky configurations, audits recipes for external dependencies, and delivers integrated explanations. In benchmarks it matched the held-out performance of tidymodels workflows while cutting the amount of code needed for orchestration, and it supported unified survival-model comparisons.

Core claim

The central claim is that global preprocessing substantially inflates apparent performance relative to guarded resampling, and that fastml implements the guarded approach through a single-call interface that matches tidymodels held-out performance, reduces workflow orchestration, and enables consistent survival-model benchmarking across datasets of varying size.

What carries the argument

Guarded resampling, in which preprocessing transformations are re-estimated inside each training fold and applied only to the matching assessment set.
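
As a language-agnostic illustration of the mechanism described above, here is a minimal Python sketch. It is not fastml's R API and not code from the paper. The helper names (`fit_scaler`, `apply_scaler`, `kfold_indices`, `guarded_scale`, `leaky_scale`) are hypothetical, the folds are contiguous and unshuffled for readability, and the use of the population standard deviation is an arbitrary assumption.

```python
from statistics import mean, pstdev


def fit_scaler(column):
    """Estimate centering and scaling parameters from the given values only."""
    mu = mean(column)
    sigma = pstdev(column) or 1.0  # fall back to 1.0 for a constant column
    return mu, sigma


def apply_scaler(column, params):
    """Apply previously estimated parameters; nothing is re-estimated here."""
    mu, sigma = params
    return [(x - mu) / sigma for x in column]


def kfold_indices(n, k):
    """Yield (train_idx, assess_idx) for contiguous, unshuffled folds."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        assess = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, assess
        start += size


def guarded_scale(values, k):
    """Guarded path: the scaler is re-estimated on each fold's training rows."""
    out = []
    for train, assess in kfold_indices(len(values), k):
        params = fit_scaler([values[i] for i in train])
        out.append(apply_scaler([values[i] for i in assess], params))
    return out


def leaky_scale(values, k):
    """Leaky path: the scaler is estimated once on all rows before splitting."""
    scaled = apply_scaler(values, fit_scaler(values))
    return [[scaled[i] for i in assess]
            for _, assess in kfold_indices(len(values), k)]


values = [1.0, 2.0, 3.0, 4.0, 5.0, 100.0]  # toy data with one outlier
guarded = guarded_scale(values, k=3)
leaky = leaky_scale(values, k=3)
print("guarded, last fold:", guarded[2])
print("leaky,   last fold:", leaky[2])
```

In this toy, the outlier 100.0 sits in the last assessment fold. The fold-local scaler, estimated on 1.0 to 4.0 only, maps it to roughly 87 standard deviations, while the global scaler, which has already seen it, maps it to roughly 2.2. The assessment row has influenced its own transformation, which is the pattern guarded resampling is meant to prevent.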

If this is right

  • Users obtain performance estimates that more closely reflect future deployment accuracy.
  • Automated pipelines in R can safely handle grouped or time-ordered data without manual per-fold coding.
  • Risk of over-optimistic model selection decreases when high-risk preprocessing patterns are blocked by default.
  • Survival analysis gains a unified interface that applies guarded steps consistently across model classes.
  • Workflow code length shrinks while maintaining auditability and explanation output.
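
The grouped and time-ordered resampling mentioned in the list above can be sketched in a few lines of Python. This is an illustrative assumption about what such splits look like in general, not a description of how fastml or any R package builds them; `grouped_folds` and `rolling_origin` are hypothetical helper names.

```python
def grouped_folds(groups, k):
    """Assign whole groups to folds so no group spans train and assessment."""
    unique = sorted(set(groups))
    fold_of = {g: i % k for i, g in enumerate(unique)}
    for fold in range(k):
        assess = [i for i, g in enumerate(groups) if fold_of[g] == fold]
        train = [i for i, g in enumerate(groups) if fold_of[g] != fold]
        yield train, assess


def rolling_origin(n, initial, horizon):
    """Time-ordered splits: train on rows [0, end), assess on the next rows."""
    end = initial
    while end + horizon <= n:
        yield list(range(end)), list(range(end, end + horizon))
        end += horizon


groups = ["a", "a", "b", "b", "c", "c", "d"]  # e.g. one label per patient
grouped = list(grouped_folds(groups, k=2))
ordered = list(rolling_origin(n=10, initial=4, horizon=2))

for train, assess in grouped:
    print("grouped train:", train, "assess:", assess)
for train, assess in ordered:
    print("ordered train up to row", train[-1], "assess:", assess)
```

Either generator can feed a fold-local preprocessing loop: the transformation is estimated on the `train` rows and applied to the `assess` rows, so grouping or time order constrains the split while the guard constrains the preprocessing.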

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Adoption could encourage other AutoML frameworks to treat guarded preprocessing as a default rather than an optional setting.
  • The same leakage pattern likely appears in non-R environments, suggesting a general need for language-agnostic guarded-resample primitives.
  • Practitioners may discover that previously published high accuracies were partly artifacts of global preprocessing, prompting re-evaluation of legacy benchmarks.
  • Extension to streaming or online learning settings would require time-aware guarded updates inside sliding windows.

Load-bearing premise

The Monte Carlo simulation and survival benchmarks are representative of real leakage scenarios and the package implementation introduces no new biases.

What would settle it

A controlled experiment on a held-out dataset that applies the identical models once with global preprocessing and once with fold-local preprocessing, then checks whether the performance gap matches the size reported in the simulation.
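
The design of such an experiment can be sketched as a small Python harness that scores one learner twice on identical folds, changing only when the preprocessing is estimated. Everything here is a hypothetical illustration: `run_paired` and the toy components (mean-centering, a midpoint-threshold classifier, hand-written folds) are assumptions chosen for brevity and are not taken from the paper or from fastml.

```python
from statistics import mean


def run_paired(rows, labels, folds, fit_prep, apply_prep,
               fit_model, predict, score):
    """Score one learner twice per fold; only preprocessing timing differs."""
    global_prep = fit_prep(rows)  # leaky arm: estimated once on every row
    paired = []
    for train, assess in folds:
        x_tr = [rows[i] for i in train]
        x_as = [rows[i] for i in assess]
        y_tr = [labels[i] for i in train]
        y_as = [labels[i] for i in assess]
        fold_scores = {}
        arms = (("global", global_prep), ("fold_local", fit_prep(x_tr)))
        for arm, prep in arms:
            model = fit_model(apply_prep(x_tr, prep), y_tr)
            preds = predict(model, apply_prep(x_as, prep))
            fold_scores[arm] = score(preds, y_as)
        paired.append(fold_scores)
    return paired


# Toy components (hypothetical): mean-centering and a midpoint threshold.
def fit_center(xs):
    return mean(xs)


def apply_center(xs, mu):
    return [x - mu for x in xs]


def fit_threshold(xs, ys):
    m0 = mean([x for x, y in zip(xs, ys) if y == 0])
    m1 = mean([x for x, y in zip(xs, ys) if y == 1])
    return (m0 + m1) / 2


def predict_threshold(thr, xs):
    return [1 if x > thr else 0 for x in xs]


def accuracy(preds, truth):
    return mean([1.0 if p == t else 0.0 for p, t in zip(preds, truth)])


rows = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
labels = [0, 0, 0, 1, 1, 1]
folds = [([1, 2, 4, 5], [0, 3]), ([0, 2, 3, 5], [1, 4]), ([0, 1, 3, 4], [2, 5])]
paired = run_paired(rows, labels, folds, fit_center, apply_center,
                    fit_threshold, predict_threshold, accuracy)
gaps = [p["global"] - p["fold_local"] for p in paired]
print(gaps)
```

In this toy the midpoint-threshold learner is unaffected by a constant shift, so both arms agree and every gap is zero, which serves as a sanity check of the harness. A nonzero gap can only appear when the learner's predictions depend on the estimated transformation and the global and fold-level estimates differ, so a real test of the paper's claim would plug in the models and preprocessing steps it actually reports.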

Figures

Figures reproduced from arXiv: 2604.05225 by Dincer Goksuluk, Eda Karaismailoglu, Selcuk Korkmaz.

Figure 1. Illustration of a leaky cross-validation workflow caused by global preprocessing. When fastml executes its guarded resampling path (i.e., when workflows are fitted through …
Figure 2. Panel A shows the distribution of ROC AUC values across runs for the leaky and guarded workflows. Panel B shows the paired within-run change in AUC, highlighting the systematic drop in performance when preprocessing is moved inside the resampling loop.
Original abstract

Preprocessing leakage arises when scaling, imputation, or other data-dependent transformations are estimated before resampling, inflating apparent performance while remaining hard to detect. We present fastml, an R package that provides a single-call interface for leakage-aware machine learning through guarded resampling, where preprocessing is re-estimated inside each resample and applied to the corresponding assessment data. The package supports grouped and time-ordered resampling, blocks high-risk configurations, audits recipes for external dependencies, and includes sandboxed execution and integrated model explanation. We evaluate fastml with a Monte Carlo simulation contrasting global and fold-local normalization, a usability comparison with tidymodels under matched specifications, and survival benchmarks across datasets of different sizes. The simulation demonstrates that global preprocessing substantially inflates apparent performance relative to guarded resampling. fastml matched held-out performance obtained with tidymodels while reducing workflow orchestration, and it supported consistent benchmarking of multiple survival model classes through a unified interface.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The manuscript introduces the fastml R package, which provides a single-call interface for leakage-aware machine learning via guarded resampling (re-estimating preprocessing inside each resample fold). It supports grouped and time-ordered resampling, blocks high-risk configurations, audits recipes for external dependencies, and includes sandboxed execution and model explanation. Evaluation consists of a Monte Carlo simulation contrasting global vs. fold-local normalization, a usability comparison against tidymodels under matched specifications, and survival benchmarks across datasets of different sizes. The simulation is said to show substantial performance inflation from global preprocessing; fastml is reported to match tidymodels held-out performance while reducing orchestration effort and to enable consistent survival model benchmarking.

Significance. If the guarded implementation introduces no new biases and the simulation correctly isolates preprocessing timing as the sole variable, the package would address a practically important source of non-reproducibility in R-based ML workflows. The explicit matching of held-out performance against tidymodels supplies an independent correctness check, and the unified interface for survival models plus support for advanced resampling schemes are concrete strengths that could reduce common leakage errors.

major comments (1)
  1. [Monte Carlo simulation description] The claim that global preprocessing 'substantially inflates apparent performance' is presented without any quantitative results (e.g., magnitude of inflation in accuracy/AUC, number of Monte Carlo replicates, standard errors or confidence intervals, dataset sizes or characteristics, or exact preprocessing steps). This information is load-bearing for the central empirical claim and for verifying that the simulation isolates timing of preprocessing as the sole variable.
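
The quantities requested in this comment (replicate count, magnitude of inflation, standard error, confidence interval) can be computed from paired per-replicate scores along the following lines. This Python sketch is an assumption about one reasonable summary, using a normal-approximation interval; `paired_summary` is a hypothetical helper, and the AUC values are placeholders invented for illustration, not results from the paper.

```python
from math import sqrt
from statistics import mean, stdev


def paired_summary(leaky, guarded, z=1.96):
    """Mean paired difference (leaky - guarded), its SE, and an approximate CI."""
    if len(leaky) != len(guarded) or len(leaky) < 2:
        raise ValueError("need two equally long sequences, at least 2 replicates")
    diffs = [a - b for a, b in zip(leaky, guarded)]
    d_bar = mean(diffs)
    se = stdev(diffs) / sqrt(len(diffs))  # sample SD of the paired differences
    return {
        "replicates": len(diffs),
        "mean_diff": d_bar,
        "se": se,
        "ci": (d_bar - z * se, d_bar + z * se),
    }


# Placeholder numbers for illustration only; NOT results from the paper.
leaky_auc = [0.80, 0.82, 0.78, 0.81]
guarded_auc = [0.75, 0.76, 0.74, 0.75]
summary = paired_summary(leaky_auc, guarded_auc)
print(summary)
```

Pairing within each replicate removes between-replicate variation from the comparison, which matches the paired within-run change in AUC that the paper's second figure is described as showing.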

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive review and for highlighting the practical value of guarded resampling in addressing leakage issues in R-based machine learning. We address the single major comment below and will revise the manuscript to incorporate the requested details.

Point-by-point responses
  1. Referee: Monte Carlo simulation description: the claim that global preprocessing 'substantially inflates apparent performance' is presented without any quantitative results (e.g., magnitude of inflation in accuracy/AUC, number of Monte Carlo replicates, standard errors or confidence intervals, dataset sizes or characteristics, or exact preprocessing steps). This information is load-bearing for the central empirical claim and for verifying that the simulation isolates timing of preprocessing as the sole variable.

    Authors: We agree that the current description of the Monte Carlo simulation in the manuscript lacks the specific quantitative details needed to fully substantiate the claim and allow independent verification. In the revised manuscript, we will expand the relevant section to report the number of Monte Carlo replicates, the characteristics of the datasets used (including sizes and feature properties), the exact preprocessing steps (such as the normalization method), and the observed performance differences with quantitative measures including the magnitude of inflation in metrics like AUC or accuracy, along with standard errors and confidence intervals. This will confirm that the simulation design isolates preprocessing timing as the sole variable and provide transparent support for the empirical findings.

    Revision: yes

Circularity Check

0 steps flagged

No significant circularity; software package with external benchmark validation

full rationale

The paper introduces the fastml R package for guarded resampling to prevent preprocessing leakage and evaluates it via Monte Carlo simulation (global vs. fold-local normalization), usability comparison to tidymodels, and survival benchmarks. No mathematical derivations, equations, or fitted parameters are presented as predictions. The central claims rest on direct empirical contrasts and matching held-out performance against an independent package (tidymodels), with no self-citation chains, ansatzes, or renamings that reduce results to inputs by construction. The work is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a software package paper with no mathematical derivations, free parameters, axioms, or invented entities; the central contribution is an implementation and empirical comparison.

pith-pipeline@v0.9.0 · 5474 in / 1107 out tokens · 33748 ms · 2026-05-10T18:37:47.239207+00:00 · methodology

