An Open Source AutoML Benchmark
Pith reviewed 2026-05-25 11:50 UTC · model grok-4.3
The pith
Introduces an open, ongoing benchmark framework for AutoML and compares four systems across 39 public datasets.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce an open, ongoing, and extensible benchmark framework which follows best practices and avoids common mistakes. The framework is open-source, uses public datasets and has a website with up-to-date results. We use the framework to conduct a thorough comparison of 4 AutoML systems across 39 datasets and analyze the results.
Load-bearing premise
That the selected 39 datasets, evaluation protocol, and chosen best practices are representative enough to support general conclusions about the relative performance of AutoML systems.
Original abstract
In recent years, an active field of research has developed around automated machine learning (AutoML). Unfortunately, comparing different AutoML systems is hard and often done incorrectly. We introduce an open, ongoing, and extensible benchmark framework which follows best practices and avoids common mistakes. The framework is open-source, uses public datasets and has a website with up-to-date results. We use the framework to conduct a thorough comparison of 4 AutoML systems across 39 datasets and analyze the results.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces an open, ongoing, and extensible benchmark framework for AutoML which follows best practices and avoids common mistakes. The framework is open-source, uses public datasets, maintains a website with up-to-date results, and is demonstrated via a comparison of 4 AutoML systems across 39 datasets with analysis of the results.
Significance. If the framework's protocol is sound and the dataset selection representative, this provides a valuable community resource for standardized, reproducible AutoML comparisons. The open-source nature, public datasets, and ongoing website are explicit strengths that support extensibility and reduce barriers to entry for future evaluations.
major comments (2)
- [Abstract] The central claim that the framework 'follows best practices and avoids common mistakes' is load-bearing for both the framework contribution and the 'thorough comparison,' yet the abstract supplies no explicit justification, dataset selection criteria, diversity metrics, or protocol details (e.g., time budgets, CV implementation, leakage prevention) to support it.
- [Experiments / Results] The comparison of 4 systems on 39 datasets is presented as enabling analysis of relative performance, but the manuscript provides no sensitivity analysis or evidence that the 39 datasets are diverse in domain, size, feature type, and difficulty; this assumption is load-bearing for any general conclusions drawn from the results.
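The diversity evidence requested in the second major comment could be as simple as a min/median/max summary per dataset property. The following is a minimal sketch under stated assumptions: the `DatasetMeta` record, its field names, and the `EXAMPLE` entries are hypothetical placeholders, not the benchmark's actual metadata or datasets.

```python
from dataclasses import dataclass
from statistics import median


@dataclass(frozen=True)
class DatasetMeta:
    """Hypothetical metadata record; field names are illustrative only."""
    name: str
    n_instances: int
    n_features: int
    n_categorical: int
    n_classes: int


def summarize(collection):
    """Return (min, median, max) per size property, plus the categorical share."""
    summary = {}
    for field in ("n_instances", "n_features", "n_classes"):
        values = sorted(getattr(d, field) for d in collection)
        summary[field] = (values[0], median(values), values[-1])
    # Fraction of datasets that contain at least one categorical feature.
    with_cat = sum(1 for d in collection if d.n_categorical > 0)
    summary["frac_with_categorical"] = with_cat / len(collection)
    return summary


# Made-up placeholder entries, NOT the 39 datasets used in the paper.
EXAMPLE = [
    DatasetMeta("toy_a", 500, 10, 0, 2),
    DatasetMeta("toy_b", 20_000, 120, 30, 5),
    DatasetMeta("toy_c", 150_000, 40, 4, 10),
    DatasetMeta("toy_d", 3_000, 8, 8, 2),
]

if __name__ == "__main__":
    for key, value in summarize(EXAMPLE).items():
        print(key, value)
```

A summary like this only shows spread in size and feature type; coverage of domain and difficulty would need additional fields, such as a domain tag or a baseline score per dataset.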
minor comments (2)
- Clarify in the text how the website will be maintained for ongoing updates and how new systems or datasets can be added to the framework.
- Ensure all code, datasets, and evaluation scripts are linked with DOIs or persistent identifiers in addition to the GitHub reference.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and positive recognition of the framework's openness and extensibility. We address the two major comments point-by-point below, proposing targeted revisions to strengthen the justification and evidence without altering the core contribution.
- Referee: [Abstract] The central claim that the framework 'follows best practices and avoids common mistakes' is load-bearing for both the framework contribution and the 'thorough comparison,' yet no explicit justification, dataset selection criteria, diversity metrics, or protocol details (e.g., time budgets, CV implementation, leakage prevention) are supplied to support this assertion.
  Authors: The manuscript body (Sections 2 and 3) explicitly describes the protocol choices that follow established best practices from the AutoML literature, including public OpenML datasets, stratified 10-fold CV, fixed per-dataset time budgets, and safeguards against leakage via proper train/test splits. The abstract is space-constrained and therefore omits these details, but we agree the claim would be better supported by a concise reference. We will revise the abstract to note that the protocol is detailed in the methods and avoids common pitfalls such as data leakage and inconsistent evaluation.
  Revision: partial
- Referee: [Experiments / Results] The comparison of 4 systems on 39 datasets is presented as enabling analysis of relative performance, but the manuscript provides no sensitivity analysis or evidence that the 39 datasets are diverse in domain, size, feature type, and difficulty; this assumption is load-bearing for any general conclusions drawn from the results.
  Authors: Dataset selection was performed from OpenML to span multiple domains (e.g., biology, finance, image-derived), instance counts (hundreds to >100k), feature types, and difficulty levels as measured by baseline performance; these characteristics are summarized in the experiments section and the accompanying website. We acknowledge that an explicit diversity table and sensitivity check would make the representativeness clearer. We will add a table of dataset statistics (instances, features, classes, domain) and a short paragraph discussing coverage and limitations of the collection.
  Revision: yes
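The first response above names stratified 10-fold CV with proper train/test splits as the leakage safeguard. The following is a minimal pure-Python sketch of that idea, not the benchmark's actual implementation: `stratified_kfold` and `majority_class_accuracy` are hypothetical helper names, and time budgets are left out of scope.

```python
import random
from collections import defaultdict


def stratified_kfold(labels, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs that keep class proportions roughly equal."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    offset = 0
    for members in by_class.values():
        rng.shuffle(members)
        # Deal each class round-robin, continuing where the previous class
        # stopped, so that overall fold sizes stay balanced.
        for i, idx in enumerate(members):
            folds[(offset + i) % k].append(idx)
        offset += len(members)

    for test_fold in range(k):
        test_idx = sorted(folds[test_fold])
        train_idx = sorted(
            i for f, fold in enumerate(folds) if f != test_fold for i in fold
        )
        yield train_idx, test_idx


def majority_class_accuracy(labels, k=10, seed=0):
    """Score a majority-class baseline that is fit on the training folds only."""
    scores = []
    for train_idx, test_idx in stratified_kfold(labels, k, seed):
        counts = defaultdict(int)
        for i in train_idx:  # the test fold is never seen while fitting
            counts[labels[i]] += 1
        prediction = max(counts, key=counts.get)
        hits = sum(1 for i in test_idx if labels[i] == prediction)
        scores.append(hits / len(test_idx))
    return sum(scores) / len(scores)


if __name__ == "__main__":
    toy_labels = [0] * 70 + [1] * 30  # made-up labels for illustration
    print(round(majority_class_accuracy(toy_labels), 3))
```

The point of the baseline is structural: anything learned from data (here, the majority class) is computed inside the loop from `train_idx` alone, which is the property a leakage check would need to verify for each AutoML system.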
Circularity Check
No circularity: empirical benchmark framework with no derivation chain
This paper introduces an open benchmark framework for comparing AutoML systems and reports results on 39 public datasets. It contains no mathematical derivations, first-principles predictions, fitted parameters renamed as predictions, or uniqueness theorems. The central claims rest on the framework's design choices and empirical runs against external public data, with no self-referential definitions or load-bearing self-citations that reduce the results to the inputs by construction. The representativeness concern is a validity issue, not a circularity issue.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: public datasets and a standardized protocol suffice for fair AutoML comparisons.
Forward citations
Cited by 3 Pith papers
- MacrOData: New Benchmarks of Thousands of Datasets for Tabular Outlier Detection
  MacrOData supplies three large, curated benchmark suites totaling 2,446 datasets for tabular outlier detection, complete with standardized splits, metadata, and a public leaderboard.
- Auto-FP: An Experimental Study of Automated Feature Preprocessing for Tabular Data
  Experimental comparison of 15 HPO and NAS algorithms for automated feature preprocessing on 45 tabular datasets finds evolution-based methods and random search as top performers.
- AutoGluon-Tabular: Robust and Accurate AutoML for Structured Data
  AutoGluon-Tabular achieves superior accuracy on tabular classification and regression by multi-layer model ensembling and stacking, outperforming other AutoML frameworks on 50 benchmarks and Kaggle competitions.