An Open Source AutoML Benchmark

Bernd Bischl; Erin LeDell; Janek Thomas; Joaquin Vanschoren; Pieter Gijsbers; Sébastien Poirier

arxiv: 1907.00909 · v1 · pith:OIMCYQUPnew · submitted 2019-07-01 · 💻 cs.LG · stat.ML

Pieter Gijsbers, Erin LeDell, Janek Thomas, Sébastien Poirier, Bernd Bischl, Joaquin Vanschoren

Pith reviewed 2026-05-25 11:50 UTC · model grok-4.3

keywords automl · framework · benchmark · datasets · open · results · systems · across

The pith

Introduces an open, ongoing benchmark framework for AutoML and compares four systems across 39 public datasets.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Automated machine learning (AutoML) aims to build good models with little manual effort. Many AutoML tools exist, but comparing them fairly has been hard because studies use different datasets, metrics, and rules. This paper creates a standard way to test these tools that is open to everyone, uses only public data, and is designed to avoid common errors in comparisons. The authors ran four AutoML systems on 39 datasets and published the results on a website that can be updated as new systems appear. Others can extend the framework by adding tools or datasets.

Core claim

We introduce an open, ongoing, and extensible benchmark framework which follows best practices and avoids common mistakes. The framework is open-source, uses public datasets and has a website with up-to-date results. We use the framework to conduct a thorough comparison of 4 AutoML systems across 39 datasets and analyze the results.

Load-bearing premise

That the selected 39 datasets, evaluation protocol, and chosen best practices are representative enough to support general conclusions about the relative performance of AutoML systems.

Original abstract

In recent years, an active field of research has developed around automated machine learning (AutoML). Unfortunately, comparing different AutoML systems is hard and often done incorrectly. We introduce an open, ongoing, and extensible benchmark framework which follows best practices and avoids common mistakes. The framework is open-source, uses public datasets and has a website with up-to-date results. We use the framework to conduct a thorough comparison of 4 AutoML systems across 39 datasets and analyze the results.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces an open, ongoing, and extensible benchmark framework for AutoML which follows best practices and avoids common mistakes. The framework is open-source, uses public datasets, maintains a website with up-to-date results, and is demonstrated via a comparison of 4 AutoML systems across 39 datasets with analysis of the results.

Significance. If the framework's protocol is sound and the dataset selection representative, this provides a valuable community resource for standardized, reproducible AutoML comparisons. The open-source nature, public datasets, and ongoing website are explicit strengths that support extensibility and reduce barriers to entry for future evaluations.

major comments (2)

[Abstract] The central claim that the framework 'follows best practices and avoids common mistakes' is load-bearing for both the framework contribution and the 'thorough comparison,' yet no explicit justification, dataset selection criteria, diversity metrics, or protocol details (e.g., time budgets, CV implementation, leakage prevention) are supplied to support this assertion.
[Experiments / Results] The comparison of 4 systems on 39 datasets is presented as enabling analysis of relative performance, but the manuscript provides no sensitivity analysis or evidence that the 39 datasets are diverse in domain, size, feature type, and difficulty; this assumption is load-bearing for any general conclusions drawn from the results.
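
The first major comment asks for protocol details such as the CV implementation and leakage prevention. As a minimal sketch of what such a detail could look like, the following assumes stratified k-fold cross-validation with k = 10. The function name `stratified_kfold`, the seed, and the toy labels are illustrative assumptions and are not taken from the benchmark's code.

```python
import random
from collections import defaultdict


def stratified_kfold(labels, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs whose test folds keep class proportions roughly equal.

    Hypothetical helper for illustration; not the benchmark's implementation.
    """
    rng = random.Random(seed)

    # Group row indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    # Shuffle within each class, then deal indices round-robin into k folds,
    # so every fold receives a near-equal share of every class.
    folds = [[] for _ in range(k)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)

    # Each fold serves once as the test set; all other folds form the training set.
    for i in range(k):
        test_idx = sorted(folds[i])
        train_idx = sorted(
            idx for j, fold in enumerate(folds) if j != i for idx in fold
        )
        yield train_idx, test_idx


# Toy labels (70 of class 0, 30 of class 1); not data from the benchmark.
labels = [0] * 70 + [1] * 30
splits = list(stratified_kfold(labels, k=10))

for train_idx, test_idx in splits:
    # Leakage check: no row may appear in both the training and the test set.
    assert set(train_idx).isdisjoint(test_idx)
```

With these toy labels every test fold holds 7 rows of class 0 and 3 rows of class 1, matching the 70/30 ratio of the full set. The disjointness assertion is the simplest form of leakage check; it does not cover leakage through preprocessing fitted on the full data.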

minor comments (2)

Clarify in the text how the website will be maintained for ongoing updates and how new systems or datasets can be added to the framework.
Ensure all code, datasets, and evaluation scripts are linked with DOIs or persistent identifiers in addition to the GitHub reference.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and positive recognition of the framework's openness and extensibility. We address the two major comments point-by-point below, proposing targeted revisions to strengthen the justification and evidence without altering the core contribution.

Referee: [Abstract] The central claim that the framework 'follows best practices and avoids common mistakes' is load-bearing for both the framework contribution and the 'thorough comparison,' yet no explicit justification, dataset selection criteria, diversity metrics, or protocol details (e.g., time budgets, CV implementation, leakage prevention) are supplied to support this assertion.

Authors: The manuscript body (Sections 2 and 3) explicitly describes the protocol choices that follow established best practices from the AutoML literature, including public OpenML datasets, stratified 10-fold CV, fixed per-dataset time budgets, and safeguards against leakage via proper train/test splits. The abstract is space-constrained and therefore omits these details, but we agree the claim would be better supported by a concise reference. We will revise the abstract to note that the protocol is detailed in the methods and avoids common pitfalls such as data leakage and inconsistent evaluation. revision: partial

Referee: [Experiments / Results] The comparison of 4 systems on 39 datasets is presented as enabling analysis of relative performance, but the manuscript provides no sensitivity analysis or evidence that the 39 datasets are diverse in domain, size, feature type, and difficulty; this assumption is load-bearing for any general conclusions drawn from the results.

Authors: Dataset selection was performed from OpenML to span multiple domains (e.g., biology, finance, image-derived), instance counts (hundreds to >100k), feature types, and difficulty levels as measured by baseline performance; these characteristics are summarized in the experiments section and the accompanying website. We acknowledge that an explicit diversity table and sensitivity check would make the representativeness clearer. We will add a table of dataset statistics (instances, features, classes, domain) and a short paragraph discussing coverage and limitations of the collection. revision: yes
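
The table of dataset statistics proposed in this response (instances, features, classes, domain) might be produced along the following lines. This is a sketch under stated assumptions: the `DatasetStats` record, the `coverage` and `render_table` helpers, and the three placeholder rows are all hypothetical and do not describe any of the 39 benchmark datasets.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetStats:
    """One row of the proposed statistics table (hypothetical record type)."""
    name: str
    n_instances: int
    n_features: int
    n_classes: int
    domain: str


def coverage(datasets):
    """Summarize the spread of a collection: (min, max) per numeric column and counts per domain."""
    summary = {}
    for column in ("n_instances", "n_features", "n_classes"):
        values = [getattr(d, column) for d in datasets]
        summary[column] = (min(values), max(values))
    summary["domains"] = dict(Counter(d.domain for d in datasets))
    return summary


def render_table(datasets):
    """Render the rows as a fixed-width plain-text table."""
    header = f"{'dataset':<12}{'instances':>10}{'features':>10}{'classes':>9}  domain"
    rows = [
        f"{d.name:<12}{d.n_instances:>10}{d.n_features:>10}{d.n_classes:>9}  {d.domain}"
        for d in datasets
    ]
    return "\n".join([header, *rows])


# Placeholder rows for illustration only; these are NOT datasets from the benchmark.
toy = [
    DatasetStats("toy-a", 500, 20, 2, "biology"),
    DatasetStats("toy-b", 120_000, 8, 5, "finance"),
    DatasetStats("toy-c", 7_000, 784, 10, "image-derived"),
]

print(render_table(toy))
print(coverage(toy))
```

Reporting the ranges and the per-domain counts next to the table lets a reader see at a glance how wide the collection is and where it is thin, which is the coverage discussion the response commits to.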

Circularity Check

0 steps flagged

No circularity: empirical benchmark framework with no derivation chain

This paper introduces an open benchmark framework for comparing AutoML systems and reports results on 39 public datasets. It contains no mathematical derivations, first-principles predictions, fitted parameters renamed as predictions, or uniqueness theorems. The central claims rest on the framework's design choices and empirical runs against external public data, with no self-referential definitions or load-bearing self-citations that reduce the results to the inputs by construction. The representativeness concern is a validity issue, not a circularity issue.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper rests on the domain assumption that public datasets and a fixed evaluation protocol can produce meaningful comparisons of AutoML systems; no free parameters or invented entities are introduced in the abstract.

axioms (1)

Domain assumption: public datasets and a standardized protocol suffice for fair AutoML comparisons.
Invoked to justify the benchmark design and the comparison of four systems.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

MacrOData: New Benchmarks of Thousands of Datasets for Tabular Outlier Detection
cs.LG 2026-02 accept novelty 8.0

MacrOData supplies three large, curated benchmark suites totaling 2,446 datasets for tabular outlier detection, complete with standardized splits, metadata, and a public leaderboard.

Auto-FP: An Experimental Study of Automated Feature Preprocessing for Tabular Data
cs.LG 2023-10 unverdicted novelty 7.0

An experimental comparison of 15 HPO and NAS algorithms for automated feature preprocessing on 45 tabular datasets finds that evolution-based methods and random search are the top performers.

AutoGluon-Tabular: Robust and Accurate AutoML for Structured Data
stat.ML 2020-03 unverdicted novelty 5.0

AutoGluon-Tabular achieves superior accuracy on tabular classification and regression by multi-layer model ensembling and stacking, outperforming other AutoML frameworks on 50 benchmarks and Kaggle competitions.