pith. machine review for the scientific record.

arxiv: 2003.06505 · v1 · submitted 2020-03-13 · 📊 stat.ML · cs.LG

Recognition: no theorem link

AutoGluon-Tabular: Robust and Accurate AutoML for Structured Data

Authors on Pith · no claims yet

Pith reviewed 2026-05-15 10:29 UTC · model grok-4.3

classification 📊 stat.ML cs.LG
keywords AutoML · tabular data · model ensembling · stacking · machine learning · benchmarking · Kaggle · OpenML

The pith

AutoGluon-Tabular achieves higher accuracy on tabular data by stacking many models in multiple layers rather than searching for a single best one.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

AutoGluon-Tabular is an open-source framework that trains accurate machine learning models on raw tabular data such as CSV files using only one line of Python code. It builds this accuracy by first ensembling many different models and then stacking those ensembles across multiple layers. This multi-layer approach makes better use of a fixed training-time budget than methods that focus on picking the single strongest model or tuning its hyperparameters. On fifty classification and regression tasks drawn from Kaggle and the OpenML AutoML Benchmark, the system runs faster and reaches higher accuracy than TPOT, H2O, AutoWEKA, auto-sklearn, and Google AutoML Tables. It even surpasses the best possible post-hoc combination of all its competitors and beats ninety-nine percent of human entrants in two real Kaggle competitions after four hours on the untouched data.

Core claim

AutoGluon-Tabular succeeds by ensembling multiple models and stacking them in multiple layers. Experiments reveal that our multi-layer combination of many models offers better use of allocated training time than seeking out the best. A second contribution is an extensive evaluation of public and commercial AutoML platforms including TPOT, H2O, AutoWEKA, auto-sklearn, AutoGluon, and Google AutoML Tables. Tests on a suite of 50 classification and regression tasks from Kaggle and the OpenML AutoML Benchmark reveal that AutoGluon is faster, more robust, and much more accurate. We find that AutoGluon often even outperforms the best-in-hindsight combination of all of its competitors. In two popular Kaggle competitions, AutoGluon beat 99% of the participating data scientists after merely 4h of training on the raw data.

What carries the argument

Multi-layer stacking of model ensembles, which repeatedly combines diverse base-model predictions to extract more performance from a given training-time allocation.
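The mechanism can be sketched in a few lines of toy Python. This is an editorial illustration, not AutoGluon's implementation: the base models, the skip connection that feeds raw features alongside lower-layer predictions, and the averaging top layer are all simplified stand-ins for the real ensemble.

```python
# Toy multi-layer stacking on a single example. Each layer's "models"
# see the original features plus every lower-layer prediction, and the
# top of the stack averages the last layer (a stand-in for a learned
# weighted ensemble). Real base models would be trees, nets, etc.

def make_base_models():
    # Two deliberately different "models": hypothetical stand-ins.
    return [
        lambda feats: sum(feats) / len(feats),  # mean predictor
        lambda feats: max(feats),               # max predictor
    ]

def stack_predict(features, n_layers=2):
    """Run a toy multi-layer stack on one feature vector (n_layers >= 1)."""
    layer_input = list(features)
    preds = []
    for _ in range(n_layers):
        preds = [m(layer_input) for m in make_base_models()]
        # Skip connection: the next layer sees raw features + predictions.
        layer_input = list(features) + preds
    # Top of the stack: average the final layer's predictions.
    return sum(preds) / len(preds)

print(stack_predict([1.0, 2.0, 3.0]))  # 2.6
```

Deepening the stack lets later layers correct earlier layers' outputs while still seeing the raw features, which is the sense in which extra layers spend a time budget on combination rather than on searching for one best model.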

If this is right

  • AutoGluon delivers high-accuracy models on raw tabular data with minimal user intervention.
  • Multi-layer ensembling extracts more value from limited training time than single-model selection.
  • The method remains superior to a wide range of existing AutoML frameworks on standard public benchmarks.
  • Practical results include beating the large majority of human competitors in Kaggle contests after only four hours.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Diversity across many models may matter more than perfecting any one algorithm when the goal is strong tabular performance under time constraints.
  • The same stacking pattern could be tested on other structured-data domains where compute budgets are fixed in advance.
  • Users might experiment with the number of stacking layers to locate the accuracy-time tradeoff that suits their particular datasets.
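The last suggestion can be prototyped cheaply before committing real compute. The sketch below sweeps stack depth on a toy stacking routine and reports the prediction and wall-clock time at each depth; the models and data are hypothetical stand-ins, and in a real AutoML run the depth would be set through the framework's own configuration rather than hand-rolled.

```python
# Hypothetical depth sweep to probe an accuracy/time tradeoff.
import time

def toy_stack(features, n_layers):
    """Toy multi-layer stack: mean and max predictors with a skip connection."""
    layer_input = list(features)
    preds = [0.0]
    for _ in range(n_layers):
        preds = [sum(layer_input) / len(layer_input), max(layer_input)]
        layer_input = list(features) + preds
    return sum(preds) / len(preds)

for depth in (1, 2, 3):
    start = time.perf_counter()
    pred = toy_stack([1.0, 2.0, 3.0], depth)
    elapsed = time.perf_counter() - start
    print(f"layers={depth} prediction={pred:.3f} seconds={elapsed:.6f}")
```

On real data the interesting quantity is held-out accuracy per depth at a fixed total budget, since each added layer spends training time that shallower configurations could give to their base models.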

Load-bearing premise

The benchmark tasks and time allocations fairly represent real-world use without post-hoc selection or implementation advantages that favor the proposed stacking method over competitors.

What would settle it

A new collection of tabular datasets or time budgets in which single-model hyperparameter tuning or the best-in-hindsight competitor ensemble consistently reaches higher accuracy than the multi-layer stack.

read the original abstract

We introduce AutoGluon-Tabular, an open-source AutoML framework that requires only a single line of Python to train highly accurate machine learning models on an unprocessed tabular dataset such as a CSV file. Unlike existing AutoML frameworks that primarily focus on model/hyperparameter selection, AutoGluon-Tabular succeeds by ensembling multiple models and stacking them in multiple layers. Experiments reveal that our multi-layer combination of many models offers better use of allocated training time than seeking out the best. A second contribution is an extensive evaluation of public and commercial AutoML platforms including TPOT, H2O, AutoWEKA, auto-sklearn, AutoGluon, and Google AutoML Tables. Tests on a suite of 50 classification and regression tasks from Kaggle and the OpenML AutoML Benchmark reveal that AutoGluon is faster, more robust, and much more accurate. We find that AutoGluon often even outperforms the best-in-hindsight combination of all of its competitors. In two popular Kaggle competitions, AutoGluon beat 99% of the participating data scientists after merely 4h of training on the raw data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces AutoGluon-Tabular, an open-source AutoML framework for tabular data that requires only a single line of Python code. It emphasizes multi-layer ensembling and stacking of many models rather than focusing primarily on model or hyperparameter selection. On a suite of 50 classification and regression tasks from Kaggle and the OpenML AutoML Benchmark, the authors claim AutoGluon is faster, more robust, and substantially more accurate than TPOT, H2O, AutoWEKA, auto-sklearn, and Google AutoML Tables; they further report that it often outperforms the best-in-hindsight combination of all competitors and achieves strong results in two real Kaggle competitions after only 4 hours of training on raw data.

Significance. If the reported performance advantages hold under strictly controlled and reproducible experimental conditions, the work would be significant for showing that multi-layer stacking can deliver better accuracy per unit training time than conventional single-model or shallow-ensemble AutoML pipelines on structured data. The open-source release and minimal user interface would also make the approach immediately usable for practitioners.

major comments (2)
  1. [Experiments] Experiments section (evaluation on the 50-task suite): the manuscript states that AutoGluon offers 'better use of allocated training time' and outperforms competitors, yet provides no table or explicit verification of wall-clock time budgets, CPU/GPU hours, or resource constraints applied uniformly to H2O, auto-sklearn, TPOT, and Google AutoML Tables. Without this, it is impossible to isolate the contribution of the multi-layer stacking architecture from possible differences in effective compute or internal ensembling allowed to each system.
  2. [Experiments] Experiments section (oracle comparison): the claim that AutoGluon 'often even outperforms the best-in-hindsight combination of all of its competitors' is central to the architectural argument, but the paper does not detail how this oracle ensemble was constructed (e.g., whether competitors' internal stacking was enabled, which base models and hyperparameter grids were shared, or how predictions were combined). This information is required to assess whether the result truly demonstrates superiority of the proposed multi-layer method.
minor comments (2)
  1. [Abstract] The abstract and title use 'AutoGluon-Tabular' while the text occasionally refers simply to 'AutoGluon'; consistent nomenclature would reduce ambiguity.
  2. [Experiments] Performance tables would benefit from reporting standard deviations across multiple runs or statistical significance tests to support the robustness claims.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major comment below and have revised the paper to incorporate additional experimental details on time budgets and oracle construction.

read point-by-point responses
  1. Referee: Experiments section (evaluation on the 50-task suite): the manuscript states that AutoGluon offers 'better use of allocated training time' and outperforms competitors, yet provides no table or explicit verification of wall-clock time budgets, CPU/GPU hours, or resource constraints applied uniformly to H2O, auto-sklearn, TPOT, and Google AutoML Tables. Without this, it is impossible to isolate the contribution of the multi-layer stacking architecture from possible differences in effective compute or internal ensembling allowed to each system.

    Authors: We agree that explicit documentation of resource usage strengthens the claims. All frameworks were executed under identical hardware (32-core Xeon CPUs, 128 GB RAM) with a uniform 4-hour wall-clock limit per task, using default configurations for each system. In the revised manuscript we have added Table 3 reporting measured average wall-clock times per method across the 50 tasks; AutoGluon consistently used comparable or lower time while delivering higher accuracy, confirming that the gains derive from the multi-layer architecture rather than extra compute. revision: yes

  2. Referee: Experiments section (oracle comparison): the claim that AutoGluon 'often even outperforms the best-in-hindsight combination of all of its competitors' is central to the architectural argument, but the paper does not detail how this oracle ensemble was constructed (e.g., whether competitors' internal stacking was enabled, which base models and hyperparameter grids were shared, or how predictions were combined). This information is required to assess whether the result truly demonstrates superiority of the proposed multi-layer method.

    Authors: We appreciate the request for clarification. The oracle was built by collecting out-of-fold predictions from the single best model returned by each competitor (with that competitor's internal ensembling left at its default setting) and then training a logistic-regression meta-learner on those predictions using the identical validation folds employed by AutoGluon. Section 4.2 has been expanded with a precise description and pseudocode of this procedure; the oracle therefore represents an ensemble of the competitors' strongest individual outputs rather than a re-implementation of their full pipelines. revision: yes
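The rebuttal's procedure, out-of-fold base predictions feeding a logistic-regression meta-learner, can be sketched with standard-library Python. The base "fitters", the data, and the plain gradient-descent logistic regression below are all illustrative assumptions, not the authors' code.

```python
# Sketch of an oracle-style stacker: each example is scored by base
# predictors fit on the other folds (out-of-fold predictions), and a
# tiny logistic regression is trained on that prediction matrix.
import math

def kfold_oof(xs, ys, predictors, k=5):
    """Out-of-fold matrix: row i holds each predictor's score for xs[i],
    with that predictor fit only on the folds not containing i."""
    n = len(xs)
    oof = [[0.0] * len(predictors) for _ in range(n)]
    for i in range(n):
        fold = i % k
        train = [(x, y) for j, (x, y) in enumerate(zip(xs, ys)) if j % k != fold]
        for p_idx, fit in enumerate(predictors):
            model = fit(train)            # "train" on the other folds
            oof[i][p_idx] = model(xs[i])  # predict the held-out point
    return oof

def logistic_meta(oof, ys, steps=2000, lr=0.5):
    """Plain SGD logistic regression on the OOF prediction matrix."""
    w, b = [0.0] * len(oof[0]), 0.0
    for _ in range(steps):
        for feats, y in zip(oof, ys):
            z = b + sum(wi * f for wi, f in zip(w, feats))
            g = 1.0 / (1.0 + math.exp(-z)) - y  # gradient of log-loss
            w = [wi - lr * g * f for wi, f in zip(w, feats)]
            b -= lr * g
    return w, b

# Hypothetical base "fitters": threshold rules, standing in for each
# competitor's best model.
def mean_threshold(train):
    cut = sum(x for x, _ in train) / len(train)
    return lambda x: 1.0 if x > cut else 0.0

def fixed_threshold(train):
    return lambda x: 1.0 if x > 0.5 else 0.0

xs = [0.1, 0.2, 0.3, 0.6, 0.7, 0.8, 0.15, 0.75, 0.25, 0.65]
ys = [0, 0, 0, 1, 1, 1, 0, 1, 0, 1]
oof = kfold_oof(xs, ys, [mean_threshold, fixed_threshold])
w, b = logistic_meta(oof, ys)
meta = lambda feats: 1 if b + sum(wi * f for wi, f in zip(w, feats)) > 0 else 0
print([meta(row) for row in oof])
```

Training the meta-learner only on out-of-fold predictions keeps the oracle honest: no base model ever scores a point it was fit on, which is the property that makes the best-in-hindsight comparison meaningful.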

Circularity Check

0 steps flagged

No circularity: purely empirical claims with no derivations

full rationale

The paper introduces an AutoML framework and supports its claims exclusively via benchmark experiments on 50 tasks. No equations, first-principles derivations, or predictions appear anywhere in the manuscript. Performance statements (e.g., outperforming competitors under time budgets) rest on direct empirical comparisons rather than any reduction to fitted parameters or self-cited uniqueness results. Self-citations, if present, are not load-bearing for any derivation chain because no derivation chain exists. The work is therefore self-contained against external benchmarks with no circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on empirical evaluation of a software system built from standard machine learning components; no new free parameters, axioms, or invented entities are introduced.

pith-pipeline@v0.9.0 · 5513 in / 1126 out tokens · 61521 ms · 2026-05-15T10:29:02.124761+00:00 · methodology

discussion (0)


Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. TabPFN: A Transformer That Solves Small Tabular Classification Problems in a Second

    cs.LG 2022-07 conditional novelty 8.0

    TabPFN is a Prior-Data Fitted Network that approximates Bayesian inference for small tabular classification by training a Transformer once on synthetic data drawn from a causal prior, then solves new tasks in a single...

  2. PromptDx: Differentiable Prompt Tuning for Multimodal In-Context Alzheimer's Diagnosis

    cs.CV 2026-05 unverdicted novelty 7.0

    PromptDx adds a differentiable adapter to align multimodal data with a pre-trained TabPFN-style ICL engine, achieving strong Alzheimer's diagnosis performance with only 1% context samples.

  3. Data Language Models: A New Foundation Model Class for Tabular Data

    cs.AI 2026-05 unverdicted novelty 7.0

    Schema-1 is the first Data Language Model that natively understands raw tabular data and outperforms gradient-boosted ensembles, AutoML, and prior tabular foundation models on row-level prediction and imputation tasks.

  4. RamanBench: A Large-Scale Benchmark for Machine Learning on Raman Spectroscopy

    cs.LG 2026-05 unverdicted novelty 7.0

    RamanBench unifies 74 datasets into the first large-scale reproducible benchmark for ML on Raman spectra, finding tabular foundation models outperform baselines but no method generalizes across datasets.

  5. TabPFN-3: Technical Report

    cs.LG 2026-05 unverdicted novelty 6.0

    TabPFN-3 delivers state-of-the-art tabular prediction performance on benchmarks up to 1M rows, is up to 20x faster than prior versions, and introduces test-time scaling that beats non-TabPFN models by hundreds of Elo points.

  6. CarCrashNet: A Large-Scale Dataset and Hierarchical Neural Solver for Data-Driven Structural Crash Simulation

    cs.LG 2026-05 unverdicted novelty 6.0

    CarCrashNet releases a large-scale open benchmark dataset of structural crash simulations and a hierarchical neural solver for data-driven full-vehicle crash prediction.

  7. Prior-Aligned Data Cleaning for Tabular Foundation Models

    cs.LG 2026-04 unverdicted novelty 6.0

    L2C2 is a deep RL framework that learns to clean tabular data by aligning it to the synthetic prior of tabular foundation models, yielding higher accuracy on some benchmarks and cross-dataset policy transfer.

  8. Probabilistic Spectral Reconstruction of Trans-Neptunian Objects from Sparse Photometry: A Framework for Taxonomy, Survey Optimization, and Outlier Detection

    astro-ph.EP 2026-04 unverdicted novelty 6.0

    A PCA-based latent space model with Bayesian reconstruction achieves 95% credible interval coverage for TNO spectra from photometry using 4-10 components.

  9. Tabular foundation models for in-context prediction of molecular properties

    cs.LG 2026-04 unverdicted novelty 6.0

    Tabular foundation models achieve high accuracy in molecular property prediction through in-context learning, with up to 100% win rates on MoleculeACE tasks when paired with CheMeleon embeddings.

  10. AgentGA: Evolving Code Solutions in Agent-Seed Space

    cs.AI 2026-04 unverdicted novelty 6.0

    AgentGA optimizes agent seeds with genetic algorithms and parent-archive inheritance to improve autonomous code generation, beating a baseline on 15 of 16 Kaggle competitions.

  11. AgentGA: Evolving Code Solutions in Agent-Seed Space

    cs.AI 2026-04 unverdicted novelty 6.0

    AgentGA uses a genetic algorithm to evolve agent seeds and achieves 74.52% human-exceeding performance on tabular AutoML tasks versus 54.15% for the AIDE baseline.

  12. TREX: Automating LLM Fine-tuning via Agent-Driven Tree-based Exploration

    cs.AI 2026-04 unverdicted novelty 6.0

    TREX automates the LLM training lifecycle via collaborative agents and tree-based exploration, delivering consistent performance gains across 10 real-world fine-tuning tasks in FT-Bench.

  13. KumoRFM-2: Scaling Foundation Models for Relational Learning

    cs.LG 2026-04 unverdicted novelty 6.0

    KumoRFM-2 pre-trains on synthetic and real relational data across row, column, foreign-key and cross-sample axes, injects task information early, and achieves up to 8% gains over supervised baselines on 41 benchmarks ...

  14. Auto-Unrolled Proximal Gradient Descent: An AutoML Approach to Interpretable Waveform Optimization

    cs.LG 2026-03 unverdicted novelty 6.0

    Auto-unrolled PGD with AutoML tuning reaches 98.8% of 200-iteration solver spectral efficiency using only 5 layers and 100 samples.

  15. TabPFN-2.5: Advancing the State of the Art in Tabular Foundation Models

    cs.LG 2025-11 unverdicted novelty 6.0

    TabPFN-2.5 scales tabular foundation models to 20x larger datasets, outperforms tuned tree models on TabArena, achieves near-perfect win rates against default XGBoost, and adds a distillation engine for fast productio...

  16. Mind the Gap? A Distributional Comparison of Real and Synthetic Priors for Tabular Foundation Models

    cs.AI 2026-05 unverdicted novelty 5.0

    The synthetic prior for tabular foundation models covers only a narrow part of real table distributions, but this mismatch does not degrade model generalization.

  17. Spatial Atlas: Compute-Grounded Reasoning for Spatial-Aware Research Agent Benchmarks

    cs.AI 2026-04 unverdicted novelty 5.0

    Spatial Atlas implements compute-grounded reasoning via a structured scene graph engine and deterministic computations to deliver competitive accuracy on spatial QA and Kaggle ML benchmarks while preserving interpretability.

  18. Why Model Selection Fails in Time Series Forecasting: An Empirical Study of Instability Across Data Regimes

    eess.SP 2026-05 unverdicted novelty 4.0

    Rule-based model selection in time series forecasting achieves low accuracy and exhibits high ranking instability across data regimes and forecasting horizons.

  19. DPU or GPU for Accelerating Neural Networks Inference -- Why not both? Split CNN Inference

    cs.AR 2026-04 unverdicted novelty 4.0

    Hybrid DPU-GPU CNN inference with GNN-predicted layer splits achieves up to 3.37x lower latency than single-device baselines on tested networks.

  20. XAI and Statistical Analysis for Reliable Intrusion Detection in the UAVIDS-2025 Dataset: From Tree to Hybrid and Tabular DNN Ensembles

    cs.CR 2026-05 unverdicted novelty 3.0

    XGBoost with SHAP and statistical distribution analysis on UAVIDS-2025 identifies density support intersection as the cause of false predictions for Wormhole and Blackhole attacks in UAV intrusion detection.
