Revisiting Metafeatures to Explain Model Differences on Tabular Data

Andrej Tschalzev; Christian Bartelt; Markus Herre; Sascha Marton

arxiv: 2605.28418 · v2 · pith:JVLSI7OTnew · submitted 2026-05-27 · 💻 cs.LG

Pith reviewed 2026-06-29 14:30 UTC · model grok-4.3

keywords: tabular data · meta-features · model performance · TabArena · foundation models · statistical tests · heterogeneity · performance gaps

The pith

Meta-features fail to explain performance gaps between model families on tabular data after strict tests.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether dataset meta-features can account for why different model families perform differently on tabular prediction tasks. It uses the TabArena benchmark results on 51 datasets and applies statistical tests with false discovery rate control to check associations between model-agnostic meta-features and performance differences. No meta-feature survives the tests for neural network versus tree gaps. One association holds for non-foundation versus foundation model gaps but fails to generalize in leave-one-dataset-out checks. One association for TabICLv2 versus TabPFN-2.6 is robust and improves held-out prediction. Meta-feature based predictors add no meaningful value over a simple baseline in cross-dataset tests. The results highlight the heterogeneity of tabular datasets and the limits of global meta-feature explanations.

Core claim

After applying strict statistical tests with false discovery rate control to the TabArena benchmark, no meta-feature survives for explaining neural network versus tree model gaps; one association is robust for non-foundation versus foundation model gaps but does not generalize in leave-one-dataset-out tests; and one robust association for TabICLv2 versus TabPFN-2.6 does improve held-out prediction. Leave-one-dataset-out analysis shows meta-feature predictors do not meaningfully outperform a simple baseline, indicating heterogeneity in tabular datasets and limited robustness of global meta-feature approaches.

What carries the argument

Model-agnostic dataset meta-features linked to performance gaps via statistical tests with FDR control on the TabArena benchmark results.
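To make the false discovery rate step concrete, here is a minimal sketch of the Benjamini-Hochberg step-up procedure. The source only states that FDR control is applied; the choice of Benjamini-Hochberg, the 0.05 level, the function name `benjamini_hochberg`, and the example p-values are all assumptions made for illustration, not the authors' pipeline.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans: True where the hypothesis is rejected.

    Assumption: one p-value per meta-feature, all tested against the
    same performance gap. The level alpha=0.05 is illustrative.
    """
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])

    # Step-up rule: find the largest rank k with p_(k) <= (k / m) * alpha.
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_rank = rank

    # Reject every hypothesis whose rank is at most max_rank.
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_rank:
            rejected[idx] = True
    return rejected


# Hypothetical p-values for ten meta-features (not taken from the paper).
example_p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
survivors = benjamini_hochberg(example_p)
```

In this toy input only the two smallest p-values survive, although five of them are below 0.05 before correction. That is the sense in which an association can look significant on its own and still fail FDR control.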

If this is right

Global meta-feature approaches are not robust enough to explain model differences on the 51 TabArena datasets.
Tabular datasets show high heterogeneity that limits universal explanations.
One specific association between a meta-feature and the TabICLv2 versus TabPFN-2.6 gap improves held-out prediction.
Meta-feature predictors do not meaningfully beat a simple baseline in leave-one-dataset-out analysis.
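The leave-one-dataset-out comparison in the last point can be sketched as follows. This is a simplified illustration under stated assumptions: the baseline is taken to be the mean gap of the training datasets, the meta-feature predictor is a one-feature least-squares line, and the function name `lodo_mae` and the toy numbers are hypothetical. The paper's actual predictors and feature sets may differ.

```python
def lodo_mae(feature, gaps):
    """Leave-one-dataset-out MAE for a mean baseline and a one-feature line.

    feature: one meta-feature value per dataset (hypothetical input).
    gaps:    one performance gap per dataset.
    Returns (baseline_mae, model_mae).
    """
    n = len(gaps)
    baseline_errors, model_errors = [], []
    for held_out in range(n):
        # Fit only on the datasets that are not held out.
        train_x = [x for i, x in enumerate(feature) if i != held_out]
        train_y = [y for i, y in enumerate(gaps) if i != held_out]
        mean_x = sum(train_x) / len(train_x)
        mean_y = sum(train_y) / len(train_y)

        # Ordinary least squares slope; fall back to 0 for a constant feature.
        sxx = sum((x - mean_x) ** 2 for x in train_x)
        sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(train_x, train_y))
        slope = sxy / sxx if sxx > 0 else 0.0

        prediction = mean_y + slope * (feature[held_out] - mean_x)
        baseline_errors.append(abs(gaps[held_out] - mean_y))
        model_errors.append(abs(gaps[held_out] - prediction))
    return sum(baseline_errors) / n, sum(model_errors) / n


# Toy data in which the gap is an exact linear function of the feature.
baseline_mae, model_mae = lodo_mae([0, 1, 2, 3, 4], [1, 3, 5, 7, 9])
```

On this artificial input the line predicts every held-out gap exactly while the baseline does not. The paper's finding is that real meta-features on the 51 TabArena datasets do not produce a meaningful improvement of this kind.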

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Dataset-specific or local descriptors may be needed instead of global meta-features for model selection explanations.
Non-linear or interaction-based relationships among meta-features could be tested in follow-up work.
Model choice on tabular tasks may hinge on factors outside standard meta-feature sets.

Load-bearing premise

The analysis assumes that the TabArena benchmark performance estimates are stable and representative, and that the selected model-agnostic meta-features plus the chosen statistical tests would detect any genuine explanatory relationships if they existed.

What would settle it

A collection of new tabular datasets where at least one meta-feature consistently predicts performance gaps across model family comparisons, survives FDR control, and improves prediction accuracy in leave-one-dataset-out validation.

Figures

Figures reproduced from arXiv: 2605.28418 by Andrej Tschalzev, Christian Bartelt, Markus Herre, Sascha Marton.

**Figure 2.** Leave-one-dataset-out MAE. Points show the held-out gap prediction error. Lower values indicate better prediction. Comparisons shown: Non-TFM vs. TFM, TabICL v2 vs. TabPFN v2.6, NN vs. Tree. Feature sets in the legend: Majority Sign Baseline, Controls, Meta-Features, Controls + Meta-Features, Robust Meta-Features, Controls + Robust Meta-Features. [PITH_FULL_IMAGE:figures/full_fig_p004_2.png]

**Figure 3.** Leave-one-dataset-out sign accuracy for the three main comparisons. Points show how often each feature set predicts the correct held-out gap direction. Error bars show 95% bootstrap confidence intervals. Higher values indicate better routing accuracy.

For the comparison of the two foundation models currently leading the TabArena benchmark, median attribute concentration survives the robust…
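The sign accuracy and bootstrap intervals described in the Figure 3 caption can be illustrated with a short sketch. The caption states only that 95% bootstrap confidence intervals are shown; the percentile method, the resampling over datasets, the number of resamples, the treatment of zero gaps, and the names `sign_accuracy` and `bootstrap_ci` are assumptions made here.

```python
import random


def sign_accuracy(true_gaps, predicted_gaps):
    """Fraction of datasets where the predicted gap has the correct sign.

    Assumption: a gap of exactly zero is treated as non-positive.
    """
    hits = [(t > 0) == (p > 0) for t, p in zip(true_gaps, predicted_gaps)]
    return sum(hits) / len(hits)


def bootstrap_ci(true_gaps, predicted_gaps, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for sign accuracy, resampling datasets."""
    rng = random.Random(seed)
    n = len(true_gaps)
    stats = []
    for _ in range(n_boot):
        # Draw n datasets with replacement and recompute the accuracy.
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(sign_accuracy([true_gaps[i] for i in idx],
                                   [predicted_gaps[i] for i in idx]))
    stats.sort()
    lower = stats[int((1 - level) / 2 * n_boot)]
    upper = stats[min(n_boot - 1, int((1 + level) / 2 * n_boot))]
    return lower, upper


# Hypothetical held-out gaps and predictions for four datasets.
true_gaps = [0.2, -0.1, 0.3, 0.4]
predicted = [0.1, 0.2, 0.5, 0.1]
accuracy = sign_accuracy(true_gaps, predicted)
ci_low, ci_high = bootstrap_ci(true_gaps, predicted)
```

With only a few dozen datasets, intervals of this kind tend to be wide, which is worth keeping in mind when comparing feature sets against the majority sign baseline.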

Original abstract

With the rise of tabular foundation models alongside traditional models still performing well on many tasks, choosing the right model for a tabular dataset remains difficult. We investigate whether dataset meta-features can explain performance gaps between model families on tabular prediction tasks. Using the TabArena benchmark results, we analyze dataset-level performance gaps and relate them to model-agnostic dataset descriptors. After strict statistical tests with false discovery control, we find that (1) for neural network vs. tree gaps, no meta-feature survives false discovery control, (2) for non-foundation vs. foundation model gaps, one association is robust but does not generalize when tested in leave-one-dataset-out prediction, and (3) for TabICLv2 vs. TabPFN-2.6, one robust association also improves held-out prediction. Furthermore, we conduct a leave-one-dataset-out analysis and find that meta-feature predictors fail to improve meaningfully over a simple baseline. Overall, our results show the heterogeneity of tabular datasets and that global meta-feature approaches are not robust enough to offer explanations on the 51 TabArena datasets.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper's main result is that standard meta-features show almost no robust links to model-family performance gaps on the 51 TabArena datasets once FDR control is applied, and they add little in leave-one-dataset-out prediction.

The core finding is that global meta-feature approaches do not hold up for explaining why one model family beats another on tabular data. After FDR correction, no meta-feature survives for neural net versus tree gaps. Only one association survives for foundation versus non-foundation models, and it fails to generalize in leave-one-dataset-out prediction. The single association that both survives and improves held-out prediction concerns TabICLv2 versus TabPFN-2.6. Across comparisons, the leave-one-dataset-out tests show that meta-feature models do not meaningfully beat a simple baseline.

What stands out is the disciplined use of public benchmark results and proper multiple-testing correction. The authors report the exact pattern of surviving associations and the quantitative prediction outcomes, which gives a clearer picture than earlier work that often stopped at uncorrected correlations.

The soft spot is sample size. Fifty-one datasets is modest once you correct across dozens of meta-features; moderate associations can easily disappear under FDR. The chosen meta-features are model-agnostic and standard, but they may simply miss the dataset properties that actually drive the gaps. The paper treats the TabArena performance numbers as fixed, which is fair for this analysis but leaves open whether noise in those estimates affects the conclusions.

This work is aimed at people building or evaluating meta-learning pipelines for tabular model selection. It supplies a concrete negative benchmark result rather than a new method. A serious editor should send it to review because the statistical controls are appropriate for the claim being made and the data source is public.

Referee Report

2 major / 2 minor

Summary. The manuscript analyzes performance gaps between model families (NN vs. trees, non-foundation vs. foundation models, TabICLv2 vs. TabPFN) on the 51 TabArena tabular datasets using model-agnostic meta-features. It applies FDR-controlled statistical tests to identify associations and conducts leave-one-dataset-out validation to assess whether meta-feature predictors can explain gaps better than a simple baseline. The central claims are that no meta-features survive FDR for NN-tree gaps, one association appears for foundation-model gaps but fails to generalize in LODO, one robust association improves prediction for TabICLv2 vs. TabPFN, and overall meta-feature predictors do not meaningfully outperform the baseline, indicating high dataset heterogeneity and limited robustness of global meta-feature approaches.

Significance. If the negative results hold after addressing power and coverage concerns, the work would usefully demonstrate the limitations of standard meta-features for explaining tabular model differences, reinforcing the value of the TabArena benchmark and strict controls (FDR, LODO) for producing reliable negative findings. This could shift research away from global meta-feature explanations toward more localized or structural descriptors.

major comments (2)

Abstract and Results: the central negative claim that 'no meta-feature survives false discovery control' for NN-vs-tree gaps (and the broader conclusion on lack of robustness) is load-bearing, yet with n=51 the manuscript does not report achieved statistical power or minimum detectable correlation after FDR correction across the tested meta-features; moderate associations (|r|≈0.35) could plausibly go undetected.
Leave-one-dataset-out analysis: the claim that meta-feature predictors 'fail to improve meaningfully over a simple baseline' is central to the overall conclusion, but the manuscript does not specify the exact baseline model, the meta-feature set size, or the quantitative gap in predictive performance (e.g., R² or MAE differences), making it impossible to judge whether the failure is decisive or merely modest.
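The power concern in the first major comment can be made tangible with a rough calculation. The sketch below uses the Fisher z approximation for a Pearson correlation. The test statistic actually used in the paper, the number of meta-features (`n_tests=30` is a hypothetical value), and the use of the most stringent Benjamini-Hochberg threshold alpha / m as a conservative stand-in for FDR control are all assumptions.

```python
from math import atanh, sqrt, tanh
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def _critical_z(alpha, n_tests):
    # Two-sided critical value at the per-test level alpha / n_tests.
    return _STD_NORMAL.inv_cdf(1 - alpha / (2 * n_tests))


def min_detectable_r(n, alpha=0.05, power=0.8, n_tests=1):
    """Smallest |r| detectable with the requested power (Fisher z approx.)."""
    z_power = _STD_NORMAL.inv_cdf(power)
    return tanh((_critical_z(alpha, n_tests) + z_power) / sqrt(n - 3))


def approx_power(r, n, alpha=0.05, n_tests=1):
    """Approximate power to detect a true correlation r with n datasets.

    The negligible contribution of the opposite tail is ignored.
    """
    return _STD_NORMAL.cdf(atanh(abs(r)) * sqrt(n - 3) - _critical_z(alpha, n_tests))


# n = 51 comes from the source; n_tests = 30 is a hypothetical count.
r_single = min_detectable_r(51)
r_corrected = min_detectable_r(51, n_tests=30)
power_035 = approx_power(0.35, 51)
```

Under these assumptions a single uncorrected test at n=51 already needs a correlation of roughly 0.38 for 80% power, and the requirement rises once the threshold is tightened for multiple testing, which is consistent with the referee's worry about associations near |r|≈0.35.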

minor comments (2)

Methods: the exact definitions and selection criteria for the model-agnostic meta-features should be stated explicitly (perhaps in a dedicated table) to allow readers to assess coverage of potential confounders such as distributional or structural properties.
Data and Experimental Setup: clarify whether the TabArena performance estimates include variance or confidence intervals, as stability of these estimates is assumed in the LODO evaluation.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the need for greater statistical transparency. We agree that additional details on power and baseline specification will improve the manuscript and will incorporate them in revision. These changes address the concerns without altering our core negative findings on the limited explanatory power of meta-features.

Referee: Abstract and Results: the central negative claim that 'no meta-feature survives false discovery control' for NN-vs-tree gaps (and the broader conclusion on lack of robustness) is load-bearing, yet with n=51 the manuscript does not report achieved statistical power or minimum detectable correlation after FDR correction across the tested meta-features; moderate associations (|r|≈0.35) could plausibly go undetected.

Authors: We acknowledge the value of reporting achieved power for interpreting our negative results. While FDR control was our primary safeguard against false positives, we will add a post-hoc power analysis in the revised Results section. It will report the minimum detectable correlation and the achieved power at reference effect sizes (e.g., |r| = 0.3 and 0.4), given n=51 and the number of meta-features tested under FDR. The addition will clarify the strength of evidence for the absence of robust associations.

Revision: yes

Referee: Leave-one-dataset-out analysis: the claim that meta-feature predictors 'fail to improve meaningfully over a simple baseline' is central to the overall conclusion, but the manuscript does not specify the exact baseline model, the meta-feature set size, or the quantitative gap in predictive performance (e.g., R² or MAE differences), making it impossible to judge whether the failure is decisive or merely modest.

Authors: We will revise the manuscript to explicitly define the baseline (a constant predictor using the mean gap from training folds), state the exact number of meta-features employed, and report the quantitative performance gaps (including R² and MAE differences) between the meta-feature model and baseline in the LODO experiments. These details will be added to the Methods and Results to allow precise evaluation of the improvement magnitude.

Revision: yes

Circularity Check

0 steps flagged

No circularity in empirical analysis of external benchmarks

The paper reports statistical associations (or their absence) between precomputed TabArena performance gaps and a fixed set of model-agnostic meta-features on 51 datasets. All load-bearing steps are direct applications of standard tests (FDR-controlled correlations, leave-one-dataset-out regression) to external data; no equation, parameter fit, or claim reduces to a self-definition or to a prior result whose only support is a self-citation. The negative findings on meta-feature explanatory power are therefore falsifiable outcomes of the chosen data and tests rather than artifacts of the analysis construction itself.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the accuracy of the TabArena performance numbers and the assumption that the chosen meta-features plus standard statistical tests are sufficient to detect explanatory relationships if they exist.

axioms (2)

Domain assumption: TabArena benchmark results provide stable, unbiased estimates of model performance on the included datasets. All gap analyses and prediction checks are performed on these results.
Domain assumption: The selected model-agnostic meta-features capture the dataset properties relevant to model-family performance differences. The paper relates performance gaps to these descriptors.
