Ensembling Tabular Foundation Models - A Diversity Ceiling And A Calibration Trap
Pith reviewed 2026-05-20 12:54 UTC · model grok-4.3
The pith
Six tabular foundation models are nearly redundant, so ensembles add at most 0.18% accuracy at 253 times the cost.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Six modern TFMs form a near-redundant pool: their mean pairwise Q-statistic is 0.961, close enough to 1 that any convex combination is bounded above. Benchmarking six ensemble strategies over six TFMs on 153 OpenML classification tasks shows that the best ensemble, two-level cascade stacking, buys +0.18% accuracy over the strongest single TFM at 253 times the compute. Stacking with a logistic-regression meta-learner is competitive on accuracy and ROC-AUC but has the worst log-loss rank among the ensembles: the meta-learner gains accuracy by sharpening class boundaries, which destroys calibration. A Friedman and Nemenyi analysis places three ensembles and the best base TFM in a single equivalence group; the three other ensembles are significantly worse than the best base.
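The calibration effect described above can be illustrated with a toy sketch. It is not the paper's meta-learner: the `sharpen` helper, the temperature value, and the ten hand-made predictions are all assumptions chosen for illustration. It shows only that pushing well-calibrated probabilities toward 0 or 1 can leave accuracy unchanged while making log-loss worse.

```python
import math

def accuracy(probs, labels):
    """Fraction of samples where thresholding P(y=1) at 0.5 matches the label."""
    return sum((p >= 0.5) == bool(y) for p, y in zip(probs, labels)) / len(labels)

def log_loss(probs, labels, eps=1e-12):
    """Mean binary cross-entropy; probabilities are clipped away from 0 and 1."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1 - eps)
        total += -math.log(p) if y else -math.log(1 - p)
    return total / len(labels)

def sharpen(p, temperature):
    """Hypothetical sharpening step: a temperature below 1 pushes p toward
    0 or 1 but never moves it across the 0.5 decision threshold."""
    a = p ** (1 / temperature)
    b = (1 - p) ** (1 / temperature)
    return a / (a + b)

# Toy, perfectly calibrated predictions: P(y=1) = 0.7 and 7 of 10 labels are 1.
labels = [1] * 7 + [0] * 3
probs = [0.7] * 10
sharp = [sharpen(p, temperature=0.25) for p in probs]

print("accuracy:", accuracy(probs, labels), "->", accuracy(sharp, labels))
print("log-loss:", round(log_loss(probs, labels), 3), "->", round(log_loss(sharp, labels), 3))
```

In this toy case the three confident mistakes dominate the sharpened log-loss, which is the general reason an accuracy-oriented combiner can rank poorly on a probabilistic metric.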
What carries the argument
The argument rests on the mean pairwise Q-statistic, which quantifies agreement between model predictions and thereby bounds the diversity available for ensembling.
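For readers unfamiliar with the measure, here is a minimal sketch of the pairwise Q-statistic as defined by Kuncheva and Whitaker (reference [21] in the list below). For two classifiers, with N11 samples both get right, N00 both get wrong, and N10, N01 where exactly one is right, Q = (N11·N00 − N01·N10) / (N11·N00 + N01·N10). The model names and correctness flags are invented toy data, and the paper's exact aggregation across tasks is not given here, so the averaging step is an assumption.

```python
from itertools import combinations

def q_statistic(correct_a, correct_b):
    """Yule's Q for two classifiers, from per-sample correctness flags (1 = correct)."""
    n11 = n00 = n10 = n01 = 0
    for a, b in zip(correct_a, correct_b):
        if a and b:
            n11 += 1          # both correct
        elif not a and not b:
            n00 += 1          # both wrong
        elif a:
            n10 += 1          # only the first classifier is correct
        else:
            n01 += 1          # only the second classifier is correct
    denom = n11 * n00 + n01 * n10
    if denom == 0:
        return float("nan")   # Q is undefined when both products are zero
    return (n11 * n00 - n01 * n10) / denom

def mean_pairwise_q(correct_by_model):
    """Average Q over all unordered pairs of models (assumed aggregation)."""
    pairs = list(combinations(correct_by_model, 2))
    values = [q_statistic(correct_by_model[a], correct_by_model[b]) for a, b in pairs]
    return sum(values) / len(values)

# Hypothetical correctness flags for three models on ten samples.
toy = {
    "model_a": [1, 1, 1, 1, 1, 1, 1, 0, 0, 0],
    "model_b": [1, 1, 1, 1, 1, 1, 0, 1, 0, 0],
    "model_c": [1, 1, 1, 1, 1, 0, 1, 0, 1, 0],
}
print(round(mean_pairwise_q(toy), 3))
```

Q near 1 means two models tend to be right and wrong on the same samples, and Q near 0 means their errors are close to independent. The sketch does not handle undefined pairs specially, so a single `nan` would propagate into the mean.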
Load-bearing premise
The six chosen TFMs and 153 OpenML tasks sufficiently represent the space of current tabular foundation models and tasks.
What would settle it
A new TFM whose predictions show a mean pairwise Q-statistic below 0.85 with the existing six on a comparable set of tasks would falsify the near-redundancy claim.
Tabular foundation models (TFMs) now match or beat tuned gradient-boosted trees on a growing fraction of tabular tasks, but no single TFM wins on every dataset. Ensembling is the go-to fix here, and it works less well than expected. Six modern TFMs form a near-redundant pool: their mean pairwise Q-statistic is 0.961, close enough to 1 that any convex combination is bounded above. We benchmark six ensemble strategies over six TFMs on 153 OpenML classification tasks. The best ensemble, two-level cascade stacking, buys +0.18% accuracy over the strongest single TFM at 253× the compute. A Friedman and Nemenyi analysis places three ensembles and the best base TFM in a single equivalence group; three other ensembles are significantly worse than the best base. Stacking with a logistic-regression meta-learner is the most striking case: competitive accuracy and ROC-AUC, the worst log-loss rank among the ensembles. The meta-learner improves accuracy by sharpening class boundaries, which destroys calibration. We recommend greedy selection as the practical default.
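The abstract does not spell out its greedy selection procedure. The sketch below assumes it resembles forward ensemble selection with replacement in the style of Caruana et al. (reference [13]), scored by validation log-loss. The function name, the stopping rule (stop at the first round with no improvement), and the toy validation probabilities are all illustrative assumptions.

```python
import math

def log_loss(probs, labels, eps=1e-12):
    """Mean binary cross-entropy on P(y=1) predictions."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1 - eps)
        total += -math.log(p) if y else -math.log(1 - p)
    return total / len(labels)

def greedy_selection(val_probs, labels, max_rounds=10):
    """Repeatedly add (with replacement) the model whose inclusion most lowers
    the validation log-loss of the uniformly averaged ensemble."""
    chosen = []
    running_sum = [0.0] * len(labels)
    best_loss = float("inf")
    for _ in range(max_rounds):
        best_name, best_candidate_loss = None, best_loss
        size = len(chosen) + 1
        for name, probs in val_probs.items():
            averaged = [(s + p) / size for s, p in zip(running_sum, probs)]
            loss = log_loss(averaged, labels)
            if loss < best_candidate_loss:
                best_name, best_candidate_loss = name, loss
        if best_name is None:
            break  # simplification: stop when no candidate improves the loss
        chosen.append(best_name)
        running_sum = [s + p for s, p in zip(running_sum, val_probs[best_name])]
        best_loss = best_candidate_loss
    return chosen, best_loss

# Hypothetical validation predictions for three base models.
labels = [1, 0, 1, 0]
val_probs = {
    "tfm_a": [0.9, 0.2, 0.6, 0.4],
    "tfm_b": [0.6, 0.4, 0.9, 0.1],
    "tfm_c": [0.5, 0.5, 0.5, 0.5],
}
chosen, loss = greedy_selection(val_probs, labels)
print(chosen, round(loss, 4))
```

Because a model is only added when it lowers the validation loss, the selected ensemble is never worse than the best single model on the validation set, and uninformative members such as `tfm_c` are simply never picked.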
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that six modern tabular foundation models form a near-redundant pool (mean pairwise Q-statistic of 0.961), such that ensembles yield negligible gains. Benchmarking six ensemble strategies on 153 OpenML classification tasks shows the best performer (two-level cascade stacking) improves accuracy by only +0.18% over the strongest single TFM at 253× compute cost. Friedman-Nemenyi analysis places three ensembles and the best base model in one equivalence group, while three ensembles perform significantly worse; stacking with logistic regression is highlighted as creating a calibration trap by improving accuracy at the expense of log-loss.
Significance. If the empirical findings hold, the work has clear practical value for the tabular ML community by documenting a diversity ceiling among current TFMs and recommending greedy selection over complex ensembling. Credit is due for the direct benchmarking against published TFMs on 153 tasks and the use of Friedman-Nemenyi post-hoc tests to establish equivalence groups rather than relying on raw averages.
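As background for the Friedman-Nemenyi analysis mentioned above, the following sketch computes average ranks, the Friedman chi-square statistic, and the Nemenyi critical difference. It assumes seven compared methods (six ensembles plus the best base TFM), which is an inference from the summary rather than a stated fact. The critical values are the commonly tabulated Nemenyi constants for alpha = 0.05, and the score table is invented toy data.

```python
import math

# Commonly tabulated Nemenyi critical values for alpha = 0.05, keyed by method count.
Q_ALPHA_005 = {2: 1.960, 3: 2.343, 4: 2.569, 5: 2.728, 6: 2.850, 7: 2.949}

def ranks_for_task(scores):
    """Rank methods on one task: highest score gets rank 1, ties share the mean rank."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for idx in order[i:j + 1]:
            ranks[idx] = shared
        i = j + 1
    return ranks

def average_ranks(score_table):
    """score_table[t][m] is the score of method m on task t."""
    n_tasks, n_methods = len(score_table), len(score_table[0])
    totals = [0.0] * n_methods
    for row in score_table:
        for m, r in enumerate(ranks_for_task(row)):
            totals[m] += r
    return [t / n_tasks for t in totals]

def friedman_chi2(avg_ranks, n_tasks):
    """Friedman statistic computed from average ranks."""
    k = len(avg_ranks)
    return 12 * n_tasks / (k * (k + 1)) * (sum(r * r for r in avg_ranks) - k * (k + 1) ** 2 / 4)

def nemenyi_cd(k, n_tasks, q_alpha=None):
    """Two methods differ significantly if their average ranks differ by more than this."""
    q = Q_ALPHA_005[k] if q_alpha is None else q_alpha
    return q * math.sqrt(k * (k + 1) / (6 * n_tasks))

toy_scores = [[0.90, 0.80, 0.70], [0.85, 0.80, 0.60], [0.90, 0.88, 0.70]]
toy_ranks = average_ranks(toy_scores)
print(toy_ranks, friedman_chi2(toy_ranks, n_tasks=3))
print(round(nemenyi_cd(k=7, n_tasks=153), 3))  # assumed setting: 7 methods, 153 tasks
```

Under the assumed setting, methods whose average ranks differ by less than the printed critical difference fall into the same equivalence group.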
major comments (3)
- [Experimental Setup] The central generalization that the observed Q-statistic of 0.961 demonstrates an inherent diversity ceiling (rather than a property of this particular sample) rests on the representativeness of the six chosen TFMs and 153 OpenML tasks; no explicit selection criteria or coverage argument is provided to support extension beyond this pool.
- [Results] No error bars, standard deviations, or confidence intervals accompany the reported accuracy gains, Q-statistic, or Friedman-Nemenyi ranks; this weakens the claim that the +0.18% improvement and equivalence-group findings are robust rather than sensitive to sampling variability.
- [Results] The manuscript does not verify that post-hoc selection of the best ensemble among the six tested strategies did not inflate the reported gains; without a pre-specified primary ensemble or correction for multiple comparisons, the conclusion that ensembles are bounded above is harder to interpret.
minor comments (2)
- [Abstract] The abstract states that three ensembles are significantly worse but does not name them or report the exact Nemenyi critical differences or p-values.
- [Experimental Setup] Hyperparameter details for the base TFMs and the meta-learners in the stacking variants are not provided, limiting reproducibility of the calibration-trap observation.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed review. We address each major comment below and describe the corresponding revisions planned for the manuscript.
- Referee: [Experimental Setup] The central generalization that the observed Q-statistic of 0.961 demonstrates an inherent diversity ceiling (rather than a property of this particular sample) rests on the representativeness of the six chosen TFMs and 153 OpenML tasks; no explicit selection criteria or coverage argument is provided to support extension beyond this pool.
  Authors: The six TFMs were selected as the leading publicly available models with published strong performance on tabular tasks at the time of writing. The 153 tasks are the standard OpenML-CC18 classification benchmark used across multiple prior tabular studies for comparability. We will add an explicit subsection on selection criteria for both models and tasks, including a short coverage argument addressing dataset size, dimensionality, and class balance to better support generalization of the diversity-ceiling observation.
  Revision: yes
- Referee: [Results] No error bars, standard deviations, or confidence intervals accompany the reported accuracy gains, Q-statistic, or Friedman-Nemenyi ranks; this weakens the claim that the +0.18% improvement and equivalence-group findings are robust rather than sensitive to sampling variability.
  Authors: We agree that variability estimates would improve robustness. The revised manuscript will include bootstrap 95% confidence intervals for the reported accuracy gains and mean Q-statistic, as well as standard deviations of the average ranks across the 153 tasks.
  Revision: yes
- Referee: [Results] The manuscript does not verify that post-hoc selection of the best ensemble among the six tested strategies did not inflate the reported gains; without a pre-specified primary ensemble or correction for multiple comparisons, the conclusion that ensembles are bounded above is harder to interpret.
  Authors: The study evaluated a range of strategies to determine whether any could meaningfully exceed single-model performance. The key result is that even the strongest observed ensemble yields only +0.18% and that three ensembles are significantly worse. To address the post-hoc concern we will designate two-level cascade stacking as the primary ensemble in the revised text, note the exploratory status of the remaining comparisons, and apply a Bonferroni correction to the Friedman-Nemenyi post-hoc tests.
  Revision: partial
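The bootstrap confidence intervals promised in the second response could be computed along the following lines. This is a percentile-bootstrap sketch over per-task accuracy differences. The resampling unit (tasks), the number of resamples, and the toy gain values are assumptions, not details taken from the paper or the rebuttal.

```python
import random

def bootstrap_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of `values`,
    resampling entries (here: tasks) with replacement."""
    rng = random.Random(seed)
    n = len(values)
    means = []
    for _ in range(n_resamples):
        sample = [values[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper

# Hypothetical per-task accuracy gains (ensemble minus best single model).
toy_gains = [0.004, -0.001, 0.002, 0.0, 0.003, -0.002, 0.001, 0.005, -0.003, 0.002]
low, high = bootstrap_ci(toy_gains)
print(f"mean gain {sum(toy_gains) / len(toy_gains):.4f}, 95% CI [{low:.4f}, {high:.4f}]")
```

An interval that includes zero would indicate that the mean gain cannot be distinguished from no improvement at the chosen level, which is the kind of evidence the referee's second comment asks for.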
Circularity Check
No significant circularity in empirical benchmarking study
The paper reports direct empirical measurements: pairwise Q-statistics computed from the six TFMs' predictions on 153 external OpenML tasks, plus accuracy/ROC-AUC/log-loss ranks for six ensemble strategies versus single models. No equations, fitted parameters, or derivations are presented that reduce to the inputs by construction; the Q-statistic of 0.961 and the +0.18% ensemble gain are observed quantities, not self-defined or statistically forced outputs. The study relies on published TFMs and public datasets rather than any self-citation chain or ansatz, satisfying the criteria for a self-contained empirical result.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Mean pairwise Q-statistic near 1 implies convex combinations cannot exceed the best base model by more than a small margin
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  "Six modern TFMs form a near-redundant pool: their mean pairwise Q-statistic is 0.961... any convex combination is bounded above."
- IndisputableMonolith/Foundation/BranchSelection.lean: branch_selection (unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  "The best ensemble, two-level cascade stacking, buys +0.18% accuracy... at 253× the compute."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Noah Hollmann, Samuel Müller, Katharina Eggensperger, and Frank Hutter. TabPFN: A transformer that solves small tabular classification problems in a second. In International Conference on Learning Representations (ICLR), 2023
- [2] Noah Hollmann, Samuel Müller, Lennart Purucker, Arjun Krishnakumar, Max Körfer, Shi Bin Hoo, Robin Tibor Schirrmeister, and Frank Hutter. Accurate predictions on small data with a tabular foundation model. Nature, 637:319–326, 2025
- [3] Jingang Qu, David Holzmüller, Gaël Varoquaux, and Marine Le Morvan. TabICL: A tabular foundation model for in-context learning on large data. In International Conference on Machine Learning (ICML), 2025
- [4] Xiyuan Zhang, Danielle C. Maddix, Junming Yin, Nick Erickson, Abdul Fatir Ansari, Boran Han, Shuai Zhang, Leman Akoglu, Christos Faloutsos, Michael W. Mahoney, Cuixiong Hu, Huzefa Rangwala, George Karypis, and Bernie Wang. Mitra: Mixed synthetic priors for enhancing tabular foundation models. arXiv preprint arXiv:2510.21204, 2025
- [5] Mohamed Bouadi, Pratinav Seth, Aditya Tanna, and Vinay Kumar Sankarapu. Orion-Bix: Bi-Axial attention for tabular in-context learning. In Proceedings of the ACM Web Conference 2026, WWW ’26, New York, NY, USA, 2026. Association for Computing Machinery
- [7] Junwei Ma, Valentin Thomas, Rasa Hosseinzadeh, Hamidreza Kamkari, Alex Labach, Jesse C Cresswell, Keyvan Golestan, Guangwei Yu, Maksims Volkovs, and Anthony L Caterini. TabDPT: Scaling tabular foundation models. arXiv preprint arXiv:2410.18164, 2024
- [8] Myung Jun Kim, Léo Grinsztajn, and Gaël Varoquaux. Carte: Pretraining and transfer for tabular learning, 2024
- [9] Anurag Garg, Muhammad Ali, Noah Hollmann, Lennart Purucker, Samuel Müller, and Frank Hutter. Real-TabPFN: Improving tabular foundation models via continued pre-training with real-world data. arXiv preprint arXiv:2507.03971, 2025
- [10] Nick Erickson, Lennart Purucker, Andrej Tschalzev, David Holzmüller, Prateek Mutalik Desai, David Salinas, and Frank Hutter. TabArena: A living benchmark for machine learning on tabular data. In Advances in Neural Information Processing Systems (NeurIPS) Datasets and Benchmarks Track, 2025
- [11] Leo Breiman. Bagging predictors. Machine Learning, 24:123–140, 1996
- [12] Thomas G. Dietterich. Ensemble methods in machine learning. In Multiple Classifier Systems, volume 1857 of Lecture Notes in Computer Science, pages 1–15. Springer, 2000
- [13] Rich Caruana, Alexandru Niculescu-Mizil, Geoff Crew, and Alex Ksikes. Ensemble selection from libraries of models. In International Conference on Machine Learning (ICML), 2004
- [14] Samuel Müller, Noah Hollmann, Sebastian Pineda Arango, Josif Grabocka, and Frank Hutter. Transformers can do Bayesian inference. In International Conference on Learning Representations, 2022
- [15] Yufeng Zhang, Fengzhuo Zhang, Zhuoran Yang, and Zhaoran Wang. What and how does in-context learning learn? Bayesian model averaging, parameterization, and generalization. In Yingzhen Li, Stephan Mandt, Shipra Agrawal, and Emtiyaz Khan, editors, Proceedings of The 28th International Conference on Artificial Intelligenc...
- [16] Jannis Maier and Lennart Purucker. HAPEns: Hardware-aware post-hoc ensembling for tabular data. arXiv preprint arXiv:2603.10582, 2026
- [17] Yury Gorishniy, Akim Kotelnikov, and Artem Babenko. TabM: Advancing tabular deep learning with parameter-efficient ensembling. In International Conference on Learning Representations (ICLR), 2025
- [18] Aditya Tanna, Pratinav Seth, Mohamed Bouadi, and Vinay Kumar Sankarapu. Exploring fine-tuning for tabular foundation models. In Proceedings of the ACM Web Conference 2026, WWW ’26, New York, NY, USA, 2026. Association for Computing Machinery
- [19] David H. Wolpert. Stacked generalization. Neural Networks, 5(2):241–259, 1992
- [20] Kai Ming Ting and Ian H. Witten. Issues in stacked generalization. Journal of Artificial Intelligence Research, 10:271–289, 1999
- [21] Ludmila I. Kuncheva and Christopher J. Whitaker. Measures of diversity in classifier ensembles and their relationship with the ensemble accuracy. Machine Learning, 51(2):181–207, 2003
- [22] Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. On calibration of modern neural networks. In International Conference on Machine Learning (ICML), 2017
- [23] Alexandru Niculescu-Mizil and Rich Caruana. Predicting good probabilities with supervised learning. In Proceedings of the 22nd International Conference on Machine Learning, volume 119 of ACM International Conference Proceeding Series, pages 625–632. ACM, 2005
- [24] Yonatan Geifman and Ran El-Yaniv. Selective classification for deep neural networks. In Advances in Neural Information Processing Systems (NeurIPS), 2017
- [25] Shiori Sagawa, Pang Wei Koh, Tatsunori B. Hashimoto, and Percy Liang. Distributionally robust neural networks for group shifts: On the importance of regularization for worst-case generalization. In International Conference on Learning Representations (ICLR), 2020
- [26] Nick Erickson, Jonas Mueller, Alexander Shirkov, Hang Zhang, Pedro Larroy, Mu Li, and Alexander Smola. AutoGluon-Tabular: Robust and accurate AutoML for structured data. arXiv preprint arXiv:2003.06505, 2020
- [27] Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems (NeurIPS), 2017
- [28] Aditya Tanna, Pratinav Seth, Mohamed Bouadi, Utsav Avaiya, and Vinay Kumar Sankarapu. TabTune: A unified library for inference and fine-tuning tabular foundation models. arXiv preprint arXiv:2511.02802, 2025
- [29] Bernd Bischl, Giuseppe Casalicchio, Matthias Feurer, Pieter Gijsbers, Frank Hutter, Michel Lang, Rafael G. Mantovani, Jan N. van Rijn, and Joaquin Vanschoren. OpenML benchmarking suites. In Advances in Neural Information Processing Systems (NeurIPS) Track on Datasets and Benchmarks, 2021
- [30] Si-Yang Liu, Hao-Run Cai, Qi-Le Zhou, Huai-Hong Yin, Tao Zhou, Jun-Peng Jiang, and Han-Jia Ye. Talent: A tabular analytics and learning toolbox. Journal of Machine Learning Research, 26(226):1–16, 2025
- [31] Duncan McElfresh, Sujay Khandagale, Jonathan Valverde, Vishak Prasad C, Benjamin Feuer, Chinmay Hegde, Ganesh Ramakrishnan, Micah Goldblum, and Colin White. When do neural nets outperform boosted trees on tabular data?, 2023
- [32] Léo Grinsztajn, Klemens Floge, Oscar Key, Felix Birkel, Philipp Jund, Brendan Roof, Benjamin Jager, Dominik Safaric, Simone Alessi, Adrian Hayler, Mihir Manium, Rose Yu, Felix Jablonski, Shi Bin Hoo, Anurag Garg, Jake Robertson, Magnus Buhler, Vladyslav Moroshan, Lennart Purucker, Clara Cornu, Lilly Charlotte Wehrhahn, Alessandro Bonetto, Bernhard Scholk... TabPFN-2.5: Advancing the State of the Art in Tabular Foundation Models
- [33] Xingxuan Zhang et al. LimiX: Unleashing structured-data modeling capability for generalist intelligence. arXiv preprint arXiv:2509.03505, 2025
- [34] Mohamed Bouadi, Pratinav Seth, Aditya Tanna, and Vinay Kumar Sankarapu. Orion-MSP: Multi-scale sparse attention for tabular in-context learning. arXiv preprint arXiv:2511.02818, 2025
- [35] Glenn W. Brier. Verification of forecasts expressed in terms of probability. Monthly Weather Review, 78(1):1–3, 1950
- [36] Milton Friedman. A comparison of alternative tests of significance for the problem of m rankings. The Annals of Mathematical Statistics, 11(1):86–92, 1940
- [37] Peter Nemenyi. Distribution-Free Multiple Comparisons. PhD thesis, Princeton University, 1963
- [38] Frank Wilcoxon. Individual comparisons by ranking methods. Biometrics Bulletin, 1(6):80–83, 1945
A Method-name glossary
The body and Tables 1 and 3 use compact short-form labels; Figures 3 and 4 render the same methods in long form. Table 2 reconciles the two.
Table 2: Short-form labels used in prose and Tables 1 and 3...