Distributional Regression with Tabular Foundation Models: Evaluating Probabilistic Predictions via Proper Scoring Rules
Modern tabular foundation models such as TabPFN and TabICL naturally produce full predictive distributions, while the benchmarks used to evaluate them (TabArena, TALENT, and others) still rely almost exclusively on point-estimate metrics (RMSE, $R^2$). This mismatch implicitly rewards machine learning models or pipelines that elicit a good conditional mean while ignoring the quality of the predictive distribution. We make the case for using proper scoring rules for training, fine-tuning, and benchmarking (ranking) of tabular foundation models. Although all strictly proper scoring rules are theoretically equivalent at the population level, they may differ on finite data: We demonstrate analytically and empirically that different scoring rules can induce different inductive biases during finite-sample optimization, leading to different model performance. We validate this finding by running fine-tuning experiments with TabPFN and TabICL using different scoring rules for various data sets, revealing non-trivial interactions between training objectives and evaluation metrics. Our results show that practitioners can adapt tabular foundation models to task-specific scoring objectives, and that the choice of scoring rule can influence model behavior in practice.
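The mismatch described above can be illustrated with a minimal sketch. It is not the paper's experimental setup: it assumes Gaussian predictive distributions, synthetic standard-normal targets, and hypothetical helper names (`gaussian_crps`, `gaussian_log_score`, `rmse`). Two forecasts share the same conditional mean, so RMSE cannot tell them apart, while two strictly proper scoring rules (CRPS and the negative log-likelihood, both oriented so that lower is better) penalize the overconfident one.

```python
import math
import numpy as np

def gaussian_crps(y, mu, sigma):
    """Closed-form CRPS of a Gaussian forecast N(mu, sigma^2); lower is better."""
    z = (y - mu) / sigma
    pdf = np.exp(-0.5 * z**2) / math.sqrt(2 * math.pi)
    cdf = 0.5 * (1.0 + np.vectorize(math.erf)(z / math.sqrt(2)))
    return sigma * (z * (2 * cdf - 1) + 2 * pdf - 1 / math.sqrt(math.pi))

def gaussian_log_score(y, mu, sigma):
    """Negative log-likelihood of y under N(mu, sigma^2); lower is better."""
    return 0.5 * np.log(2 * math.pi * sigma**2) + 0.5 * ((y - mu) / sigma) ** 2

def rmse(y, mu):
    """Point-estimate metric: it only sees the predicted mean."""
    return float(np.sqrt(np.mean((y - mu) ** 2)))

# Synthetic targets (assumption): y ~ N(0, 1).
rng = np.random.default_rng(0)
y = rng.normal(loc=0.0, scale=1.0, size=5000)
mu = np.zeros_like(y)  # both forecasts predict the same mean

# Same mean, different predictive spread.
scores = {}
for name, sigma in [("calibrated", 1.0), ("overconfident", 0.3)]:
    scores[name] = {
        "rmse": rmse(y, mu),
        "crps": float(np.mean(gaussian_crps(y, mu, sigma))),
        "log_score": float(np.mean(gaussian_log_score(y, mu, sigma))),
    }

for name, s in scores.items():
    print(name, {k: round(v, 3) for k, v in s.items()})
```

In this toy setting both scoring rules rank the calibrated forecast first, but they penalize the overconfident forecast by very different amounts, which hints at why the choice of rule can matter when it is used as a finite-sample training objective.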
Forward citations
Cited by 1 Pith paper
Beyond IID: How General Are Tabular Foundation Models, Really?
Tabular foundation models excel on tiny- to medium-sized IID data but are outperformed by traditional tree-based and deep learning models on non-IID, large, and high-dimensional datasets, based on evaluations across 1...