pith. machine review for the scientific record.

arxiv: 2605.06343 · v1 · submitted 2026-05-07 · 💻 cs.AI

Recognition: unknown

Mind the Gap? A Distributional Comparison of Real and Synthetic Priors for Tabular Foundation Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 09:51 UTC · model grok-4.3

classification 💻 cs.AI
keywords tabular foundation models · synthetic priors · distributional comparison · TabICL · pre-training data · generalization · data coverage · tabular data

The pith

The distributional gap between synthetic and real tabular data has no detectable effect on TabICL's generalization performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper compares three pre-training corpora for tabular foundation models: web-scraped tables from T4, curated tables from TabFM, and the synthetic TabICL prior. It characterizes each using aggregate features over whole tables, columns, and correlations, then measures differences with discriminator AUCs and k-NN coverage. The synthetic prior sits in a narrow region of real table space that hyperparameter tuning across more than 86 thousand configurations cannot expand. Real curated and web-scraped corpora turn out broadly interchangeable in this feature space. The central finding is that this clear distributional mismatch produces no detectable effect on downstream performance, whether judged by feature proximity or by the model's own internal representations.
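
A minimal sketch of the discriminator-AUC test, assuming each corpus has already been reduced to one aggregate-feature vector per table; the gradient-boosted classifier and five-fold scheme are illustrative choices, not necessarily the authors' exact setup. An out-of-fold AUC near 0.5 means the corpora are indistinguishable in this feature space; near 1.0, cleanly separable.

```python
# Sketch of a two-sample discriminator test between two corpora, each
# given as a (n_tables, n_features) matrix of aggregate features.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict

def discriminator_auc(feats_a: np.ndarray, feats_b: np.ndarray, seed: int = 0) -> float:
    """Out-of-fold AUC of a classifier trained to separate corpus A from B."""
    X = np.vstack([feats_a, feats_b])
    y = np.concatenate([np.zeros(len(feats_a)), np.ones(len(feats_b))])
    clf = GradientBoostingClassifier(random_state=seed)
    # Held-out probabilities: each table is scored by a model that never saw it.
    proba = cross_val_predict(clf, X, y, cv=5, method="predict_proba")[:, 1]
    return roc_auc_score(y, proba)
```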

Core claim

The TabICL synthetic prior occupies a narrow region of the space of real tables that cannot be closed by optimising prior hyper-parameters across more than 86 thousand configurations. Curated and web-scraped corpora are broadly interchangeable on a distributional level in feature space. Surprisingly, the distributional gap between synthetic pre-training data and real tables has no clearly detectable effect on performance under either feature-based proximity measures or TabICL's own internal representations, suggesting that coverage of the real-data distribution is not the primary driver of TabICL's generalisation.

What carries the argument

Aggregate features over whole tables, columns and correlations, compared through discriminator AUCs and k-NN coverage metrics.
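
One plausible reading of the k-NN coverage metric, in the precision-and-recall style of Kynkäänniemi et al. (reference [19] below): a real table counts as covered if it falls inside the k-th-neighbour ball of some synthetic table. The radius rule and the default k = 5, echoing the five-neighbour histograms of Figures 10-12, are assumptions rather than the paper's exact definition.

```python
# Sketch of k-NN coverage between two feature matrices.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_coverage(synth: np.ndarray, real: np.ndarray, k: int = 5) -> float:
    """Fraction of real tables inside the k-NN ball of some synthetic table."""
    # Radius of each synthetic point: distance to its k-th neighbour within
    # the synthetic set (k + 1 requested because a point is its own nearest).
    radii = NearestNeighbors(n_neighbors=k + 1).fit(synth).kneighbors(synth)[0][:, -1]
    # Nearest synthetic neighbour of each real table.
    dist, idx = NearestNeighbors(n_neighbors=1).fit(synth).kneighbors(real)
    return float(np.mean(dist[:, 0] <= radii[idx[:, 0]]))
```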

Load-bearing premise

The chosen aggregate features over whole tables, columns, and correlations capture the distributional aspects that actually drive model performance and generalization on downstream tasks.
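
To make the premise concrete, a toy featurizer of the kind it presupposes: each table collapses to one fixed-length vector of whole-table, per-column, and correlation statistics. The specific statistics are illustrative; the paper's actual feature set (for example the cumulative column histograms of Figure 5) is richer.

```python
# Toy table featurizer: illustrative statistics, not the paper's feature set.
import numpy as np
import pandas as pd

def table_features(df: pd.DataFrame) -> np.ndarray:
    num = df.select_dtypes("number")
    col_stats = []
    for col in num.columns:
        x = num[col].dropna().to_numpy()
        if len(x) == 0:
            continue
        col_stats.append([x.mean(), x.std(), *np.quantile(x, [0.1, 0.5, 0.9])])
    col_stats = np.array(col_stats) if col_stats else np.zeros((1, 5))
    corr = num.corr().to_numpy()
    off_diag = corr[~np.eye(len(corr), dtype=bool)] if len(corr) > 1 else np.zeros(1)
    return np.concatenate([
        [df.shape[0], df.shape[1]],                   # whole-table features
        col_stats.mean(axis=0),                       # column features, averaged
        col_stats.std(axis=0),                        # column-feature spread
        [np.nanmean(off_diag), np.nanstd(off_diag)],  # correlation features
    ])
```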

What would settle it

Retrain TabICL on a prior whose aggregate-feature distribution is deliberately matched to real tables and measure whether downstream task accuracy rises relative to the original synthetic prior.
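
One hypothetical way to operationalise this without touching the prior's parametric form: over-sample the prior, keep only synthetic tables whose aggregate-feature vectors land within a nearest-neighbour radius of the real corpus, and pre-train on the filtered set. The acceptance rule below is an assumption for illustration, not a procedure from the paper.

```python
# Hypothetical rejection filter for a feature-matched prior; pre-training
# on the retained tables is out of scope for this sketch.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def match_prior(synth_feats: np.ndarray, real_feats: np.ndarray, k: int = 5) -> np.ndarray:
    """Indices of synthetic tables whose features land near the real corpus."""
    nn = NearestNeighbors().fit(real_feats)
    # Acceptance radius: median distance from a real table to its k-th real
    # neighbour (k + 1 requested because each point is its own nearest).
    real_radii = nn.kneighbors(real_feats, n_neighbors=k + 1)[0][:, -1]
    radius = np.median(real_radii)
    # Keep a synthetic table only if some real table lies within that radius.
    dist = nn.kneighbors(synth_feats, n_neighbors=1)[0][:, 0]
    return np.flatnonzero(dist <= radius)
```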

Figures

Figures reproduced from arXiv: 2605.06343 by Alex O. Davies, Nirav Ajmeri, Telmo de Menezes e Silva Filho.

Figure 1. PCA projections of cumulative histograms for the Curated, Web-Scraped and Synthetic …
Figure 2. PCA projections of cumulative histograms for the curated FM dataset, the web-scraped T4 …
Figure 3. Our ablation study over the number of neighbours …
Figure 4. Feature importances, assessed from performance gains through tree splits, for each pairwise …
Figure 5. Example visualisations of our aggregate column histogram features. Left: Mean bin values …
Figure 6. Recall and Precision for optimising the ICL prior to match the FM and T4 datasets, towards …
Figure 7. Cumulative coverage curves for ICL over T4 (left) and T4 over ICL (right).
Figure 8. A histogram and a cumulative histogram of AUC values over our grid search.
Figure 9. Examples of cumulative histograms over columns randomly selected from the TALENT …
Figure 10. k-NN histograms (5 neighbours) calculated for ICL against itself and the TALENT bench…
Figure 11. k-NN histograms (5 neighbours) calculated for ICL against itself and the TALENT bench…
Figure 12. k-NN histograms (5 neighbours) calculated for ICL against itself and the TALENT bench…
Figure 13. The average rank of TabICL on TabArena, scattered against the average k-NN distance for …
Figure 14. The mean relative AUC of TabICL on TabArena, scattered against the average k-NN …
Figure 15. The worst-case relative AUC of TabICL on TabArena, scattered against the average k-NN …
original abstract

Tabular foundation models are pre-trained on one of three classes of corpus: curated datasets drawn from benchmark repositories, tables harvested at scale from the web, or synthetic tables sampled from a parametric generative prior. Despite the centrality of pre-training data to model performance, little is known about how these corpora relate to one another in distribution, and the impact this has on downstream performance. In this work we take three canonical, archetypal datasets used to train tabular foundation models: the T4 dataset represents web-scraped corpora, the TabFM dataset curated tables from Kaggle, and the TabICL dataset as the only well-used synthetic prior with publicly available parameters. We characterise each corpus using aggregate features over whole tables, columns and correlations, and compare them using discriminator AUCs and k-NN coverage metrics. We find that the TabICL synthetic prior occupies a narrow region of the space of real tables, that this mismatch cannot be closed by optimising prior hyper-parameters across more than 86 thousand configurations, and that curated and web-scraped corpora are broadly interchangeable on a distributional level in feature space. Surprisingly, the distributional gap between synthetic pre-training data and real tables has a clearly detectable effect on performance under neither feature-based proximity measures nor TabICL's own internal representations, suggesting that coverage of the real-data distribution is not the primary driver of TabICL's generalisation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript presents an empirical analysis comparing the distributional characteristics of three pre-training datasets for tabular foundation models: T4 (web-scraped), TabFM (curated from Kaggle), and TabICL (synthetic with public parameters). By extracting aggregate features from entire tables, individual columns, and correlations, and employing discriminator AUCs and k-NN coverage metrics, the authors conclude that the synthetic TabICL prior covers only a narrow subset of the real table distribution, a gap that persists despite optimizing the prior's hyperparameters over more than 86,000 configurations. Real corpora (curated and web-scraped) are found to be largely interchangeable in feature space. A key surprising result is that this distributional mismatch does not manifest in performance differences when using feature-based proximity or TabICL's internal representations, leading to the suggestion that real-data coverage is not the main factor in TabICL's generalization capabilities.

Significance. If the results hold, this work is significant for tabular foundation model research because it empirically challenges the assumption that closer distributional coverage of real data is essential for generalization. The extensive hyperparameter optimization (over 86k configurations), use of multiple comparison metrics, and public release of TabICL prior parameters are strengths that support reproducibility and allow others to build on the analysis. The finding could usefully redirect attention from data curation efforts toward other potential drivers of performance such as model architecture or optimization procedures.

major comments (1)
  1. [Abstract] The central claim that the distributional gap has 'no detectable effect on performance' under feature-based proximity measures or TabICL's internal representations (and thus that coverage is not the primary driver of generalization) is load-bearing and rests on the sufficiency of the chosen aggregate features over whole tables, columns, and correlations. If these features omit higher-order dependencies, conditional distributions, or other properties actually used by TabICL, the null performance result could reflect insensitive metrics rather than irrelevance of coverage. The manuscript should include a validation step showing that these features predict or correlate with downstream task performance differences.
minor comments (1)
  1. The abstract states 'more than 86 thousand configurations' without the exact count or the ranges of the optimized hyperparameters; adding these details would improve reproducibility.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive and detailed review. The major comment highlights a valid concern about the sufficiency of our chosen metrics for supporting the central claim. We respond point by point below and outline the revisions we will make.

point-by-point responses
  1. Referee: [Abstract] The central claim that the distributional gap has 'no detectable effect on performance' under feature-based proximity measures or TabICL's internal representations (and thus that coverage is not the primary driver of generalization) is load-bearing and rests on the sufficiency of the chosen aggregate features over whole tables, columns, and correlations. If these features omit higher-order dependencies, conditional distributions, or other properties actually used by TabICL, the null performance result could reflect insensitive metrics rather than irrelevance of coverage. The manuscript should include a validation step showing that these features predict or correlate with downstream task performance differences.

    Authors: We agree that aggregate features over tables, columns, and correlations may not exhaustively capture all higher-order dependencies or conditional distributions. However, our performance analysis is not limited to these hand-crafted features: we also evaluate proximity using TabICL's internal representations, which are produced by the model itself and therefore encode precisely the dependencies and properties that TabICL uses for its own predictions. The absence of a detectable performance effect even under this model-specific metric provides independent support for the claim that coverage of the real-data distribution is not the primary driver of generalization. We will add the requested validation step in the revised manuscript by reporting the correlation between our aggregate features and downstream task performance differences across a range of benchmark tasks, thereby demonstrating the sensitivity of the chosen metrics. revision: yes
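
A sketch of the validation the rebuttal promises, under assumed inputs: one feature-space k-NN distance to the prior and one downstream performance delta per benchmark dataset. A Spearman correlation near zero would remain ambiguous, which is the referee's point: it is consistent both with coverage not mattering and with the features being blind to what does.

```python
# Sketch of the promised validation: do feature-space distances to the
# prior predict downstream performance differences? Both inputs are
# assumed to be precomputed, one value per benchmark dataset.
import numpy as np
from scipy.stats import spearmanr

def feature_sensitivity(knn_dists: np.ndarray, perf_deltas: np.ndarray):
    """Rank correlation between distance-to-prior and performance delta.

    knn_dists:   per-dataset mean k-NN feature distance to the prior.
    perf_deltas: per-dataset TabICL performance difference on the same tasks.
    """
    rho, pval = spearmanr(knn_dists, perf_deltas)
    return rho, pval
```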

Circularity Check

0 steps flagged

No circularity detected in empirical dataset comparison

full rationale

The paper conducts an empirical characterization of three tabular corpora (T4, TabFM, TabICL) via aggregate features on whole tables, columns, and correlations, followed by direct comparisons using discriminator AUCs, k-NN coverage, hyperparameter sweeps over 86k configurations, and downstream performance checks against feature proximity and TabICL internal representations. No equations, fitted parameters renamed as predictions, self-citations serving as load-bearing uniqueness theorems, or ansatzes smuggled via prior work appear in the derivation chain. All claims reduce to observable measurements on external datasets rather than self-referential definitions or constructions, making the analysis self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The analysis rests on the domain assumption that the selected aggregate statistics adequately represent the properties relevant to tabular model training. No free parameters are fitted to produce the main claims, and no new entities are postulated.

axioms (1)
  • domain assumption: Aggregate features over tables, columns, and correlations sufficiently capture the distributional properties that matter for downstream model performance.
    These features are used to characterize and compare the three corpora and to assess proximity to real data.

pith-pipeline@v0.9.0 · 5562 in / 1299 out tokens · 57836 ms · 2026-05-08T09:51:20.063201+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

33 extracted references · 20 canonical work pages · 1 internal anchor

  1. [1]

    Optuna: A Next-generation Hyperparameter Optimization Framework

    T. Akiba, S. Sano, T. Yanase, T. Ohta, and M. Koyama. Optuna: A Next-generation Hyperparameter Optimization Framework. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD '19, pages 2623–2631, New York, NY, USA, July 2019. Association for Computing Machinery. ISBN 978-1-4503-6201-6. doi: 10.1145/32925…

  2. [2]

    RTM: Laws and a recursive generator for weighted time-evolving graphs

    L. Akoglu, M. McGlohon, and C. Faloutsos. RTM: Laws and a recursive generator for weighted time-evolving graphs. In Proceedings - IEEE International Conference on Data Mining, ICDM, pages 701–706, 2008. ISBN 978-0-7695-3502-9. doi: 10.1109/ICDM.2008.123

  3. [3]

    S. D. Bay, D. Kibler, M. J. Pazzani, and P. Smyth. The UCI KDD archive of large data sets for data mining research and experimentation. SIGKDD Explor. Newsl., 2(2):81–85, Dec. 2000. ISSN 1931-0145. doi: 10.1145/380995.381030. URL https://dl.acm.org/doi/10.1145/380995.381030

  4. [4]

    A. Borji. Pros and cons of GAN evaluation measures. Computer Vision and Image Understanding, 179:41–65, Feb. 2019. ISSN 1077-3142. doi: 10.1016/j.cviu.2018.10.009. URL https://www.sciencedirect.com/science/article/pii/S1077314218304272

  5. [5]

    F. d. Breejen, S. Bae, S. Cha, and S.-Y. Yun. Fine-tuned In-Context Learning Transformers are Excellent Tabular Data Classifiers, Jan. 2025. URL http://arxiv.org/abs/2405.13396. arXiv:2405.13396 [cs]

  6. [6]

    L. Breiman. Random Forests. Machine Learning, 45(1):5–32, Oct. 2001. ISSN 1573-0565. doi: 10.1023/A:1010933404324. URL https://doi.org/10.1023/A:1010933404324

  7. [7]

    XGBoost: A Scalable Tree Boosting System

    T. Chen and C. Guestrin. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '16, pages 785–794, New York, NY, USA, Aug. 2016. Association for Computing Machinery. ISBN 978-1-4503-4232-2. doi: 10.1145/2939672.2939785. URL https://dl.acm.org/doi/10.1145/293967…

  8. [8]

    Support-vector networks

    C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20(3):273–297, Sept. 1995. ISSN 1573-0565. doi: 10.1007/BF00994018. URL https://doi.org/10.1007/BF00994018

  10. [10]

    A. V. Dorogush, V. Ershov, and A. Gulin. CatBoost: gradient boosting with categorical features support, Oct. 2018. URL http://arxiv.org/abs/1810.11363. arXiv:1810.11363 [cs]

  11. [11]

    TabLib: A Dataset of 627M Tables with Context

    G. Eggert, K. Huo, M. Biven, and J. Waugh. TabLib: A Dataset of 627M Tables with Context, Oct. 2023. URL http://arxiv.org/abs/2310.07875. arXiv:2310.07875 [cs]

  12. [12]

    AutoGluon-Tabular: Robust and Accurate AutoML for Structured Data

    N. Erickson, J. Mueller, A. Shirkov, H. Zhang, P. Larroy, M. Li, and A. Smola. AutoGluon-Tabular: Robust and Accurate AutoML for Structured Data, Mar. 2020. URL http://arxiv.org/abs/2003.06505. arXiv:2003.06505 [stat]

  13. [13]

    Large Scale Transfer Learning for Tabular Data via Language Modeling

    J. Gardner, J. C. Perdomo, and L. Schmidt. Large Scale Transfer Learning for Tabular Data via Language Modeling. In Advances in Neural Information Processing Systems, volume 37, pages 45155–45205, Dec. 2024. doi: 10.52202/079017-1435. URL https://proceedings.neurips.cc/paper_files/paper/2024/hash/4fd5cfd2e31bebbccfa5ffa354c04bdc-Abstract-Conference.html

  14. [14]

    I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative Adversarial Nets. In Advances in Neural Information Processing Systems, volume 27. Curran Associates, Inc., 2014. URL https://proceedings.neurips.cc/paper/2014/hash/f033ed80deb0234979a61f95710dbe25-Abstract.html

  15. [15]

    Drift-resilient TabPFN: in-context learning temporal distribution shifts on tabular data

    K. Helli, D. Schnurr, N. Hollmann, S. Müller, and F. Hutter. Drift-resilient TabPFN: in-context learning temporal distribution shifts on tabular data. In Proceedings of the 38th International Conference on Neural Information Processing Systems, volume 37 of NIPS '24, pages 98742–98781, Red Hook, NY, USA, Dec. 2024. Curran Associates Inc. ISBN 979-8-3313-1438-5

  16. [16]

    Accurate predictions on small data with a tabular foundation model

    N. Hollmann, S. Müller, L. Purucker, A. Krishnakumar, M. Körfer, S. B. Hoo, R. T. Schirrmeister, and F. Hutter. Accurate predictions on small data with a tabular foundation model. Nature, 637(8045):319–326, Jan. 2025. ISSN 1476-4687. doi: 10.1038/s41586-024-08328-6. URL https://www.nature.com/articles/s41586-024-08328-6. Publisher: Nature Publishing Group

  17. [17]

    Representation Learning for Tabular Data: A Comprehensive Survey

    J.-P. Jiang, S.-Y. Liu, H.-R. Cai, Q.-L. Zhou, and H.-J. Ye. Representation Learning for Tabular Data: A Comprehensive Survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 1–20, 2026. ISSN 1939-3539. doi: 10.1109/TPAMI.2026.3657217. URL https://ieeexplore.ieee.org/abstract/document/11369258

  18. [18]

    G. Ke, Q. Meng, T. Finley, T. Wang, W. Chen, W. Ma, Q. Ye, and T.-Y. Liu. LightGBM: a highly efficient gradient boosting decision tree. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS'17, pages 3149–3157, Red Hook, NY, USA, Dec. 2017. Curran Associates Inc. ISBN 978-1-5108-6096-4. URL https://dl.acm.org/…

  19. [19]

    Improved Precision and Recall Metric for Assessing Generative Models

    T. Kynkäänniemi, T. Karras, S. Laine, J. Lehtinen, and T. Aila. Improved Precision and Recall Metric for Assessing Generative Models. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper/2019/hash/0234c510bc6d908b28c70ff313743079-Abstract.html

  20. [20]

    TALENT: A Tabular Analytics and Learning Toolbox

    S.-Y. Liu, H.-R. Cai, Q.-L. Zhou, H.-H. Yin, T. Zhou, J.-P. Jiang, and H.-J. Ye. TALENT: A Tabular Analytics and Learning Toolbox. Journal of Machine Learning Research, 26(226):1–16, 2025. ISSN 1533-7928. URL http://jmlr.org/papers/v26/25-0512.html

  21. [21]

    M. F. Naeem, S. J. Oh, Y. Uh, Y. Choi, and J. Yoo. Reliable Fidelity and Diversity Metrics for Generative Models. In Proceedings of the 37th International Conference on Machine Learning, pages 7176–7185. PMLR, Nov. 2020. URL https://proceedings.mlr.press/v119/naeem20a.html

  22. [22]

    J. Qu, D. Holzmüller, G. Varoquaux, and M. L. Morvan. TabICL: A Tabular Foundation Model for In-Context Learning on Large Data. In Proceedings of the 42nd International Conference on Machine Learning, pages 50817–50847. PMLR, Oct. 2025. URL https://proceedings.mlr.press/v267/qu25d.html

  23. [23]

    H. O. Quinn, M. Sedky, J. Francis, and M. Streeton. Literature Review of Explainable Tabular Data Analysis. Electronics, 13(19), Sept. 2024. ISSN 2079-9292. doi: 10.3390/electronics13193806. URL https://www.mdpi.com/2079-9292/13/19/3806

  24. [24]

    M. S. M. Sajjadi, O. Bachem, M. Lucic, O. Bousquet, and S. Gelly. Assessing Generative Models via Precision and Recall. In Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018. URL https://proceedings.neurips.cc/paper/2018/hash/f7696a9b362ac5a51c3dc8f098b73923-Abstract.html

  25. [25]

    Tabular data: Deep learning is not all you need

    R. Shwartz-Ziv and A. Armon. Tabular data: Deep learning is not all you need. Information Fusion, 81:84–90, May 2022. ISSN 1566-2535. doi: 10.1016/j.inffus.2021.11.011. URL https://www.sciencedirect.com/science/article/pii/S1566253521002360

  26. [26]

    OpenML: networked science in machine learning

    J. Vanschoren, J. N. van Rijn, B. Bischl, and L. Torgo. OpenML: networked science in machine learning. SIGKDD Explor. Newsl., 15(2):49–60, June 2014. ISSN 1931-0145. doi: 10.1145/2641190.2641198. URL https://dl.acm.org/doi/10.1145/2641190.2641198

  27. [27]

    X. Wen, H. Zhang, S. Zheng, W. Xu, and J. Bian. From Supervised to Generative: A Novel Paradigm for Tabular Deep Learning with Large Language Models. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD '24, pages 3323–3333, New York, NY, USA, Aug. 2024. Association for Computing Machinery. ISBN 979-8-4007-0490-1. d…

  28. [28]

    Mitra: Mixed Synthetic Priors for Enhancing Tabular Foundation Models

    X. Zhang, D. C. Maddix, J. Yin, N. Erickson, A. F. Ansari, B. Han, S. Zhang, L. Akoglu, C. Faloutsos, M. W. Mahoney, C. Hu, H. Rangwala, G. Karypis, and B. Wang. Mitra: Mixed Synthetic Priors for Enhancing Tabular Foundation Models, Oct. 2025. URL http://arxiv.org/abs/2510.21204. arXiv:2510.21204 [cs]

  29. [29]

    LimiX: Unleashing Structured-Data Modeling Capability for Generalist Intelligence

    X. Zhang, G. Ren, H. Yu, H. Yuan, H. Wang, J. Li, J. Wu, L. Mo, L. Mao, M. Hao, N. Dai, R. Xu, S. Li, T. Zhang, Y. He, Y. Wang, Y. Zhang, Z. Xu, D. Li, F. Gao, H. Zou, J. Liu, J. Liu, J. Xu, K. Cheng, K. Li, L. Zhou, Q. Li, S. Fan, X. Lin, X. Han, X. Li, Y. Lu, Y. Xue, Y. Jiang, Z. Wang, Z. Wang, and P. Cui. LimiX: Unleashing Structured-Data Modeling Capability for Generalist Intelligence. arXiv preprint arXiv:2509.03505, 2025

    Mutation: each parameter is independently mutated with probability pmut = 0.5 . Con- tinuous parameters receive Gaussian noise N(0,0.1·(hi−lo)) ; integer parameters use N(0,(hi−lo)/6); categorical parameters are resampled uniformly. ResultsNeither the Bayesian nor genetic optimisation is able to significantly decrease the AUC of a discriminator, despite e...