pith. machine review for the scientific record.

arxiv: 2605.12292 · v1 · submitted 2026-05-12 · 💻 cs.LG

Recognition: no theorem link

STRABLE: Benchmarking Tabular Machine Learning with Strings

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 05:07 UTC · model grok-4.3

classification 💻 cs.LG
keywords tabular learning · string embeddings · mixed data types · benchmark suite · categorical data · LLM encoders · empirical evaluation · modular pipelines

The pith

Most real-world tables mixing strings and numbers are categorical-dominant, so advanced tabular models paired with simple string embeddings deliver strong results at low cost, while large language model encoders become competitive only on the smaller subset of free-text-dominant tables.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces STRABLE, a collection of 108 real-world tables that combine string entries with numeric ones across many application areas. It evaluates 445 pipelines that either model strings and numbers jointly in one architecture or encode strings first and then apply a separate tabular learner. The central result is that categorical strings predominate in these tables, making simple embeddings paired with strong tabular models effective and efficient. On the smaller subset of tables dominated by free text, large language model encoders reach competitive accuracy but require careful post-processing whose effects vary by model family. The benchmark produces pipeline rankings that generalize across tables, providing a stable foundation for further work on mixed string-numeric data.
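The modular route this summary describes, encoding strings to numbers first and then handing the purely numeric table to any tabular learner, can be sketched in a few lines. The encoder and data below are illustrative stand-ins, not the paper's implementation.

```python
# Minimal sketch of a "modular" pipeline in the paper's sense:
# string columns are encoded to integers first, then any tabular
# learner can consume the resulting all-numeric table.

def ordinal_encode(column):
    """Map each distinct string to an integer (a simple categorical encoding)."""
    levels = {v: i for i, v in enumerate(sorted(set(column)))}
    return [levels[v] for v in column], levels

def encode_table(rows, string_cols):
    """Encode every string column; numeric columns pass through unchanged."""
    cols = list(zip(*rows))
    out = []
    for j, col in enumerate(cols):
        if j in string_cols:
            encoded, _ = ordinal_encode(col)
            out.append(encoded)
        else:
            out.append(list(col))
    return [list(r) for r in zip(*out)]

rows = [["red", 1.0], ["blue", 2.0], ["red", 3.0]]
numeric = encode_table(rows, string_cols={0})
# "red" and "blue" become integers; the numeric column is untouched,
# and any tabular learner can now be applied to `numeric`.
```

Swapping `ordinal_encode` for a richer string embedding, or the downstream learner for a stronger tabular model, is exactly the design space the 445 pipelines explore.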

Core claim

STRABLE supplies 108 tables drawn from diverse real-world problems. On this corpus, modular pipelines that encode strings simply and then apply advanced tabular learners outperform or match end-to-end string-numeric architectures for the majority of tables, which turn out to be categorical-dominant. On free-text-dominant tables, large LLM encoders become competitive, yet their success depends on the choice of post-processing step. Pipeline rankings obtained on STRABLE stay close to oracle rankings computed on held-out tables, confirming that the benchmark supports generalizable conclusions about string tabular learning.

What carries the argument

The STRABLE corpus of 108 mixed string-numeric tables together with the systematic comparison of modular encoding-plus-tabular pipelines against end-to-end architectures.

If this is right

  • For categorical-dominant tables, practitioners can obtain near-optimal accuracy by pairing any strong tabular learner with lightweight string encoders instead of training large joint models.
  • On free-text-dominant tables, switching to a large language model encoder is worthwhile, but the choice of post-processing layer must be validated because it affects relative performance across encoder families.
  • Benchmark rankings derived from STRABLE can be trusted to predict which pipelines will perform well on new tables of the same kind.
  • Future tabular learning research should treat string encoding as a first-class design choice rather than an afterthought.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The categorical-versus-free-text distinction offers a practical rule of thumb for selecting an encoding strategy before training begins.
  • Existing tabular benchmarks that contain only numeric columns may systematically underestimate the value of simple string handling techniques.
  • Extending STRABLE with tables that contain more complex string structures such as nested JSON or long documents would test whether the current conclusions continue to hold.
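The rule of thumb in the first bullet can be made concrete. The thresholds below are illustrative guesses, not values taken from the paper:

```python
# Hedged heuristic for the categorical-vs-free-text split: a string
# column with many distinct values AND long entries looks like free
# text; otherwise treat it as categorical. Thresholds are made up.

def column_kind(values, distinct_ratio=0.5, min_avg_len=30):
    """Classify one string column as 'categorical' or 'free_text'."""
    ratio = len(set(values)) / len(values)
    avg_len = sum(len(v) for v in values) / len(values)
    return "free_text" if ratio > distinct_ratio and avg_len > min_avg_len else "categorical"

def table_regime(string_columns):
    """Call a table free-text-dominant when most string columns are free text."""
    kinds = [column_kind(c) for c in string_columns]
    n_free = sum(k == "free_text" for k in kinds)
    return "free_text_dominant" if n_free > len(kinds) / 2 else "categorical_dominant"

colors = ["red", "blue", "red", "green", "blue", "red"]
reviews = [f"a long free-text customer review, number {i}, full of detail" for i in range(6)]
```

Under such a check, `colors` comes out categorical and `reviews` free text, which would steer the first toward lightweight encoders and the second toward LM encoders.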

Load-bearing premise

The 108 tables assembled for STRABLE capture the distribution of string and numeric features that appear in typical real-world supervised learning tasks.

What would settle it

A new collection of several dozen mixed tables on which end-to-end string-numeric models consistently and substantially outperform simple-embedding-plus-tabular-learner pipelines would falsify the central empirical recommendation.

Figures

Figures reproduced from arXiv: 2605.12292 by Alan Arazi, David Holzmüller, Eilam Shapira, Félix Lefebvre, Frank Hutter, Gaël Varoquaux, Gioia Blayer, Lennart Purucker, Marine Le Morvan, Myung Jun Kim, Roi Reichart.

Figure 1: (a) STRABLE (solid) vs. OpenML (dashed); cardinality and string length aggregate over string columns. (b) Performance for Num-only, Str-only and full table (Num+Str) by learner. (0.5%), with the remaining 49.45% being plain Categoricals. Half (50.55%) consist of modalities that string-excluding and string-flattening benchmarks typically ignore or destroy (e.g.: PMLB ignores Names, Structured Codes, Free Te…
Figure 2: Post-processing affects LLM-based embeddings, especially for decoder-only models. Average score across 108 tables and five learners for 7 LM encoders under three post-processing variants. Each panel is one encoder; bars show the mean score under default 30-PCA (blue), standard scaling before 30-PCA (orange), and direct slicing of the first 30 raw embedding dimensions (green). Percentages indicate relative …
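The three post-processing variants compared in Figure 2 can be sketched with a toy PCA. The embedding matrix is random and `k = 8` is chosen for brevity (the paper uses 30 components on real LM embeddings):

```python
import numpy as np

def pca(X, k):
    """Project centered X onto its top-k principal directions via SVD."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T

def standard_scale(X):
    """Zero-mean, unit-variance scaling per embedding dimension."""
    return (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-12)

rng = np.random.default_rng(0)
emb = rng.normal(size=(100, 64))  # stand-in for LM string embeddings

k = 8
variants = {
    "pca": pca(emb, k),                             # default k-PCA
    "scale_then_pca": pca(standard_scale(emb), k),  # scale, then PCA
    "slice": emb[:, :k],                            # first k raw dimensions
}
```

All three produce a table with `k` numeric columns, which is why the same downstream learner can be compared across variants.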
Figure 3: Critical difference diagram for encoder-learner pipelines. Pipelines’ average rank across the 108 datasets are shown in parentheses; lower is better. Dashed lines are E2E, continuous lines are Modular. Pipelines connected by horizontal bars are not statistically distinguishable at the indicated level (test statistic in Appendix D.1). Modular pipelines cluster at the top of the ranking. Pipelines marked wit…
Figure 4: Pareto-optimality plot and benchmark ranking stability. (a) Each point is a pipeline, colored by encoder on the left and by learner on the right. The dotted line is the pareto-optimality frontier. Encoders explain much of the runtime: for a given encoder, performance varies depending on the learner while runtime varies less (aside from tuning or not). Simple and advanced learners benefit differently from v…
Figure 5: (a) Kendall-τ correlation between application-specific subsets and the full benchmark (numbers in parentheses show the number of tables per application field). (b) Each row reports Kendall-τ between STRABLE’s ranking and the ranking of the opposite data preparation (e.g., applying feature engineering or missing-value imputation, which STRABLE does not; or removing target transformations and subsampling, wh…
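The stability check in Figure 5 rests on Kendall's τ between two pipeline rankings. A minimal tie-free version, with toy ranks rather than the paper's numbers, looks like:

```python
from itertools import combinations

def kendall_tau(a, b):
    """Kendall rank correlation for two equal-length, tie-free rank lists."""
    concordant = discordant = 0
    for i, j in combinations(range(len(a)), 2):
        s = (a[i] - a[j]) * (b[i] - b[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    return (concordant - discordant) / (len(a) * (len(a) - 1) / 2)

full_benchmark = [1, 2, 3, 4, 5]  # pipeline ranks on the full benchmark
held_out       = [1, 3, 2, 4, 5]  # ranks recomputed on held-out tables
tau = kendall_tau(full_benchmark, held_out)
# tau near 1 means the benchmark's ranking generalizes; near 0, it does not.
```

Here a single swapped pair out of ten yields τ = 0.8, which is the kind of agreement the paper's stability claim requires.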
Figure 6: Top-10 pipelines per leading string type. Datasets are grouped by their most frequent string type. In the Free Text regime large LLMs enter the top-10 paired with TabPFN-2.5; all other types mirror the global ranking (lightweight encoders at the top paired with TabPFN-2.5, and LM encoders paired with light learners like ExtraTrees). ConTextTab leads the Structured Code panel, plausibly aided by code-rich T…
Original abstract

Benchmarking tabular learning has revealed the benefit of dedicated architectures, pushing the state of the art. But real-world tables often contain string entries, beyond numbers, and these settings have been understudied due to a lack of a solid benchmarking suite. They lead to new research questions: Are dedicated learners needed, with end-to-end modeling of strings and numbers? Or does it suffice to encode strings as numbers, as with a categorical encoding? And if so, do the resulting tables resemble numerical tabular data, calling for the same learners? To enable these studies, we contribute STRABLE, a benchmarking corpus of 108 tables, all real-world learning problems with strings and numbers across diverse application fields. We run the first large-scale empirical study of tabular learning with strings, evaluating 445 pipelines. These pipelines span end-to-end architectures and modular pipelines, where strings are first encoded, then post-processed, and finally passed to a tabular learner. We find that, because most tables in the wild are categorical-dominant, advanced tabular learners paired with simple string embeddings achieve good predictions at low computational cost. On free-text-dominant tables, large LLM encoders become competitive. Their performance also appears sensitive to post-processing, with differences across LLM families. Finally, we show that STRABLE is a good set of tables to study "string tabular" learning as it leads to generalizable pipeline rankings that are close to the oracle rankings. We thus establish STRABLE as a foundation for research on tabular learning with strings, an important yet understudied area.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces STRABLE, a benchmark of 108 real-world tables containing both strings and numerical features across diverse fields. It evaluates 445 pipelines combining various string encoders (simple embeddings, categorical encodings, and LLM-based) with tabular learners, finding that advanced tabular models paired with simple string embeddings perform well on categorical-dominant tables at low cost, while large LLM encoders become competitive on free-text-dominant tables. The work also validates that pipeline rankings on STRABLE are close to oracle rankings, positioning the benchmark as a foundation for research on tabular learning with strings.

Significance. If the empirical patterns hold and the tables are representative, this provides a much-needed dedicated benchmark and practical guidance for handling mixed string-numeric data, an understudied but common real-world setting. It highlights efficient, low-cost approaches for the majority of cases and identifies when more expensive LLM encoders add value, potentially influencing both practitioner choices and future model development in tabular ML.

major comments (2)
  1. [Benchmark construction / data collection section] The claim that 'most tables in the wild are categorical-dominant' and the resulting pipeline recommendations rest on the 108 tables being representative, yet the manuscript provides no explicit sampling frame, stratification by domain or string-type ratio, or distributional comparison against reference corpora such as OpenML or Kaggle. This is load-bearing for generalizing the findings beyond the specific benchmark.
  2. [Experimental evaluation / results section] While 445 pipelines are evaluated, details on the exact metrics (e.g., specific loss functions or evaluation measures for each task type) and any statistical significance testing of performance differences are insufficiently described, weakening the robustness of conclusions about when simple embeddings suffice versus when LLMs compete.
minor comments (2)
  1. [Introduction / abstract] Clarify the quantitative thresholds or definitions used to classify tables as 'categorical-dominant' versus 'free-text-dominant' with an explicit criterion or table in the main text.
  2. [Figures and results] Figure captions and legends should include more detail on the exact comparison being shown (e.g., which encoders and learners) to improve standalone readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback, which helps strengthen the presentation and robustness of our STRABLE benchmark. We address each major comment below with planned revisions to the manuscript.

Point-by-point responses
  1. Referee: The claim that 'most tables in the wild are categorical-dominant' and the resulting pipeline recommendations rest on the 108 tables being representative, yet the manuscript provides no explicit sampling frame, stratification by domain or string-type ratio, or distributional comparison against reference corpora such as OpenML or Kaggle. This is load-bearing for generalizing the findings beyond the specific benchmark.

    Authors: We agree that justifying representativeness is important for generalizing observations about categorical-dominant tables. In the revised manuscript, we will expand the benchmark construction section with a detailed account of our data collection process: tables were sourced from public repositories including Kaggle, UCI, and domain-specific datasets (e.g., healthcare, finance, e-commerce), filtered for mixed string-numeric content with at least one string column and sufficient size for learning tasks. We will add stratification details by domain and string-type ratio (e.g., proportion of free-text vs. categorical strings), along with distributional comparisons such as column-type histograms against OpenML and Kaggle subsets where direct access permits. While a complete probabilistic sampling frame for all real-world tables remains challenging without a universal registry, these additions will better support our claims and the resulting pipeline recommendations. revision: yes

  2. Referee: While 445 pipelines are evaluated, details on the exact metrics (e.g., specific loss functions or evaluation measures for each task type) and any statistical significance testing of performance differences are insufficiently described, weakening the robustness of conclusions about when simple embeddings suffice versus when LLMs compete.

    Authors: We concur that greater specificity on metrics and statistical testing will improve the reliability of our conclusions. The revised experimental evaluation section will explicitly state: classification tasks use accuracy, macro-F1, and AUC-ROC with cross-entropy loss; regression tasks use MSE and R^2 with MSE loss. All results are obtained via 5-fold cross-validation with fixed random seeds. We will add statistical significance testing using paired Wilcoxon signed-rank tests (with Bonferroni correction for multiple comparisons) to evaluate differences between simple-embedding pipelines and LLM-based ones, reporting p-values when discussing when simple embeddings suffice versus when LLMs become competitive. These updates will be incorporated into both the methods and results sections. revision: yes
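The correction step the rebuttal proposes is simple to state. A sketch with made-up p-values follows; the real ones would come from paired Wilcoxon signed-rank tests between pipelines, as the authors describe.

```python
# Bonferroni correction for m pairwise pipeline comparisons: the
# per-test significance threshold shrinks from alpha to alpha / m.
# The p-values below are toy numbers, not results from the paper.

def bonferroni_significant(p_values, alpha=0.05):
    """Flag which of m comparisons survive the Bonferroni-corrected threshold."""
    m = len(p_values)
    return [p < alpha / m for p in p_values]

p_raw = [0.001, 0.020, 0.040]          # three pipeline-vs-pipeline comparisons
flags = bonferroni_significant(p_raw)  # threshold drops to 0.05 / 3 ≈ 0.0167
```

Only the first comparison survives here: two results that look significant at α = 0.05 in isolation do not after correcting for three tests.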

Circularity Check

0 steps flagged

No circularity: empirical benchmarking study with no derivations or self-referential reductions

Full rationale

This is a pure empirical benchmarking paper that collects 108 real-world tables and evaluates 445 pipelines via held-out performance. No equations, fitted parameters renamed as predictions, ansatzes, or uniqueness theorems appear in the manuscript. All central claims (e.g., simple embeddings suffice for categorical-dominant tables) are direct observations from the experimental results on the collected corpus rather than reductions to prior self-citations or input definitions. The study's conclusions therefore rest on external benchmark measurements rather than on self-referential reasoning.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical derivations or invented entities; the work rests on the empirical collection of tables and standard ML evaluation practices.

pith-pipeline@v0.9.0 · 5620 in / 989 out tokens · 48971 ms · 2026-05-13T05:07:26.942195+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

178 extracted references · 178 canonical work pages · 4 internal anchors

  1. [1]

    TabSTAR: A Tabular Foundation Model for Tabular Data with Text Fields

    Alan Arazi, Eilam Shapira, and Roi Reichart. TabSTAR: A Tabular Foundation Model for Tabular Data with Text Fields. In D. Belgrave, C. Zhang, H. Lin, R. Pas- canu, P. Koniusz, M. Ghassemi, and N. Chen, editors,Advances in Neural Informa- tion Processing Systems, volume 38, pages 172108–172161. Curran Associates, Inc.,

  2. [2]

    URL https://proceedings.neurips.cc/paper_files/paper/2025/file/ faf6e23e198314c7728eaa6ac44ae079-Paper-Conference.pdf

  3. [3]

    Openml benchmark- ing suites

    Bernd Bischl, Giuseppe Casalicchio, Matthias Feurer, Pieter Gijsbers, Frank Hutter, Michel Lang, Rafael Gomes Mantovani, Jan N van Rijn, and Joaquin Vanschoren. Openml benchmark- ing suites. InProceedings of the NeurIPS 2021 Datasets and Benchmarks Track, 2021

  4. [4]

    Openml: Insights from 10 years and more than a thousand papers.Patterns, 2025

    Bernd Bischl, Giuseppe Casalicchio, Taniya Das, Matthias Feurer, Sebastian Fischer, Pieter Gijsbers, Subhaditya Mukherjee, Andreas C Müller, László Németh, Luis Oala, et al. Openml: Insights from 10 years and more than a thousand papers.Patterns, 2025

  5. [5]

    Encoding high-cardinality string categorical variables

    Patricio Cerda and Gaël Varoquaux. Encoding high-cardinality string categorical variables. IEEE Transactions on Knowledge and Data Engineering, 34(3):1164–1176, 2020

  6. [6]

    In: Krishnapuram, B

    Tianqi Chen and Carlos Guestrin. Xgboost: A scalable tree boosting system. InProceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’16, page 785–794. ACM, August 2016. doi: 10.1145/2939672.2939785

  7. [7]

    On multiple comparisons procedures.Technical Report LA-7677-MS, Los Alamos Scientific Laboratory, 1979

    William J Conover and Ronald L Iman. On multiple comparisons procedures.Technical Report LA-7677-MS, Los Alamos Scientific Laboratory, 1979

  8. [8]

    Data prep still dominates data scientists’ time, sur- vey finds, 2020

    Datanami. Data prep still dominates data scientists’ time, sur- vey finds, 2020. URL https://www.datanami.com/2020/07/06/ data-prep-still-dominates-data-scientists-time-survey-finds/

  9. [9]

    In: 2009 IEEE Conference on Computer Vision and Pattern Recognition

    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large- scale hierarchical image database. In2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, 2009. doi: 10.1109/CVPR.2009.5206848

  10. [10]

    A formula for the gini coefficient.The Review of Economics and Statistics, 61 (1):146–49, 1979

    Robert Dorfman. A formula for the gini coefficient.The Review of Economics and Statistics, 61 (1):146–49, 1979. URL https://EconPapers.repec.org/RePEc:tpr:restat:v:61:y: 1979:i:1:p:146-49. 10

  11. [11]

    The llama 3 herd of models.arXiv e-prints, pages arXiv–2407, 2024

    Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models.arXiv e-prints, pages arXiv–2407, 2024

  12. [12]

    URL https://arxiv.org/abs/ 2502.13595

    Kenneth Enevoldsen, Isaac Chung, Imene Kerboua, Márton Kardos, Ashwin Mathur, David Stap, Jay Gala, Wissam Siblini, Dominik Krzemi ´nski, Genta Indra Winata, Saba Sturua, Saiteja Utpala, Mathieu Ciancone, Marion Schaeffer, Gabriel Sequeira, Diganta Misra, Shreeya Dhakal, Jonathan Rystrøm, Roman Solomatin, Ömer Ça˘gatan, Akash Kundu, Martin Bernstorff, Shi...

  13. [13]

    Tabarena: A living benchmark for machine learning on tabular data.Advances in Neural Information Processing Systems, 39, 2025

    Nick Erickson, Lennart Purucker, Andrej Tschalzev, David Holzmüller, Prateek Mutalik Desai, David Salinas, and Frank Hutter. Tabarena: A living benchmark for machine learning on tabular data.Advances in Neural Information Processing Systems, 39, 2025

  14. [14]

    How contextual are contextualized word representations? comparing the geometry of bert, elmo, and gpt-2 embeddings, 2019

    Kawin Ethayarajh. How contextual are contextualized word representations? comparing the geometry of bert, elmo, and gpt-2 embeddings, 2019. URL https://arxiv.org/abs/1909. 00512

  15. [15]

    Statistical Methods Related to the Law of the Iterated Logarithm

    Milton Friedman. A Comparison of Alternative Tests of Significance for the Problem of m Rankings.The Annals of Mathematical Statistics, 11(1):86 – 92, 1940. doi: 10.1214/aoms/ 1177731944

  16. [16]

    Representation degeneration problem in training natural language generation models.arXiv preprint arXiv:1907.12009, 2019

    Jun Gao, Di He, Xu Tan, Tao Qin, Liwei Wang, and Tie-Yan Liu. Representation degeneration problem in training natural language generation models, 2019. URL https://arxiv.org/ abs/1907.12009

  17. [17]

    Large scale transfer learning for tabular data via language modeling.Advances in Neural Information Processing Systems, 37:45155– 45205, 2024

    Josh Gardner, Juan C Perdomo, and Ludwig Schmidt. Large scale transfer learning for tabular data via language modeling.Advances in Neural Information Processing Systems, 37:45155– 45205, 2024

  18. [18]

    Extremely randomized trees.Machine learning, 63(1):3–42, 2006

    Pierre Geurts, Damien Ernst, and Louis Wehenkel. Extremely randomized trees.Machine learning, 63(1):3–42, 2006

  19. [19]

    Amlb: an automl benchmark.Journal of Machine Learning Research, 25(101):1–65, 2024

    Pieter Gijsbers, Marcos LP Bueno, Stefan Coors, Erin LeDell, Sébastien Poirier, Janek Thomas, Bernd Bischl, and Joaquin Vanschoren. Amlb: an automl benchmark.Journal of Machine Learning Research, 25(101):1–65, 2024

  20. [20]

    Tabm: Advancing tabular deep learning with parameter-efficient ensembling

    Yury Gorishniy, Akim Kotelnikov, and Artem Babenko. Tabm: Advancing tabular deep learning with parameter-efficient ensembling. InThe Thirteenth International Conference on Learning Representations, 2025

  21. [21]

    The illusion of generalization: Re-examining tabular language model evaluation, 2026

    Aditya Gorla and Ratish Puduppully. The illusion of generalization: Re-examining tabular language model evaluation, 2026. URLhttps://arxiv.org/abs/2602.04031

  22. [22]

    Why do tree-based models still outperform deep learning on typical tabular data?Advances in neural information processing systems, 35:507–520, 2022

    Léo Grinsztajn, Edouard Oyallon, and Gaël Varoquaux. Why do tree-based models still outperform deep learning on typical tabular data?Advances in neural information processing systems, 35:507–520, 2022. 11

  23. [23]

    Vectorizing string entries for data processing on tables: when are larger language models better?arXiv preprint arXiv:2312.09634, 2023

    Léo Grinsztajn, Edouard Oyallon, Myung Jun Kim, and Gaël Varoquaux. Vectorizing string entries for data processing on tables: when are larger language models better?, 2023. URL https://arxiv.org/abs/2312.09634

  24. [24]

    TabPFN-2.5: Advancing the State of the Art in Tabular Foundation Models

    Léo Grinsztajn, Klemens Flöge, Oscar Key, Felix Birkel, Philipp Jund, Brendan Roof, Benjamin Jäger, Dominik Safaric, Simone Alessi, Adrian Hayler, Mihir Manium, Rosen Yu, Felix Jablon- ski, Shi Bin Hoo, Anurag Garg, Jake Robertson, Magnus Bühler, Vladyslav Moroshan, Lennart Purucker, Clara Cornu, Lilly Charlotte Wehrhahn, Alessandro Bonetto, Bernhard Schö...

  25. [25]

    The emerging science of machine learning benchmarks

    Moritz Hardt. The emerging science of machine learning benchmarks. Online at https: //mlbenchmarks.org, 2025. Manuscript

  26. [26]

    Springer New York, New York, NY ,

    Winston Haynes.Holm’s Method, pages 902–902. Springer New York, New York, NY ,

  27. [27]

    doi: 10.1007/978-1-4419-9863-7_1214

    ISBN 978-1-4419-9863-7. doi: 10.1007/978-1-4419-9863-7_1214. URL https: //doi.org/10.1007/978-1-4419-9863-7_1214

  28. [28]

    Hoerl and Robert W

    Arthur E. Hoerl and Robert W. Kennard. Ridge regression: Biased estimation for nonorthogonal problems.Technometrics, 12(1):55–67, 1970. ISSN 00401706. URL http://www.jstor. org/stable/1267351

  29. [29]

    Accurate predictions on small data with a tabular foundation model.Nature, 637(8045):319–326, 2025

    Noah Hollmann, Samuel Müller, Lennart Purucker, Arjun Krishnakumar, Max Körfer, Shi Bin Hoo, Robin Tibor Schirrmeister, and Frank Hutter. Accurate predictions on small data with a tabular foundation model.Nature, 637(8045):319–326, 2025

  30. [30]

    Better by default: Strong pre-tuned mlps and boosted trees on tabular data.Advances in Neural Information Processing Systems, 37:26577–26658, 2024

    David Holzmüller, Léo Grinsztajn, and Ingo Steinwart. Better by default: Strong pre-tuned mlps and boosted trees on tabular data.Advances in Neural Information Processing Systems, 37:26577–26658, 2024

  31. [31]

    Principal component analysis: a review and recent developments

    Ian T. Jolliffe and Jorge Cadima. Principal component analysis: a review and recent develop- ments.Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineer- ing Sciences, 374(2065):20150202, 04 2016. ISSN 1364-503X. doi: 10.1098/rsta.2015.0202. URLhttps://doi.org/10.1098/rsta.2015.0202

  32. [32]

    A new measure of rank correlation.Biometrika, 30(1-2):81–93, 1938

    Maurice G Kendall. A new measure of rank correlation.Biometrika, 30(1-2):81–93, 1938

  33. [33]

    Carte: Pretraining and transfer for tabular learning.ICML, 2024

    Myung Jun Kim, Léo Grinsztajn, and Gaël Varoquaux. Carte: Pretraining and transfer for tabular learning.ICML, 2024

  34. [34]

    Table foundation models: on knowledge pre-training for tabular learning.TMLR, 2025

    Myung Jun Kim, Félix Lefebvre, Gaëtan Brison, Alexandre Perez-Lebel, and Gaël Varoquaux. Table foundation models: on knowledge pre-training for tabular learning.TMLR, 2025

  35. [35]

    Pmlbmini: A tabular classification benchmark suite for data-scarce applications

    Ricardo Knauer, Marvin Grimm, and Erik Rodner. Pmlbmini: A tabular classification benchmark suite for data-scarce applications. InAutoML Conference 2024 (ABCD Track), 2024

  36. [36]

    Springer, 2013

    Max Kuhn and Kjell Johnson.Applied Predictive Modeling. Springer, 2013. ISBN 978-1-4614- 6848-6

  37. [37]

    Matryoshka representation learning, 2024

    Aditya Kusupati, Gantavya Bhatt, Aniket Rege, Matthew Wallingford, Aditya Sinha, Vivek Ramanujan, William Howard-Snyder, Kaifeng Chen, Sham Kakade, Prateek Jain, and Ali Farhadi. Matryoshka representation learning, 2024. URL https://arxiv.org/abs/2205. 13147

  38. [38]

    LeCun, L

    Y . Lecun, L. Bottou, Y . Bengio, and P. Haffner. Gradient-based learning applied to document recognition.Proceedings of the IEEE, 86(11):2278–2324, 1998. doi: 10.1109/5.726791

  39. [39]

    When do neural nets outperform boosted trees on tabular data?Advances in Neural Information Processing Systems, 36:76336–76369, 2023

    Duncan McElfresh, Sujay Khandagale, Jonathan Valverde, Vishak Prasad C, Ganesh Ramakr- ishnan, Micah Goldblum, and Colin White. When do neural nets outperform boosted trees on tabular data?Advances in Neural Information Processing Systems, 36:76336–76369, 2023

  40. [40]

    A preprocessing scheme for high-cardinality categorical attributes in classification and prediction problems.ACM SIGKDD explorations newsletter, 3(1):27–32, 2001

    Daniele Micci-Barreca. A preprocessing scheme for high-cardinality categorical attributes in classification and prediction problems.ACM SIGKDD explorations newsletter, 3(1):27–32, 2001. 12

  41. [41]

    Advances in pre-training distributed word representations

    Tomas Mikolov, Edouard Grave, Piotr Bojanowski, Christian Puhrsch, and Armand Joulin. Advances in pre-training distributed word representations. InProceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan, May 2018. European Language Resources Association (ELRA)

  42. [42]

    Towards benchmarking foundation models for tabular data with text, 2025

    Martin Mráz, Breenda Das, Anshul Gupta, Lennart Purucker, and Frank Hutter. Towards benchmarking foundation models for tabular data with text, 2025. URL https://arxiv.org/ abs/2507.07829

  43. [43]

    MTEB: Massive text embedding benchmark

    Niklas Muennighoff, Nouamane Tazi, Loic Magne, and Nils Reimers. MTEB: Massive text embedding benchmark. InProceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 2014–2037, Dubrovnik, Croatia, May 2023. Association for Computational Linguistics

  44. [44]

    Transformers can do bayesian inference, 2024

    Samuel Müller, Noah Hollmann, Sebastian Pineda Arango, Josif Grabocka, and Frank Hutter. Transformers can do bayesian inference, 2024. URL https://arxiv.org/abs/2112.10510

  45. [45]

    Olson, William La Cava, Patryk Orzechowski, Ryan J

    Randal S. Olson, William La Cava, Patryk Orzechowski, Ryan J. Urbanowicz, and Jason H. Moore. Pmlb: a large benchmark suite for machine learning evaluation and comparison. BioData Mining, 10(1):36, Dec 2017. ISSN 1756-0381. doi: 10.1186/s13040-017-0154-4

  46. [46]

    Scikit- learn: Machine learning in python.the Journal of machine Learning research, 12:2825–2830, 2011

    Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. Scikit- learn: Machine learning in python.the Journal of machine Learning research, 12:2825–2830, 2011

  47. [47]

    Catboost: unbiased boosting with categorical features.Advances in neural information processing systems, 31, 2018

    Liudmila Prokhorenkova, Gleb Gusev, Aleksandr V orobev, Anna Veronika Dorogush, and Andrey Gulin. Catboost: unbiased boosting with categorical features.Advances in neural information processing systems, 31, 2018

  48. [48]

    Tabicl: A tabular foundation model for in-context learning on large data

    Jingang Qu, David Holzmüller, Gaël Varoquaux, and Marine Le Morvan. Tabicl: A tabular foundation model for in-context learning on large data. InForty-second International Conference on Machine Learning, 2025

  49. [49]

    Tabiclv2: A better, faster, scalable, and open tabular foundation model, 2026

    Jingang Qu, David Holzmüller, Gaël Varoquaux, and Marine Le Morvan. Tabiclv2: A better, faster, scalable, and open tabular foundation model, 2026. URL https://arxiv.org/abs/ 2602.11139

  50. [50]

    Do imagenet classifiers generalize to imagenet? InInternational conference on machine learning, pages 5389–5400

    Benjamin Recht, Rebecca Roelofs, Ludwig Schmidt, and Vaishaal Shankar. Do imagenet classifiers generalize to imagenet? InInternational conference on machine learning, pages 5389–5400. PMLR, 2019

  51. [51]

    Sentence-bert: Sentence embeddings using siamese bert- networks

    Nils Reimers and Iryna Gurevych. Sentence-bert: Sentence embeddings using siamese bert- networks. InProceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 11 2019

  52. [52]

    A meta-analysis of overfitting in machine learning.Advances in neural information processing systems, 32, 2019

    Rebecca Roelofs, Vaishaal Shankar, Benjamin Recht, Sara Fridovich-Keil, Moritz Hardt, John Miller, and Ludwig Schmidt. A meta-analysis of overfitting in machine learning.Advances in neural information processing systems, 32, 2019

  53. [53]

    Tabred: Analyzing pitfalls and filling the gaps in tabular deep learning benchmarks

    Ivan Rubachev, Nikolay Kartashev, Yury Gorishniy, and Artem Babenko. Tabred: Analyzing pitfalls and filling the gaps in tabular deep learning benchmarks. InThe Thirteenth International Conference on Learning Representations, 2024

  54. [54]

    David Salinas and Nick Erickson. TabRepo: A large scale repository of tabular model evaluations and its AutoML applications. In AutoML Conference 2024 (ABCD Track), 2024.

  55. [55]

    scikit-learn developers. Importance of feature scaling. https://scikit-learn.org/stable/auto_examples/preprocessing/plot_scaling_importance.html, 2026. scikit-learn documentation, accessed April 2026.

  56. [56]

    Xingjian Shi, Jonas Mueller, Nick Erickson, Mu Li, and Alexander J. Smola. Benchmarking multimodal AutoML for tabular data with text fields, 2021. URL https://arxiv.org/abs/2111.02705.

  57. [57]

    Skrub. Skrub software. https://skrub-data.org, 2026.

  58. [58]

    Marco Spinaci, Marek Polewczyk, Maximilian Schambach, and Sam Thelin. ConTextTab: A semantics-aware tabular in-context learner. Advances in Neural Information Processing Systems, 39, 2025.

  59. [59]

    Michael Stonebraker and El Kindi Rezig. Machine learning and big data: What is important? IEEE Data Eng. Bull., 42(4):3–7, 2019

  60. [60]

    Anton Frederik Thielmann, Manish Kumar, Christoph Weisser, Arik Reuter, Benjamin Säfken, and Soheila Samiee. Mambular: A sequential model for tabular deep learning, 2025. URL https://arxiv.org/abs/2408.06291

  61. [61]

    Joaquin Vanschoren, Jan N Van Rijn, Bernd Bischl, and Luis Torgo. OpenML: networked science in machine learning. ACM SIGKDD Explorations Newsletter, 15(2):49–60, 2014.

  62. [62]

    Roman Vershynin. High-Dimensional Probability. Cambridge University Press, 2018.

  63. [63]

    Liane Vogel, Kavitha Srinivas, Niharika D'Souza, Sola Shirai, Oktie Hassanzadeh, and Horst Samulowitz. Towards universal tabular embeddings: A benchmark across data tasks, 2026. URL https://arxiv.org/abs/2604.21696.

  64. [64]

    Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder, and Furu Wei. Text embeddings by weakly-supervised contrastive pre-training. arXiv preprint arXiv:2212.03533, 2022

  65. [65]

    Frank Wilcoxon. Individual comparisons by ranking methods. Biometrics Bulletin, 1(6):80–83, 1945. ISSN 00994987. URL http://www.jstor.org/stable/3001968.

  67. [67]

    D.H. Wolpert and W.G. Macready. No free lunch theorems for optimization. IEEE Transactions on Evolutionary Computation, 1(1):67–82, 1997. doi: 10.1109/4235.585893.

  68. [68]

    Han-Jia Ye, Si-Yang Liu, Hao-Run Cai, Qi-Le Zhou, and De-Chuan Zhan. A closer look at deep learning methods on tabular datasets, 2025. URL https://arxiv.org/abs/2407.00956.

  69. [69]

    Guri Zabërgja, Arlind Kadra, Christian M. M. Frey, and Josif Grabocka. Tabular data: Is deep learning all you need?, 2025. URL https://arxiv.org/abs/2402.03970.

  70. [70]

    Dun Zhang, Ziyang Zeng, Yudong Zhou, and Shuyang Lu. Jasper-token-compression-600m technical report, 2025. URL https://arxiv.org/abs/2511.14405.

  71. [71]

    Yanzhao Zhang, Mingxin Li, Dingkun Long, Xin Zhang, Huan Lin, Baosong Yang, Pengjun Xie, An Yang, Dayiheng Liu, Junyang Lin, Fei Huang, and Jingren Zhou. Qwen3 Embedding: Advancing text embedding and reranking through foundation models. arXiv preprint arXiv:2506.05176, 2025.

  72. [72]

    ACA Federal Upper Limits 5 Price limits for multi-source drugs under the Medicaid program. The task is to predict the federal upper price limit

  73. [73]

    AI/ML Salaries6 Salary and basic information for workers in the machine learning and data science industry. The task is to predict worker salaries

  74. [74]

    Animal and Veterinary Event7 Health problems reported in animals following the use of drug products. The task is to predict the severity of clinical signs

  75. [75]

    Antenna Structure Registration8 FCC registration data for antenna structures. The task is to predict the height of the structures

  76. [76]

    Awarded Grants IMLS9 Grants awarded by the Institute of Museum and Library Services. The task is to predict the specific grant amount.
    5 https://www.medicaid.gov/medicaid/prescription-drugs/federal-upper-limit
    6 https://ai-jobs.net/salaries/download/salaries.csv
    7 https://open.fda.gov/apis/animalandveterinary/event/
    8 https://hifld-geoplatform.opendata.arcgis...

  77. [77]

    Beer Ratings10 Tasting profiles and consumer reviews for over 3,000 unique beers. The task is to predict overall review ratings

  78. [78]

    Broadband Availability11 Data on internet speed and availability across the US. The task is to predict the maximum available download speed

  79. [79]

    California Housing12 Median house values and demographics from the 1990 California census. The task is to predict median house prices

  80. [80]

    Child Adult Healthcare Quality13 Quality of care metrics for Medicaid and CHIP beneficiaries. The task is to predict healthcare performance scores.
