pith. machine review for the scientific record.

arxiv: 2605.12292 · v1 · submitted 2026-05-12 · 💻 cs.LG

Recognition: no theorem link

STRABLE: Benchmarking Tabular Machine Learning with Strings

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 05:07 UTC · model grok-4.3

classification 💻 cs.LG
keywords tabular learning · string embeddings · mixed data types · benchmark suite · categorical data · LLM encoders · empirical evaluation · modular pipelines

The pith

Most real-world tables mixing strings and numbers are categorical-dominant, so advanced tabular models paired with simple string embeddings deliver strong results at low cost, while large language model encoders become competitive only on the smaller subset of free-text-dominant tables.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces STRABLE, a collection of 108 real-world tables that combine string entries with numeric ones across many application areas. It evaluates 445 pipelines that either model strings and numbers jointly in one architecture or encode strings first and then apply a separate tabular learner. The central result is that categorical strings predominate in these tables, making simple embeddings paired with strong tabular models effective and efficient. On the smaller subset of tables dominated by free text, large language model encoders reach competitive accuracy but require careful post-processing whose effects vary by model family. The benchmark produces pipeline rankings that generalize across tables, providing a stable foundation for further work on mixed string-numeric data.
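The modular route this summary describes, encoding strings to numbers first and then handing the purely numeric table to any tabular learner, can be sketched in a few lines. The encoder and data below are illustrative stand-ins, not the paper's implementation.

```python
# Minimal sketch of a "modular" pipeline in the paper's sense:
# string columns are encoded to integers first, then any tabular
# learner can consume the resulting all-numeric table.

def ordinal_encode(column):
    """Map each distinct string to an integer (a simple categorical encoding)."""
    levels = {v: i for i, v in enumerate(sorted(set(column)))}
    return [levels[v] for v in column], levels

def encode_table(rows, string_cols):
    """Encode every string column; numeric columns pass through unchanged."""
    cols = list(zip(*rows))
    out = []
    for j, col in enumerate(cols):
        if j in string_cols:
            encoded, _ = ordinal_encode(col)
            out.append(encoded)
        else:
            out.append(list(col))
    return [list(r) for r in zip(*out)]

rows = [["red", 1.0], ["blue", 2.0], ["red", 3.0]]
numeric = encode_table(rows, string_cols={0})
# "red" and "blue" become integers; the numeric column is untouched,
# and any tabular learner can now be applied to `numeric`.
```

Swapping `ordinal_encode` for a richer string embedding, or the downstream learner for a stronger tabular model, is exactly the design space the 445 pipelines explore.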

Core claim

STRABLE supplies 108 tables drawn from diverse real-world problems. On this corpus, modular pipelines that encode strings simply and then apply advanced tabular learners outperform or match end-to-end string-numeric architectures for the majority of tables, which turn out to be categorical-dominant. On free-text-dominant tables, large LLM encoders become competitive, yet their success depends on the choice of post-processing step. Pipeline rankings obtained on STRABLE stay close to oracle rankings computed on held-out tables, confirming that the benchmark supports generalizable conclusions about string tabular learning.

What carries the argument

The STRABLE corpus of 108 mixed string-numeric tables together with the systematic comparison of modular encoding-plus-tabular pipelines against end-to-end architectures.

If this is right

  • For categorical-dominant tables, practitioners can obtain near-optimal accuracy by pairing any strong tabular learner with lightweight string encoders instead of training large joint models.
  • On free-text-dominant tables, switching to a large language model encoder is worthwhile, but the choice of post-processing layer must be validated because it affects relative performance across encoder families.
  • Benchmark rankings derived from STRABLE can be trusted to predict which pipelines will perform well on new tables of the same kind.
  • Future tabular learning research should treat string encoding as a first-class design choice rather than an afterthought.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The categorical-versus-free-text distinction offers a practical rule of thumb for selecting an encoding strategy before training begins.
  • Existing tabular benchmarks that contain only numeric columns may systematically underestimate the value of simple string handling techniques.
  • Extending STRABLE with tables that contain more complex string structures such as nested JSON or long documents would test whether the current conclusions continue to hold.
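The rule of thumb in the first bullet can be made concrete. The thresholds below are illustrative guesses, not values taken from the paper:

```python
# Hedged heuristic for the categorical-vs-free-text split: a string
# column with many distinct values AND long entries looks like free
# text; otherwise treat it as categorical. Thresholds are made up.

def column_kind(values, distinct_ratio=0.5, min_avg_len=30):
    """Classify one string column as 'categorical' or 'free_text'."""
    ratio = len(set(values)) / len(values)
    avg_len = sum(len(v) for v in values) / len(values)
    return "free_text" if ratio > distinct_ratio and avg_len > min_avg_len else "categorical"

def table_regime(string_columns):
    """Call a table free-text-dominant when most string columns are free text."""
    kinds = [column_kind(c) for c in string_columns]
    n_free = sum(k == "free_text" for k in kinds)
    return "free_text_dominant" if n_free > len(kinds) / 2 else "categorical_dominant"

colors = ["red", "blue", "red", "green", "blue", "red"]
reviews = [f"a long free-text customer review, number {i}, full of detail" for i in range(6)]
```

Under such a check, `colors` comes out categorical and `reviews` free text, which would steer the first toward lightweight encoders and the second toward LM encoders.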

Load-bearing premise

The 108 tables assembled for STRABLE capture the distribution of string and numeric features that appear in typical real-world supervised learning tasks.

What would settle it

A new collection of several dozen mixed tables on which end-to-end string-numeric models consistently and substantially outperform simple-embedding-plus-tabular-learner pipelines would falsify the central empirical recommendation.

Figures

Figures reproduced from arXiv: 2605.12292 by Alan Arazi, David Holzmüller, Eilam Shapira, Félix Lefebvre, Frank Hutter, Gaël Varoquaux, Gioia Blayer, Lennart Purucker, Marine Le Morvan, Myung Jun Kim, Roi Reichart.

Figure 1: (a) STRABLE (solid) vs. OpenML (dashed); cardinality and string length aggregate over string columns. (b) Performance for Num-only, Str-only and full table (Num+Str) by learner. (0.5%), with the remaining 49.45% being plain Categoricals. Half (50.55%) consist of modalities that string-excluding and string-flattening benchmarks typically ignore or destroy (e.g.: PMLB ignores Names, Structured Codes, Free Te…
Figure 2: Post-processing affects LLM-based embeddings, especially for decoder-only models. Average score across 108 tables and five learners for 7 LM encoders under three post-processing variants. Each panel is one encoder; bars show the mean score under default 30-PCA (blue), standard scaling before 30-PCA (orange), and direct slicing of the first 30 raw embedding dimensions (green). Percentages indicate relative …
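The three post-processing variants compared in Figure 2 can be sketched with a toy PCA. The embedding matrix is random and `k = 8` is chosen for brevity (the paper uses 30 components on real LM embeddings):

```python
import numpy as np

def pca(X, k):
    """Project centered X onto its top-k principal directions via SVD."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T

def standard_scale(X):
    """Zero-mean, unit-variance scaling per embedding dimension."""
    return (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-12)

rng = np.random.default_rng(0)
emb = rng.normal(size=(100, 64))  # stand-in for LM string embeddings

k = 8
variants = {
    "pca": pca(emb, k),                             # default k-PCA
    "scale_then_pca": pca(standard_scale(emb), k),  # scale, then PCA
    "slice": emb[:, :k],                            # first k raw dimensions
}
```

All three produce a table with `k` numeric columns, which is why the same downstream learner can be compared across variants.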
Figure 3: Critical difference diagram for encoder-learner pipelines. Pipelines’ average rank across the 108 datasets are shown in parentheses; lower is better. Dashed lines are E2E, continuous lines are Modular. Pipelines connected by horizontal bars are not statistically distinguishable at the indicated level (test statistic in Appendix D.1). Modular pipelines cluster at the top of the ranking. Pipelines marked wit…
Figure 4: Pareto-optimality plot and benchmark ranking stability. (a) Each point is a pipeline, colored by encoder on the left and by learner on the right. The dotted line is the pareto-optimality frontier. Encoders explain much of the runtime: for a given encoder, performance varies depending on the learner while runtime varies less (aside from tuning or not). Simple and advanced learners benefit differently from v…
Figure 5: (a) Kendall-τ correlation between application-specific subsets and the full benchmark (numbers in parentheses show the number of tables per application field). (b) Each row reports Kendall-τ between STRABLE’s ranking and the ranking of the opposite data preparation (e.g., applying feature engineering or missing-value imputation, which STRABLE does not; or removing target transformations and subsampling, wh…
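The stability check in Figure 5 rests on Kendall's τ between two pipeline rankings. A minimal tie-free version, with toy ranks rather than the paper's numbers, looks like:

```python
from itertools import combinations

def kendall_tau(a, b):
    """Kendall rank correlation for two equal-length, tie-free rank lists."""
    concordant = discordant = 0
    for i, j in combinations(range(len(a)), 2):
        s = (a[i] - a[j]) * (b[i] - b[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    return (concordant - discordant) / (len(a) * (len(a) - 1) / 2)

full_benchmark = [1, 2, 3, 4, 5]  # pipeline ranks on the full benchmark
held_out       = [1, 3, 2, 4, 5]  # ranks recomputed on held-out tables
tau = kendall_tau(full_benchmark, held_out)
# tau near 1 means the benchmark's ranking generalizes; near 0, it does not.
```

Here a single swapped pair out of ten yields τ = 0.8, which is the kind of agreement the paper's stability claim requires.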
Figure 6: Top-10 pipelines per leading string type. Datasets are grouped by their most frequent string type. In the Free Text regime large LLMs enter the top-10 paired with TabPFN-2.5; all other types mirror the global ranking (lightweight encoders at the top paired with TabPFN-2.5, and LM encoders paired with light learners like ExtraTrees). ConTextTab leads the Structured Code panel, plausibly aided by code-rich T…
Original abstract

Benchmarking tabular learning has revealed the benefit of dedicated architectures, pushing the state of the art. But real-world tables often contain string entries, beyond numbers, and these settings have been understudied due to a lack of a solid benchmarking suite. They lead to new research questions: Are dedicated learners needed, with end-to-end modeling of strings and numbers? Or does it suffice to encode strings as numbers, as with a categorical encoding? And if so, do the resulting tables resemble numerical tabular data, calling for the same learners? To enable these studies, we contribute STRABLE, a benchmarking corpus of 108 tables, all real-world learning problems with strings and numbers across diverse application fields. We run the first large-scale empirical study of tabular learning with strings, evaluating 445 pipelines. These pipelines span end-to-end architectures and modular pipelines, where strings are first encoded, then post-processed, and finally passed to a tabular learner. We find that, because most tables in the wild are categorical-dominant, advanced tabular learners paired with simple string embeddings achieve good predictions at low computational cost. On free-text-dominant tables, large LLM encoders become competitive. Their performance also appears sensitive to post-processing, with differences across LLM families. Finally, we show that STRABLE is a good set of tables to study "string tabular" learning as it leads to generalizable pipeline rankings that are close to the oracle rankings. We thus establish STRABLE as a foundation for research on tabular learning with strings, an important yet understudied area.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces STRABLE, a benchmark of 108 real-world tables containing both strings and numerical features across diverse fields. It evaluates 445 pipelines combining various string encoders (simple embeddings, categorical encodings, and LLM-based) with tabular learners, finding that advanced tabular models paired with simple string embeddings perform well on categorical-dominant tables at low cost, while large LLM encoders become competitive on free-text-dominant tables. The work also validates that pipeline rankings on STRABLE are close to oracle rankings, positioning the benchmark as a foundation for research on tabular learning with strings.

Significance. If the empirical patterns hold and the tables are representative, this provides a much-needed dedicated benchmark and practical guidance for handling mixed string-numeric data, an understudied but common real-world setting. It highlights efficient, low-cost approaches for the majority of cases and identifies when more expensive LLM encoders add value, potentially influencing both practitioner choices and future model development in tabular ML.

major comments (2)
  1. [Benchmark construction / data collection section] The claim that 'most tables in the wild are categorical-dominant' and the resulting pipeline recommendations rest on the 108 tables being representative, yet the manuscript provides no explicit sampling frame, stratification by domain or string-type ratio, or distributional comparison against reference corpora such as OpenML or Kaggle. This is load-bearing for generalizing the findings beyond the specific benchmark.
  2. [Experimental evaluation / results section] While 445 pipelines are evaluated, details on the exact metrics (e.g., specific loss functions or evaluation measures for each task type) and any statistical significance testing of performance differences are insufficiently described, weakening the robustness of conclusions about when simple embeddings suffice versus when LLMs compete.
minor comments (2)
  1. [Introduction / abstract] Clarify the quantitative thresholds or definitions used to classify tables as 'categorical-dominant' versus 'free-text-dominant' with an explicit criterion or table in the main text.
  2. [Figures and results] Figure captions and legends should include more detail on the exact comparison being shown (e.g., which encoders and learners) to improve standalone readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback, which helps strengthen the presentation and robustness of our STRABLE benchmark. We address each major comment below with planned revisions to the manuscript.

Point-by-point responses
  1. Referee: The claim that 'most tables in the wild are categorical-dominant' and the resulting pipeline recommendations rest on the 108 tables being representative, yet the manuscript provides no explicit sampling frame, stratification by domain or string-type ratio, or distributional comparison against reference corpora such as OpenML or Kaggle. This is load-bearing for generalizing the findings beyond the specific benchmark.

    Authors: We agree that justifying representativeness is important for generalizing observations about categorical-dominant tables. In the revised manuscript, we will expand the benchmark construction section with a detailed account of our data collection process: tables were sourced from public repositories including Kaggle, UCI, and domain-specific datasets (e.g., healthcare, finance, e-commerce), filtered for mixed string-numeric content with at least one string column and sufficient size for learning tasks. We will add stratification details by domain and string-type ratio (e.g., proportion of free-text vs. categorical strings), along with distributional comparisons such as column-type histograms against OpenML and Kaggle subsets where direct access permits. While a complete probabilistic sampling frame for all real-world tables remains challenging without a universal registry, these additions will better support our claims and the resulting pipeline recommendations. revision: yes

  2. Referee: While 445 pipelines are evaluated, details on the exact metrics (e.g., specific loss functions or evaluation measures for each task type) and any statistical significance testing of performance differences are insufficiently described, weakening the robustness of conclusions about when simple embeddings suffice versus when LLMs compete.

    Authors: We concur that greater specificity on metrics and statistical testing will improve the reliability of our conclusions. The revised experimental evaluation section will explicitly state: classification tasks use accuracy, macro-F1, and AUC-ROC with cross-entropy loss; regression tasks use MSE and R^2 with MSE loss. All results are obtained via 5-fold cross-validation with fixed random seeds. We will add statistical significance testing using paired Wilcoxon signed-rank tests (with Bonferroni correction for multiple comparisons) to evaluate differences between simple-embedding pipelines and LLM-based ones, reporting p-values when discussing when simple embeddings suffice versus when LLMs become competitive. These updates will be incorporated into both the methods and results sections. revision: yes
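The correction step the rebuttal proposes is simple to state. A sketch with made-up p-values follows; the real ones would come from paired Wilcoxon signed-rank tests between pipelines, as the authors describe.

```python
# Bonferroni correction for m pairwise pipeline comparisons: the
# per-test significance threshold shrinks from alpha to alpha / m.
# The p-values below are toy numbers, not results from the paper.

def bonferroni_significant(p_values, alpha=0.05):
    """Flag which of m comparisons survive the Bonferroni-corrected threshold."""
    m = len(p_values)
    return [p < alpha / m for p in p_values]

p_raw = [0.001, 0.020, 0.040]          # three pipeline-vs-pipeline comparisons
flags = bonferroni_significant(p_raw)  # threshold drops to 0.05 / 3 ≈ 0.0167
```

Only the first comparison survives here: two results that look significant at α = 0.05 in isolation do not after correcting for three tests.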

Circularity Check

0 steps flagged

No circularity: empirical benchmarking study with no derivations or self-referential reductions

Full rationale

This is a pure empirical benchmarking paper that collects 108 real-world tables and evaluates 445 pipelines via held-out performance. No equations, fitted parameters renamed as predictions, ansatzes, or uniqueness theorems appear in the manuscript. All central claims (e.g., simple embeddings suffice for categorical-dominant tables) are direct observations from the experimental results on the collected corpus rather than reductions to prior self-citations or input definitions. The study's conclusions therefore rest on external benchmark measurements rather than on self-referential reasoning.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical derivations or invented entities; the work rests on the empirical collection of tables and standard ML evaluation practices.

pith-pipeline@v0.9.0 · 5620 in / 989 out tokens · 48971 ms · 2026-05-13T05:07:26.942195+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

178 extracted references · 178 canonical work pages · 4 internal anchors

  1. [1]

    TabSTAR: A Tabular Foundation Model for Tabular Data with Text Fields

    Alan Arazi, Eilam Shapira, and Roi Reichart. TabSTAR: A Tabular Foundation Model for Tabular Data with Text Fields. In D. Belgrave, C. Zhang, H. Lin, R. Pas- canu, P. Koniusz, M. Ghassemi, and N. Chen, editors,Advances in Neural Informa- tion Processing Systems, volume 38, pages 172108–172161. Curran Associates, Inc.,

  2. [2]

    URL https://proceedings.neurips.cc/paper_files/paper/2025/file/ faf6e23e198314c7728eaa6ac44ae079-Paper-Conference.pdf

  3. [3]

    Openml benchmark- ing suites

    Bernd Bischl, Giuseppe Casalicchio, Matthias Feurer, Pieter Gijsbers, Frank Hutter, Michel Lang, Rafael Gomes Mantovani, Jan N van Rijn, and Joaquin Vanschoren. Openml benchmark- ing suites. InProceedings of the NeurIPS 2021 Datasets and Benchmarks Track, 2021

  4. [4]

    Openml: Insights from 10 years and more than a thousand papers.Patterns, 2025

    Bernd Bischl, Giuseppe Casalicchio, Taniya Das, Matthias Feurer, Sebastian Fischer, Pieter Gijsbers, Subhaditya Mukherjee, Andreas C Müller, László Németh, Luis Oala, et al. Openml: Insights from 10 years and more than a thousand papers.Patterns, 2025

  5. [5]

    Encoding high-cardinality string categorical variables

    Patricio Cerda and Gaël Varoquaux. Encoding high-cardinality string categorical variables. IEEE Transactions on Knowledge and Data Engineering, 34(3):1164–1176, 2020

  6. [6]

    In: Krishnapuram, B

    Tianqi Chen and Carlos Guestrin. Xgboost: A scalable tree boosting system. InProceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’16, page 785–794. ACM, August 2016. doi: 10.1145/2939672.2939785

  7. [7]

    On multiple comparisons procedures.Technical Report LA-7677-MS, Los Alamos Scientific Laboratory, 1979

    William J Conover and Ronald L Iman. On multiple comparisons procedures.Technical Report LA-7677-MS, Los Alamos Scientific Laboratory, 1979

  8. [8]

    Data prep still dominates data scientists’ time, sur- vey finds, 2020

    Datanami. Data prep still dominates data scientists’ time, sur- vey finds, 2020. URL https://www.datanami.com/2020/07/06/ data-prep-still-dominates-data-scientists-time-survey-finds/

  9. [9]

    In: 2009 IEEE Conference on Computer Vision and Pattern Recognition

    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large- scale hierarchical image database. In2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, 2009. doi: 10.1109/CVPR.2009.5206848

  10. [10]

    A formula for the gini coefficient.The Review of Economics and Statistics, 61 (1):146–49, 1979

    Robert Dorfman. A formula for the gini coefficient.The Review of Economics and Statistics, 61 (1):146–49, 1979. URL https://EconPapers.repec.org/RePEc:tpr:restat:v:61:y: 1979:i:1:p:146-49. 10

  11. [11]

    The llama 3 herd of models.arXiv e-prints, pages arXiv–2407, 2024

    Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models.arXiv e-prints, pages arXiv–2407, 2024

  12. [12]

    URL https://arxiv.org/abs/ 2502.13595

    Kenneth Enevoldsen, Isaac Chung, Imene Kerboua, Márton Kardos, Ashwin Mathur, David Stap, Jay Gala, Wissam Siblini, Dominik Krzemi ´nski, Genta Indra Winata, Saba Sturua, Saiteja Utpala, Mathieu Ciancone, Marion Schaeffer, Gabriel Sequeira, Diganta Misra, Shreeya Dhakal, Jonathan Rystrøm, Roman Solomatin, Ömer Ça˘gatan, Akash Kundu, Martin Bernstorff, Shi...

  13. [13]

    Tabarena: A living benchmark for machine learning on tabular data.Advances in Neural Information Processing Systems, 39, 2025

    Nick Erickson, Lennart Purucker, Andrej Tschalzev, David Holzmüller, Prateek Mutalik Desai, David Salinas, and Frank Hutter. Tabarena: A living benchmark for machine learning on tabular data.Advances in Neural Information Processing Systems, 39, 2025

  14. [14]

    How contextual are contextualized word representations? comparing the geometry of bert, elmo, and gpt-2 embeddings, 2019

    Kawin Ethayarajh. How contextual are contextualized word representations? comparing the geometry of bert, elmo, and gpt-2 embeddings, 2019. URL https://arxiv.org/abs/1909. 00512

  15. [15]

    Statistical Methods Related to the Law of the Iterated Logarithm

    Milton Friedman. A Comparison of Alternative Tests of Significance for the Problem of m Rankings.The Annals of Mathematical Statistics, 11(1):86 – 92, 1940. doi: 10.1214/aoms/ 1177731944

  16. [16]

    Representation degeneration problem in training natural language generation models.arXiv preprint arXiv:1907.12009, 2019

    Jun Gao, Di He, Xu Tan, Tao Qin, Liwei Wang, and Tie-Yan Liu. Representation degeneration problem in training natural language generation models, 2019. URL https://arxiv.org/ abs/1907.12009

  17. [17]

    Large scale transfer learning for tabular data via language modeling.Advances in Neural Information Processing Systems, 37:45155– 45205, 2024

    Josh Gardner, Juan C Perdomo, and Ludwig Schmidt. Large scale transfer learning for tabular data via language modeling.Advances in Neural Information Processing Systems, 37:45155– 45205, 2024

  18. [18]

    Extremely randomized trees.Machine learning, 63(1):3–42, 2006

    Pierre Geurts, Damien Ernst, and Louis Wehenkel. Extremely randomized trees.Machine learning, 63(1):3–42, 2006

  19. [19]

    Amlb: an automl benchmark.Journal of Machine Learning Research, 25(101):1–65, 2024

    Pieter Gijsbers, Marcos LP Bueno, Stefan Coors, Erin LeDell, Sébastien Poirier, Janek Thomas, Bernd Bischl, and Joaquin Vanschoren. Amlb: an automl benchmark.Journal of Machine Learning Research, 25(101):1–65, 2024

  20. [20]

    Tabm: Advancing tabular deep learning with parameter-efficient ensembling

    Yury Gorishniy, Akim Kotelnikov, and Artem Babenko. Tabm: Advancing tabular deep learning with parameter-efficient ensembling. InThe Thirteenth International Conference on Learning Representations, 2025

  21. [21]

    The illusion of generalization: Re-examining tabular language model evaluation, 2026

    Aditya Gorla and Ratish Puduppully. The illusion of generalization: Re-examining tabular language model evaluation, 2026. URLhttps://arxiv.org/abs/2602.04031

  22. [22]

    Why do tree-based models still outperform deep learning on typical tabular data?Advances in neural information processing systems, 35:507–520, 2022

    Léo Grinsztajn, Edouard Oyallon, and Gaël Varoquaux. Why do tree-based models still outperform deep learning on typical tabular data?Advances in neural information processing systems, 35:507–520, 2022. 11

  23. [23]

    Vectorizing string entries for data processing on tables: when are larger language models better?arXiv preprint arXiv:2312.09634, 2023

    Léo Grinsztajn, Edouard Oyallon, Myung Jun Kim, and Gaël Varoquaux. Vectorizing string entries for data processing on tables: when are larger language models better?, 2023. URL https://arxiv.org/abs/2312.09634

  24. [24]

    TabPFN-2.5: Advancing the State of the Art in Tabular Foundation Models

    Léo Grinsztajn, Klemens Flöge, Oscar Key, Felix Birkel, Philipp Jund, Brendan Roof, Benjamin Jäger, Dominik Safaric, Simone Alessi, Adrian Hayler, Mihir Manium, Rosen Yu, Felix Jablon- ski, Shi Bin Hoo, Anurag Garg, Jake Robertson, Magnus Bühler, Vladyslav Moroshan, Lennart Purucker, Clara Cornu, Lilly Charlotte Wehrhahn, Alessandro Bonetto, Bernhard Schö...

  25. [25]

    The emerging science of machine learning benchmarks

    Moritz Hardt. The emerging science of machine learning benchmarks. Online at https: //mlbenchmarks.org, 2025. Manuscript

  26. [26]

    Springer New York, New York, NY ,

    Winston Haynes.Holm’s Method, pages 902–902. Springer New York, New York, NY ,

  27. [27]

    doi: 10.1007/978-1-4419-9863-7_1214

    ISBN 978-1-4419-9863-7. doi: 10.1007/978-1-4419-9863-7_1214. URL https: //doi.org/10.1007/978-1-4419-9863-7_1214

  28. [28]

    Hoerl and Robert W

    Arthur E. Hoerl and Robert W. Kennard. Ridge regression: Biased estimation for nonorthogonal problems.Technometrics, 12(1):55–67, 1970. ISSN 00401706. URL http://www.jstor. org/stable/1267351

  29. [29]

    Accurate predictions on small data with a tabular foundation model.Nature, 637(8045):319–326, 2025

    Noah Hollmann, Samuel Müller, Lennart Purucker, Arjun Krishnakumar, Max Körfer, Shi Bin Hoo, Robin Tibor Schirrmeister, and Frank Hutter. Accurate predictions on small data with a tabular foundation model.Nature, 637(8045):319–326, 2025

  30. [30]

    Better by default: Strong pre-tuned mlps and boosted trees on tabular data.Advances in Neural Information Processing Systems, 37:26577–26658, 2024

    David Holzmüller, Léo Grinsztajn, and Ingo Steinwart. Better by default: Strong pre-tuned mlps and boosted trees on tabular data.Advances in Neural Information Processing Systems, 37:26577–26658, 2024

  31. [31]

    Principal component analysis: a review and recent developments

    Ian T. Jolliffe and Jorge Cadima. Principal component analysis: a review and recent develop- ments.Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineer- ing Sciences, 374(2065):20150202, 04 2016. ISSN 1364-503X. doi: 10.1098/rsta.2015.0202. URLhttps://doi.org/10.1098/rsta.2015.0202

  32. [32]

    A new measure of rank correlation.Biometrika, 30(1-2):81–93, 1938

    Maurice G Kendall. A new measure of rank correlation.Biometrika, 30(1-2):81–93, 1938

  33. [33]

    Carte: Pretraining and transfer for tabular learning.ICML, 2024

    Myung Jun Kim, Léo Grinsztajn, and Gaël Varoquaux. Carte: Pretraining and transfer for tabular learning.ICML, 2024

  34. [34]

    Table foundation models: on knowledge pre-training for tabular learning.TMLR, 2025

    Myung Jun Kim, Félix Lefebvre, Gaëtan Brison, Alexandre Perez-Lebel, and Gaël Varoquaux. Table foundation models: on knowledge pre-training for tabular learning.TMLR, 2025

  35. [35]

    Pmlbmini: A tabular classification benchmark suite for data-scarce applications

    Ricardo Knauer, Marvin Grimm, and Erik Rodner. Pmlbmini: A tabular classification benchmark suite for data-scarce applications. InAutoML Conference 2024 (ABCD Track), 2024

  36. [36]

    Springer, 2013

    Max Kuhn and Kjell Johnson.Applied Predictive Modeling. Springer, 2013. ISBN 978-1-4614- 6848-6

  37. [37]

    Matryoshka representation learning, 2024

    Aditya Kusupati, Gantavya Bhatt, Aniket Rege, Matthew Wallingford, Aditya Sinha, Vivek Ramanujan, William Howard-Snyder, Kaifeng Chen, Sham Kakade, Prateek Jain, and Ali Farhadi. Matryoshka representation learning, 2024. URL https://arxiv.org/abs/2205. 13147

  38. [38]

    LeCun, L

    Y . Lecun, L. Bottou, Y . Bengio, and P. Haffner. Gradient-based learning applied to document recognition.Proceedings of the IEEE, 86(11):2278–2324, 1998. doi: 10.1109/5.726791

  39. [39]

    When do neural nets outperform boosted trees on tabular data?Advances in Neural Information Processing Systems, 36:76336–76369, 2023

    Duncan McElfresh, Sujay Khandagale, Jonathan Valverde, Vishak Prasad C, Ganesh Ramakr- ishnan, Micah Goldblum, and Colin White. When do neural nets outperform boosted trees on tabular data?Advances in Neural Information Processing Systems, 36:76336–76369, 2023

  40. [40]

    A preprocessing scheme for high-cardinality categorical attributes in classification and prediction problems.ACM SIGKDD explorations newsletter, 3(1):27–32, 2001

    Daniele Micci-Barreca. A preprocessing scheme for high-cardinality categorical attributes in classification and prediction problems.ACM SIGKDD explorations newsletter, 3(1):27–32, 2001. 12

  41. [41]

    Advances in pre-training distributed word representations

    Tomas Mikolov, Edouard Grave, Piotr Bojanowski, Christian Puhrsch, and Armand Joulin. Advances in pre-training distributed word representations. InProceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan, May 2018. European Language Resources Association (ELRA)

  42. [42]

    Towards benchmarking foundation models for tabular data with text, 2025

    Martin Mráz, Breenda Das, Anshul Gupta, Lennart Purucker, and Frank Hutter. Towards benchmarking foundation models for tabular data with text, 2025. URL https://arxiv.org/ abs/2507.07829

  43. [43]

    MTEB: Massive text embedding benchmark

    Niklas Muennighoff, Nouamane Tazi, Loic Magne, and Nils Reimers. MTEB: Massive text embedding benchmark. InProceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 2014–2037, Dubrovnik, Croatia, May 2023. Association for Computational Linguistics

  44. [44]

    Transformers can do bayesian inference, 2024

    Samuel Müller, Noah Hollmann, Sebastian Pineda Arango, Josif Grabocka, and Frank Hutter. Transformers can do bayesian inference, 2024. URL https://arxiv.org/abs/2112.10510

  45. [45]

    Olson, William La Cava, Patryk Orzechowski, Ryan J

    Randal S. Olson, William La Cava, Patryk Orzechowski, Ryan J. Urbanowicz, and Jason H. Moore. Pmlb: a large benchmark suite for machine learning evaluation and comparison. BioData Mining, 10(1):36, Dec 2017. ISSN 1756-0381. doi: 10.1186/s13040-017-0154-4

  46. [46]

    Scikit- learn: Machine learning in python.the Journal of machine Learning research, 12:2825–2830, 2011

    Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. Scikit- learn: Machine learning in python.the Journal of machine Learning research, 12:2825–2830, 2011

  47. [47]

    Catboost: unbiased boosting with categorical features.Advances in neural information processing systems, 31, 2018

    Liudmila Prokhorenkova, Gleb Gusev, Aleksandr V orobev, Anna Veronika Dorogush, and Andrey Gulin. Catboost: unbiased boosting with categorical features.Advances in neural information processing systems, 31, 2018

  48. [48]

    Tabicl: A tabular foundation model for in-context learning on large data

    Jingang Qu, David Holzmüller, Gaël Varoquaux, and Marine Le Morvan. Tabicl: A tabular foundation model for in-context learning on large data. InForty-second International Conference on Machine Learning, 2025

  49. [49]

    Tabiclv2: A better, faster, scalable, and open tabular foundation model, 2026

    Jingang Qu, David Holzmüller, Gaël Varoquaux, and Marine Le Morvan. Tabiclv2: A better, faster, scalable, and open tabular foundation model, 2026. URL https://arxiv.org/abs/ 2602.11139

  50. [50]

    Do imagenet classifiers generalize to imagenet? InInternational conference on machine learning, pages 5389–5400

    Benjamin Recht, Rebecca Roelofs, Ludwig Schmidt, and Vaishaal Shankar. Do imagenet classifiers generalize to imagenet? InInternational conference on machine learning, pages 5389–5400. PMLR, 2019

  51. [51]

    Sentence-bert: Sentence embeddings using siamese bert- networks

    Nils Reimers and Iryna Gurevych. Sentence-bert: Sentence embeddings using siamese bert- networks. InProceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 11 2019

  52. [52]

    A meta-analysis of overfitting in machine learning.Advances in neural information processing systems, 32, 2019

    Rebecca Roelofs, Vaishaal Shankar, Benjamin Recht, Sara Fridovich-Keil, Moritz Hardt, John Miller, and Ludwig Schmidt. A meta-analysis of overfitting in machine learning.Advances in neural information processing systems, 32, 2019

  53. [53]

    Tabred: Analyzing pitfalls and filling the gaps in tabular deep learning benchmarks

    Ivan Rubachev, Nikolay Kartashev, Yury Gorishniy, and Artem Babenko. Tabred: Analyzing pitfalls and filling the gaps in tabular deep learning benchmarks. InThe Thirteenth International Conference on Learning Representations, 2024

  54. [54]

    David Salinas and Nick Erickson. TabRepo: A large scale repository of tabular model evaluations and its AutoML applications. In AutoML Conference 2024 (ABCD Track), 2024.

  55. [55]

    scikit-learn developers. Importance of feature scaling. https://scikit-learn.org/stable/auto_examples/preprocessing/plot_scaling_importance.html, 2026. scikit-learn documentation, accessed April 2026.

  56. [56]

    Xingjian Shi, Jonas Mueller, Nick Erickson, Mu Li, and Alexander J. Smola. Benchmarking multimodal AutoML for tabular data with text fields, 2021. URL https://arxiv.org/abs/2111.02705.

  57. [57]

    Skrub. Skrub software. https://skrub-data.org, 2026.

  58. [58]

    Marco Spinaci, Marek Polewczyk, Maximilian Schambach, and Sam Thelin. ConTextTab: A semantics-aware tabular in-context learner. Advances in Neural Information Processing Systems, 39, 2025.

  59. [59]

    Michael Stonebraker and El Kindi Rezig. Machine learning and big data: What is important? IEEE Data Eng. Bull., 42(4):3–7, 2019

  60. [60]

    Anton Frederik Thielmann, Manish Kumar, Christoph Weisser, Arik Reuter, Benjamin Säfken, and Soheila Samiee. Mambular: A sequential model for tabular deep learning, 2025. URL https://arxiv.org/abs/2408.06291

  61. [61]

    Joaquin Vanschoren, Jan N Van Rijn, Bernd Bischl, and Luis Torgo. OpenML: networked science in machine learning. ACM SIGKDD Explorations Newsletter, 15(2):49–60, 2014.

  62. [62]

    Roman Vershynin. High-Dimensional Probability. Cambridge University Press, 2018.

  63. [63]

    Liane Vogel, Kavitha Srinivas, Niharika D'Souza, Sola Shirai, Oktie Hassanzadeh, and Horst Samulowitz. Towards universal tabular embeddings: A benchmark across data tasks, 2026. URL https://arxiv.org/abs/2604.21696.

  64. [64]

    Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder, and Furu Wei. Text embeddings by weakly-supervised contrastive pre-training. arXiv preprint arXiv:2212.03533, 2022

  65. [65]

    Frank Wilcoxon. Individual comparisons by ranking methods. Biometrics Bulletin, 1(6):80–83, 1945. ISSN 00994987. URL http://www.jstor.org/stable/3001968.

  67. [67]

    D.H. Wolpert and W.G. Macready. No free lunch theorems for optimization. IEEE Transactions on Evolutionary Computation, 1(1):67–82, 1997. doi: 10.1109/4235.585893.

  68. [68]

    Han-Jia Ye, Si-Yang Liu, Hao-Run Cai, Qi-Le Zhou, and De-Chuan Zhan. A closer look at deep learning methods on tabular datasets, 2025. URL https://arxiv.org/abs/2407.00956.

  69. [69]

    Guri Zabërgja, Arlind Kadra, Christian M. M. Frey, and Josif Grabocka. Tabular data: Is deep learning all you need?, 2025. URL https://arxiv.org/abs/2402.03970.

  70. [70]

    Dun Zhang, Ziyang Zeng, Yudong Zhou, and Shuyang Lu. Jasper-token-compression-600m technical report, 2025. URL https://arxiv.org/abs/2511.14405.

  71. [71]

    Yanzhao Zhang, Mingxin Li, Dingkun Long, Xin Zhang, Huan Lin, Baosong Yang, Pengjun Xie, An Yang, Dayiheng Liu, Junyang Lin, Fei Huang, and Jingren Zhou. Qwen3 Embedding: Advancing text embedding and reranking through foundation models. arXiv preprint arXiv:2506.05176, 2025.

  72. [72]

    ACA Federal Upper Limits 5 Price limits for multi-source drugs under the Medicaid program. The task is to predict the federal upper price limit

  73. [73]

    AI/ML Salaries6 Salary and basic information for workers in the machine learning and data science industry. The task is to predict worker salaries

  74. [74]

    Animal and Veterinary Event7 Health problems reported in animals following the use of drug products. The task is to predict the severity of clinical signs

  75. [75]

    Antenna Structure Registration8 FCC registration data for antenna structures. The task is to predict the height of the structures

  76. [76]

    Awarded Grants IMLS9 Grants awarded by the Institute of Museum and Library Services. The task is to predict the specific grant amount.
    5 https://www.medicaid.gov/medicaid/prescription-drugs/federal-upper-limit
    6 https://ai-jobs.net/salaries/download/salaries.csv
    7 https://open.fda.gov/apis/animalandveterinary/event/
    8 https://hifld-geoplatform.opendata.arcgis...

  77. [77]

    Beer Ratings10 Tasting profiles and consumer reviews for over 3,000 unique beers. The task is to predict overall review ratings

  78. [78]

    Broadband Availability11 Data on internet speed and availability across the US. The task is to predict the maximum available download speed

  79. [79]

    California Housing12 Median house values and demographics from the 1990 California census. The task is to predict median house prices

  80. [80]

    Child Adult Healthcare Quality13 Quality of care metrics for Medicaid and CHIP beneficiaries. The task is to predict healthcare performance scores.
