STRABLE: Benchmarking Tabular Machine Learning with Strings
Pith reviewed 2026-05-13 05:07 UTC · model grok-4.3
The pith
Most real-world tables mixing strings and numbers are categorical-dominant, so advanced tabular models paired with simple string embeddings deliver strong results at low cost, while large language model encoders become competitive only on free-text-dominant tables.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
STRABLE supplies 108 tables drawn from diverse real-world problems. On this corpus, modular pipelines that encode strings simply and then apply advanced tabular learners outperform or match end-to-end string-numeric architectures for the majority of tables, which turn out to be categorical-dominant. On free-text-dominant tables, large LLM encoders become competitive, yet their success depends on the choice of post-processing step. Pipeline rankings obtained on STRABLE stay close to oracle rankings computed on held-out tables, confirming that the benchmark supports generalizable conclusions about string tabular learning.
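The closeness of benchmark rankings to oracle rankings can be quantified with a rank correlation such as Kendall's tau. The sketch below is illustrative: the pipeline names and ranks are hypothetical, and this review does not state which agreement statistic the paper actually reports.

```python
from itertools import combinations

def kendall_tau(rank_a, rank_b):
    """Kendall rank correlation between two rankings of the same pipelines.

    rank_a, rank_b: dicts mapping pipeline name -> rank (1 = best).
    Returns tau in [-1, 1]; 1 means identical orderings.
    """
    items = list(rank_a)
    concordant = discordant = 0
    for x, y in combinations(items, 2):
        a = rank_a[x] - rank_a[y]
        b = rank_b[x] - rank_b[y]
        if a * b > 0:
            concordant += 1
        elif a * b < 0:
            discordant += 1
    n = len(items)
    return (concordant - discordant) / (n * (n - 1) / 2)

# Hypothetical rankings: benchmark-derived vs. held-out "oracle"
strable = {"tfidf+gbdt": 1, "llm+gbdt": 2, "end2end": 3, "onehot+mlp": 4}
oracle = {"tfidf+gbdt": 1, "llm+gbdt": 3, "end2end": 2, "onehot+mlp": 4}
print(round(kendall_tau(strable, oracle), 3))  # → 0.667
```

A tau near 1 on held-out tables is what "close to oracle rankings" means operationally: the benchmark's ordering of pipelines transfers to unseen data.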
What carries the argument
The STRABLE corpus of 108 mixed string-numeric tables together with the systematic comparison of modular encoding-plus-tabular pipelines against end-to-end architectures.
If this is right
- For categorical-dominant tables, practitioners can obtain near-optimal accuracy by pairing any strong tabular learner with lightweight string encoders instead of training large joint models.
- On free-text-dominant tables, switching to a large language model encoder is worthwhile, but the choice of post-processing layer must be validated because it affects relative performance across encoder families.
- Benchmark rankings derived from STRABLE can be trusted to predict which pipelines will perform well on new tables of the same kind.
- Future tabular learning research should treat string encoding as a first-class design choice rather than an afterthought.
Where Pith is reading between the lines
- The categorical-versus-free-text distinction offers a practical rule of thumb for selecting an encoding strategy before training begins.
- Existing tabular benchmarks that contain only numeric columns may systematically underestimate the value of simple string handling techniques.
- Extending STRABLE with tables that contain more complex string structures such as nested JSON or long documents would test whether the current conclusions continue to hold.
Load-bearing premise
The 108 tables assembled for STRABLE capture the distribution of string and numeric features that appear in typical real-world supervised learning tasks.
What would settle it
A new collection of several dozen mixed tables on which end-to-end string-numeric models consistently and substantially outperform simple-embedding-plus-tabular-learner pipelines would falsify the central empirical recommendation.
Original abstract
Benchmarking tabular learning has revealed the benefit of dedicated architectures, pushing the state of the art. But real-world tables often contain string entries, beyond numbers, and these settings have been understudied due to a lack of a solid benchmarking suite. They lead to new research questions: Are dedicated learners needed, with end-to-end modeling of strings and numbers? Or does it suffice to encode strings as numbers, as with a categorical encoding? And if so, do the resulting tables resemble numerical tabular data, calling for the same learners? To enable these studies, we contribute STRABLE, a benchmarking corpus of 108 tables, all real-world learning problems with strings and numbers across diverse application fields. We run the first large-scale empirical study of tabular learning with strings, evaluating 445 pipelines. These pipelines span end-to-end architectures and modular pipelines, where strings are first encoded, then post-processed, and finally passed to a tabular learner. We find that, because most tables in the wild are categorical-dominant, advanced tabular learners paired with simple string embeddings achieve good predictions at low computational cost. On free-text-dominant tables, large LLM encoders become competitive. Their performance also appears sensitive to post-processing, with differences across LLM families. Finally, we show that STRABLE is a good set of tables to study "string tabular" learning as it leads to generalizable pipeline rankings that are close to the oracle rankings. We thus establish STRABLE as a foundation for research on tabular learning with strings, an important yet understudied area.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces STRABLE, a benchmark of 108 real-world tables containing both strings and numerical features across diverse fields. It evaluates 445 pipelines combining various string encoders (simple embeddings, categorical encodings, and LLM-based) with tabular learners, finding that advanced tabular models paired with simple string embeddings perform well on categorical-dominant tables at low cost, while large LLM encoders become competitive on free-text-dominant tables. The work also validates that pipeline rankings on STRABLE are close to oracle rankings, positioning the benchmark as a foundation for research on tabular learning with strings.
Significance. If the empirical patterns hold and the tables are representative, this provides a much-needed dedicated benchmark and practical guidance for handling mixed string-numeric data, an understudied but common real-world setting. It highlights efficient, low-cost approaches for the majority of cases and identifies when more expensive LLM encoders add value, potentially influencing both practitioner choices and future model development in tabular ML.
Major comments (2)
- [Benchmark construction / data collection section] The claim that 'most tables in the wild are categorical-dominant' and the resulting pipeline recommendations rest on the 108 tables being representative, yet the manuscript provides no explicit sampling frame, stratification by domain or string-type ratio, or distributional comparison against reference corpora such as OpenML or Kaggle. This is load-bearing for generalizing the findings beyond the specific benchmark.
- [Experimental evaluation / results section] While 445 pipelines are evaluated, details on the exact metrics (e.g., specific loss functions or evaluation measures for each task type) and any statistical significance testing of performance differences are insufficiently described, weakening the robustness of conclusions about when simple embeddings suffice versus when LLMs compete.
Minor comments (2)
- [Introduction / abstract] Clarify the quantitative thresholds or definitions used to classify tables as 'categorical-dominant' versus 'free-text-dominant' with an explicit criterion or table in the main text.
- [Figures and results] Figure captions and legends should include more detail on the exact comparison being shown (e.g., which encoders and learners) to improve standalone readability.
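The explicit criterion requested in the first minor comment could take a form like the heuristic below. The cutoffs and the majority rule are illustrative assumptions, not the paper's definitions.

```python
# Plausible heuristic (NOT the paper's): a string column counts as "free text"
# when its unique-value ratio is high and entries are multi-word; a table is
# free-text-dominant when most of its string columns are free text.
def column_kind(values, unique_ratio_cutoff=0.5, avg_words_cutoff=3.0):
    unique_ratio = len(set(values)) / len(values)
    avg_words = sum(len(v.split()) for v in values) / len(values)
    if unique_ratio > unique_ratio_cutoff and avg_words > avg_words_cutoff:
        return "free_text"
    return "categorical"

def table_dominance(string_columns):
    kinds = [column_kind(col) for col in string_columns]
    free = kinds.count("free_text")
    return "free-text-dominant" if free > len(kinds) / 2 else "categorical-dominant"

reviews = ["great product would buy again",
           "arrived broken very disappointed",
           "does exactly what it says on the tin",
           "ok for the price I guess"]
colors = ["red", "blue", "red", "green"]
print(table_dominance([reviews, colors]))   # → categorical-dominant
print(table_dominance([reviews, reviews]))  # → free-text-dominant
```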
Simulated Author's Rebuttal
We thank the referee for their constructive feedback, which helps strengthen the presentation and robustness of our STRABLE benchmark. We address each major comment below with planned revisions to the manuscript.
Point-by-point responses
Referee: The claim that 'most tables in the wild are categorical-dominant' and the resulting pipeline recommendations rest on the 108 tables being representative, yet the manuscript provides no explicit sampling frame, stratification by domain or string-type ratio, or distributional comparison against reference corpora such as OpenML or Kaggle. This is load-bearing for generalizing the findings beyond the specific benchmark.
Authors: We agree that justifying representativeness is important for generalizing observations about categorical-dominant tables. In the revised manuscript, we will expand the benchmark construction section with a detailed account of our data collection process: tables were sourced from public repositories including Kaggle, UCI, and domain-specific datasets (e.g., healthcare, finance, e-commerce), filtered for mixed string-numeric content with at least one string column and sufficient size for learning tasks. We will add stratification details by domain and string-type ratio (e.g., proportion of free-text vs. categorical strings), along with distributional comparisons such as column-type histograms against OpenML and Kaggle subsets where direct access permits. While a complete probabilistic sampling frame for all real-world tables remains challenging without a universal registry, these additions will better support our claims and the resulting pipeline recommendations.
Revision: yes
Referee: While 445 pipelines are evaluated, details on the exact metrics (e.g., specific loss functions or evaluation measures for each task type) and any statistical significance testing of performance differences are insufficiently described, weakening the robustness of conclusions about when simple embeddings suffice versus when LLMs compete.
Authors: We concur that greater specificity on metrics and statistical testing will improve the reliability of our conclusions. The revised experimental evaluation section will explicitly state: classification tasks use accuracy, macro-F1, and AUC-ROC with cross-entropy loss; regression tasks use MSE and R^2 with MSE loss. All results are obtained via 5-fold cross-validation with fixed random seeds. We will add statistical significance testing using paired Wilcoxon signed-rank tests (with Bonferroni correction for multiple comparisons) to evaluate differences between simple-embedding pipelines and LLM-based ones, reporting p-values when discussing when simple embeddings suffice versus when LLMs become competitive. These updates will be incorporated into both the methods and results sections.
Revision: yes
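The testing protocol the authors propose can be illustrated with SciPy. The per-table scores below are synthetic stand-ins; real inputs would be one paired metric value per benchmark table for each pipeline under comparison.

```python
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(0)
n_tables = 30                                         # one paired score per table
simple = rng.uniform(0.7, 0.9, size=n_tables)         # simple-embedding pipeline
llm = simple + rng.normal(0.03, 0.01, size=n_tables)  # hypothetical LLM pipeline

n_comparisons = 2                        # e.g., two LLM families tested in total
stat, p = wilcoxon(llm, simple)          # paired Wilcoxon signed-rank test
p_bonf = min(1.0, p * n_comparisons)     # Bonferroni: scale p by number of tests
print("significant" if p_bonf < 0.05 else "not significant")
```

Pairing by table matters here: per-table difficulty varies widely, and a paired rank test removes that variance before asking whether one pipeline family consistently beats the other.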
Circularity Check
No circularity: empirical benchmarking study with no derivations or self-referential reductions
Full rationale
This is a pure empirical benchmarking paper that collects 108 real-world tables and evaluates 445 pipelines via held-out performance. No equations, fitted parameters renamed as predictions, ansatzes, or uniqueness theorems appear in the manuscript. All central claims (e.g., simple embeddings suffice for categorical-dominant tables) are direct observations from the experimental results on the collected corpus rather than reductions to prior self-citations or input definitions. The study is therefore self-contained against external benchmarks.