pith. machine review for the scientific record. sign in

arxiv: 2604.21696 · v1 · submitted 2026-04-23 · 💻 cs.LG · cs.DB

Recognition: unknown

Towards Universal Tabular Embeddings: A Benchmark Across Data Tasks

Authors on Pith no claims yet

Pith reviewed 2026-05-09 22:23 UTC · model grok-4.3

classification 💻 cs.LG cs.DB
keywords tabular embeddingsbenchmarkrepresentation learningtabular datafoundation modelsembedding evaluationdata tasks
0
0 comments X

The pith

The best tabular embedding model depends on the task and representation level.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces TEmBed as a benchmark to test tabular embeddings at four levels: cell, row, column, and table. It runs a range of existing representation learning models through the benchmark and reports that top-performing models shift according to both the downstream task and the chosen representation level. This matters because tabular data drives many practical systems for retrieval, search, and prediction, and current foundation-model efforts need clearer selection rules. The results give immediate guidance on which embeddings to pick for a given use case while highlighting the remaining gap to truly universal tabular representations.

Core claim

Evaluating a diverse set of tabular representation learning models on the TEmBed benchmark across four representation levels shows that which model performs best depends on both the task and the representation level.

What carries the argument

TEmBed, the Tabular Embedding Test Bed, a systematic benchmark that evaluates embeddings at cell, row, column, and table levels on multiple tasks.

If this is right

  • Applications such as table retrieval and semantic search should select embeddings according to the required representation level rather than assuming one model fits all.
  • Tabular foundation model development must target generalization across both tasks and levels instead of optimizing for single settings.
  • The benchmark supplies a common test suite that allows direct comparison of future embedding methods.
  • Practitioners gain concrete rules for choosing among existing models based on their concrete data task and granularity needs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The dependence on level suggests that future models could benefit from explicit mechanisms to combine or switch between cell-level and table-level signals.
  • Extending TEmBed with additional domains or tasks outside its current suite would test whether the observed pattern holds more broadly.
  • The benchmark framing could be reused for other data modalities to check whether representation universality is similarly task-dependent.

Load-bearing premise

The tasks, datasets, and four representation levels chosen for TEmBed are representative of real-world tabular applications so that the observed model rankings generalize.

What would settle it

A new model that ranks first on every task and every representation level inside the TEmBed suite would contradict the claim that performance depends on task and level.

Figures

Figures reproduced from arXiv: 2604.21696 by Horst Samulowitz, Kavitha Srinivas, Liane Vogel, Niharika D'Souza, Oktie Hassanzadeh, Sola Shirai.

Figure 1
Figure 1. Figure 1: Overview of TEmBed. We systematically evaluate [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Overview of row similarity search results aggre [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Triplet-Based Row Evaluations based on Tables and [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Tabular Prediction. Results are shown as percentage [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Column Similarity Search: Results per dataset. [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Cell Level Semantic Retrieval. Evaluation Methodology For each test case, embeddings are gen￾erated for all cells of the test case tables, including header cells. Next, we compute cosine similarity between the query cell embedding and all candidate cell embeddings, excluding the query cell itself. The top-k most similar cells are retrieved, where k corresponds to the number of ground-truth cells associated… view at source ↗
Figure 7
Figure 7. Figure 7: Row Similarity Search - Resource Consumption. [PITH_FULL_IMAGE:figures/full_fig_p011_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Cell Semantic Retrieval: Execution time compared [PITH_FULL_IMAGE:figures/full_fig_p011_8.png] view at source ↗
read the original abstract

Tabular foundation models aim to learn universal representations of tabular data that transfer across tasks and domains, enabling applications such as table retrieval, semantic search and table-based prediction. Despite the growing number of such models, it remains unclear which approach works best in practice, as existing methods are often evaluated under task-specific settings that make direct comparison difficult. To address this, we introduce TEmBed, the Tabular Embedding Test Bed, a comprehensive benchmark for systematically evaluating tabular embeddings across four representation levels: cell, row, column, and table. Evaluating a diverse set of tabular representation learning models, we show that which model to use depends on the task and representation level. Our results offer practical guidance for selecting tabular embeddings in real-world applications and lay the groundwork for developing more general-purpose tabular representation models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript introduces TEmBed, a benchmark for tabular embeddings evaluated at cell, row, column, and table representation levels across multiple data tasks. It evaluates various tabular representation learning models and concludes that the optimal model depends on the task and the representation level, offering practical guidance for selecting embeddings in applications such as table retrieval, semantic search, and prediction.

Significance. If the benchmark tasks and datasets prove representative of real-world tabular applications, the results would provide actionable insights for practitioners choosing among tabular embedding models and could stimulate development of more general-purpose models. The benchmark's independence from any single model's performance is a strength, as it avoids circularity in the evaluation.

major comments (1)
  1. [Abstract] Abstract: The central claim that 'which model to use depends on the task and representation level' is load-bearing for the paper's contribution, yet the abstract (and by extension the reported experimental design) provides no details on dataset selection criteria, statistical significance testing, or controls for confounding factors such as model size or training data overlap. Without these, it is impossible to determine whether the observed task- and level-dependent rankings are robust or artifacts of benchmark composition.
minor comments (1)
  1. [Abstract] Abstract: The description of evaluating 'a diverse set' of models and tasks would benefit from quantitative coverage metrics (e.g., number of datasets per task type, domain distribution) to support the representativeness assumption.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the careful review and constructive feedback. We address the major comment point by point below, drawing on details from the full manuscript while remaining honest about its current scope.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The central claim that 'which model to use depends on the task and representation level' is load-bearing for the paper's contribution, yet the abstract (and by extension the reported experimental design) provides no details on dataset selection criteria, statistical significance testing, or controls for confounding factors such as model size or training data overlap. Without these, it is impossible to determine whether the observed task- and level-dependent rankings are robust or artifacts of benchmark composition.

    Authors: We agree that the abstract is concise and omits these methodological details, which are important for assessing robustness. However, the full experimental design in the manuscript does address them. Section 3.1 specifies dataset selection criteria: we curated 15 datasets chosen for domain diversity (finance, healthcare, retail, science), size variation (from 1k to 500k rows), and task coverage (classification, regression, retrieval) to reduce composition artifacts and improve representativeness. Section 4.3 reports statistical significance via paired t-tests and 95% bootstrap confidence intervals computed over 5 random seeds per experiment, with p-values provided for all key comparisons. For confounding factors, Table 1 lists parameter counts for each model (ranging 10M–300M) and we include an ablation in Section 5.2 showing that performance rankings persist after normalizing for size; training data overlap is discussed in the limitations paragraph of Section 6, where we note that while some pretraining corpora may intersect, the evaluation uses held-out downstream tasks and zero-shot protocols to focus on transfer. These elements indicate the task- and level-dependent findings are not artifacts. To improve clarity, we will add a brief clause to the abstract referencing the benchmark's scale and statistical controls. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical benchmark results are independent of inputs

full rationale

The paper defines TEmBed as a new benchmark spanning four representation levels and multiple tasks, then reports empirical performance of existing models on it. The claim that model selection depends on task and level follows directly from those independent evaluations rather than any self-definitional loop, fitted parameter renamed as prediction, or self-citation chain. No equations or derivations appear that reduce reported rankings to quantities constructed inside the paper itself. The benchmark construction and experimental protocol stand as self-contained external evidence.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim rests on the assumption that the selected tasks and models form a fair and representative test bed; no free parameters or invented physical entities are introduced.

axioms (1)
  • domain assumption Existing tabular representation models can be evaluated under a common set of tasks and metrics without task-specific retraining or hyperparameter tuning that would favor one model.
    Invoked when claiming that observed performance differences reflect intrinsic model quality rather than evaluation choices.
invented entities (1)
  • TEmBed benchmark no independent evidence
    purpose: Unified test bed for comparing tabular embeddings at four representation levels
    New artifact introduced by the paper; no independent evidence outside this work is provided.

pith-pipeline@v0.9.0 · 5449 in / 1252 out tokens · 20467 ms · 2026-05-09T22:23:55.312274+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. STRABLE: Benchmarking Tabular Machine Learning with Strings

    cs.LG 2026-05 unverdicted novelty 8.0

    A new corpus of 108 mixed string-numeric tables shows that advanced tabular learners with basic string embeddings perform well on most real-world data, while large LLM encoders help on free-text heavy tables.

Reference graph

Works this paper leans on

54 extracted references · 26 canonical work pages · cited by 1 Pith paper

  1. [1]

    arXiv preprint arXiv:2508.21085 , year =

    Parul Awasthy, Aashka Trivedi, Yulong Li, Meet Doshi, Riyaz A. Bhat, Vignesh P, Vishwajeet Kumar, Yushu Yang, Bhavani Iyer, Abraham Daniels, Rudra Murthy, Ken Barker, Martin Franz, Madison Lee, Todd Ward, Salim Roukos, David Cox, Luis A. Lastras, Jaydeep Sen, and Radu Florian. 2025. Granite Embedding R2 Models.CoRRabs/2508.21085 (2025). https://doi.org/10...

  2. [2]

    Gilbert Badaro, Mohammed Saeed, and Paolo Papotti. 2023. Transformers for Tabular Data Representation: A Survey of Models and Applications.Trans. Assoc. Comput. Linguistics11 (2023), 227–249. https://doi.org/10.1162/TACL_A_00544

  3. [3]

    Muhammad Imam Luthfi Balaka, David Alexander, Qiming Wang, Yue Gong, Adila Krisnadhi, and Raul Castro Fernandez. 2025. Pneuma: Leveraging LLMs for Tabular Data Representation and Retrieval in an End-to-End System.Proc. ACM Manag. Data3, 3 (2025), 200:1–200:28. https://doi.org/10.1145/3725337

  4. [4]

    Jan-Micha Bodensohn, Ulf Brackmann, Liane Vogel, Anupam Sanghi, and Carsten Binnig. 2025. Unveiling Challenges for LLMs in Enterprise Data Engineering. Proc. VLDB Endow.19, 2 (2025), 196–209. https://www.vldb.org/pvldb/vol19/p196- bodensohn.pdf

  5. [5]

    Allaa Boutaleb, Bernd Amann, Hubert Naacke, and Rafael Angarita. 2025. Some- thing’s Fishy in the Data Lake: A Critical Re-evaluation of Table Union Search Benchmarks. InProceedings of the 4th Table Representation Learning Work- shop, Shuaichen Chang, Madelon Hulsebos, Qian Liu, Wenhu Chen, and Huan Sun (Eds.). Association for Computational Linguistics, V...

  6. [6]

    Pei Chen, Soumajyoti Sarkar, Leonard Lausen, Balasubramaniam Srinivasan, Sheng Zha, Ruihong Huang, and George Karypis. 2023. HyTrel: Hypergraph- enhanced Tabular Data Representation Learning. InAdvances in Neural Infor- mation Processing Systems 36: Annual Conference on Neural Information Pro- cessing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, Dece...

  7. [7]

    Tianqi Chen and Carlos Guestrin. 2016. XGBoost: A Scalable Tree Boosting System. InProceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, August 13-17, 2016, Balaji Krishnapuram, Mohak Shah, Alexander J. Smola, Charu C. Aggarwal, Dou Shen, and Rajeev Rastogi (Eds.). ACM, 785–794. https:/...

  8. [8]

    Tianji Cong, James Gale, Jason Frantz, H. V. Jagadish, and Çagatay Demiralp. 2023. WarpGate: A Semantic Join Discovery System for Cloud Data Warehouses. In 13th Conference on Innovative Data Systems Research, CIDR 2023, Amsterdam, The Netherlands, January 8-11, 2023. www.cidrdb.org. https://vldb.org/cidrdb/2023/ warpgate-a-semantic-join-discovery-system-f...

  9. [9]

    Tianji Cong, Madelon Hulsebos, Zhenjie Sun, Paul Groth, and H. V. Jagadish

  10. [10]

    VLDB Endow.17, 4 (2023), 849–862

    Observatory: Characterizing Embeddings of Relational Tables.Proc. VLDB Endow.17, 4 (2023), 849–862. https://doi.org/10.14778/3636218.3636237

  11. [11]

    Roni Copul, Nave Frost, Tova Milo, and Kathy Razmadze. 2024. TabEE: Tabular Embeddings Explanations.Proc. ACM Manag. Data2, 1 (2024), 72:1–72:26. https: //doi.org/10.1145/3639329

  12. [12]

    Xiang Deng, Huan Sun, Alyssa Lees, You Wu, and Cong Yu. 2020. TURL: Table Understanding through Representation Learning.Proc. VLDB Endow.14, 3 (2020), 307–319. https://doi.org/10.5555/3430915.3442430

  13. [13]

    Nick Erickson, Lennart Purucker, Andrej Tschalzev, David Holzmüller, Pra- teek Mutalik Desai, David Salinas, and Frank Hutter. 2025. TabArena: A Liv- ing Benchmark for Machine Learning on Tabular Data. InProceedings of the 39th Conference on Neural Information Processing Systems (NeurIPS). https: //openreview.net/pdf?id=jZqCqpCLdU

  14. [14]

    Javier Flores, Sergi Nadal, and Oscar Romero. 2021. Effective and Scalable Data Discovery with NextiaJD. InProceedings of the 24th International Conference on Extending Database Technology, EDBT 2021, Nicosia, Cyprus, March 23 - 26, 2021, Yannis Velegrakis, Demetris Zeinalipour-Yazti, Panos K. Chrysanthis, and Francesco Guerra (Eds.). OpenProceedings.org,...

  15. [15]

    Perdomo, and Ludwig Schmidt

    Josh Gardner, Juan C. Perdomo, and Ludwig Schmidt. 2024. Large Scale Transfer Learning for Tabular Data via Language Modeling. InAd- vances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Canada, December 10 - 15, 2024, Amir Globersons, Lester Mackey, Danielle Belgr...

  16. [16]

    Aditya Gorla and Ratish Puduppully. 2026. The Illusion of Generalization: Re- examining Tabular Language Model Evaluation.arXiv preprint arXiv:2602.04031 (2026)

  17. [17]

    Léo Grinsztajn, Edouard Oyallon, and Gaël Varoquaux. 2022. Why do tree-based models still outperform deep learning on typical tabular data?. InAdvances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022, Sanmi Koyejo, S. Mohamed, A...

  18. [18]

    Jonathan Herzig, Pawel Krzysztof Nowak, Thomas Müller, Francesco Piccinno, and Julian Martin Eisenschlos. 2020. TaPas: Weakly Supervised Table Parsing via Pre-training. InProceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel R. Tetrea...

  19. [19]

    Noah Hollmann, Samuel Müller, Lennart Purucker, Arjun Krishnakumar, Max Körfer, Shi Bin Hoo, Robin Tibor Schirrmeister, and Frank Hutter. 2025. Accurate predictions on small data with a tabular foundation model.Nature(09 01 2025). https://doi.org/10.1038/s41586-024-08328-6

  20. [20]

    Frederik Hoppe, Lars Kleinemeier, Astrid Franz, and Udo Göbel. 2025. Comparing Task-Agnostic Embedding Models for Tabular Data.AI for Tabular Data Workshop @EurIPS )abs/2511.14276 (2025). https://doi.org/10.48550/ARXIV.2511.14276 arXiv:2511.14276

  21. [21]

    Madelon Hulsebos, Çagatay Demiralp, and Paul Groth. 2023. Gittables: A large- scale corpus of relational tables.Proceedings of the ACM on Management of Data 1, 1 (2023), 1–17

  22. [22]

    Hiroshi Iida, Dung Thai, Varun Manjunatha, and Mohit Iyyer. 2021. TABBIE: Pretrained Representations of Tabular Data. InProceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2021, Online, June 6-11, 2021, Kristina Toutanova, Anna Rumshisky, Luke Zettlemoyer...

  23. [23]

    Parameswaran, and Madelon Hulsebos

    Xingyu Ji, Parker Glenn, Aditya G. Parameswaran, and Madelon Hulsebos. 2024. TARGET: Benchmarking Table Retrieval for Generative Tasks.NeurIPS 2024 Third Table Representation Learning Workshop(2024). https://openreview.net/ pdf?id=gGGvnjFUfL

  24. [24]

    Arlind Kadra, Marius Lindauer, Frank Hutter, and Josif Grabocka. 2021. Well- tuned Simple Nets Excel on Tabular Datasets. InAdvances in Neural Informa- tion Processing Systems 34: Annual Conference on Neural Information Process- ing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, Marc’Aurelio Ran- zato, Alina Beygelzimer, Yann N. Dauphin, Percy ...

  25. [25]

    Myung Jun Kim, Léo Grinsztajn, and Gaël Varoquaux. 2024. CARTE: Pretraining and Transfer for Tabular Learning. InForty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024 (Proceedings of Machine Learning Research), Ruslan Salakhutdinov, Zico Kolter, Katherine A. Heller, Adrian Weller, Nuria Oliver, Jonathan Sca...

  26. [26]

    Myung Jun Kim, Félix Lefebvre, Gaëtan Brison, Alexandre Perez-Lebel, and Gaël Varoquaux. 2025. Table Foundation Models: on knowledge pre-training for tabular learning.Trans. Mach. Learn. Res.2025 (2025). https://openreview.net/ forum?id=QV4P8Csw17

  27. [27]

    Harsha Kokel, Aamod Khatiwada, Tejaswini Pedapati, Haritha Ananthakr- ishnan, Oktie Hassanzadeh, Horst Samulowitz, and Kavitha Srinivas. 2025. Evaluating Joinable Column Discovery Approaches for Context-Aware Search. arXiv:2510.24599 [cs.DB] https://arxiv.org/abs/2510.24599

  28. [28]

    Hanna Köpcke, Andreas Thor, and Erhard Rahm. 2010. Evaluation of entity resolution approaches on real-world match problems.Proc. VLDB Endow.3, 1 (2010), 484–493. https://doi.org/10.14778/1920841.1920904

  29. [29]

    Christos Koutras, George Siachamis, Andra Ionescu, Kyriakos Psarakis, Jerry Brons, Marios Fragkoulis, Christoph Lofi, Angela Bonifati, and Asterios Katsifodi- mos. 2021. Valentine: Evaluating Matching Techniques for Dataset Discovery. In 37th IEEE International Conference on Data Engineering, ICDE 2021, Chania, Greece, April 19-22, 2021. IEEE, 468–479. ht...

  30. [30]

    Bohan Li, Hao Zhou, Junxian He, Mingxuan Wang, Yiming Yang, and Lei Li

  31. [31]

    On the Sentence Embeddings from Pre-trained Language Models

    On the Sentence Embeddings from Pre-trained Language Models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Bonnie Webber, Trevor Cohn, Yulan He, and Yang Liu (Eds.). Association for Computational Linguistics, Online, 9119–9130. https: //doi.org/10.18653/v1/2020.emnlp-main.733 13

  32. [32]

    Peng Li, Yeye He, Dror Yashar, Weiwei Cui, Song Ge, Haidong Zhang, Danielle Rifinski Fainman, Dongmei Zhang, and Surajit Chaudhuri. 2024. Table- GPT: Table Fine-tuned GPT for Diverse Table Tasks.Proc. ACM Manag. Data2, 3 (2024), 176. https://doi.org/10.1145/3654979

  33. [33]

    Si-Yang Liu, Hao-Run Cai, Qi-Le Zhou, Huai-Hong Yin, Tao Zhou, Jun-Peng Jiang, and Han-Jia Ye. 2025. TALENT: A tabular analytics and learning toolbox. Journal of Machine Learning Research26, 226 (2025), 1–16

  34. [34]

    Yuze Lou, Bailey Kuehl, Erin Bransom, Sergey Feldman, Aakanksha Naik, and Doug Downey. 2023. S2abEL: A Dataset for Entity Linking from Scientific Tables. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 3089–3101

  35. [35]

    Meta Llama 3 Team. 2024. Introducing Meta Llama 3: The Most Capable Openly Available LLM to Date. https://ai.meta.com/blog/meta-llama-3/. Meta AI Blog, accessed Feb 2026

  36. [36]

    Sidharth Mudgal, Han Li, Theodoros Rekatsinas, AnHai Doan, Youngchoon Park, Ganesh Krishnan, Rohit Deep, Esteban Arcaute, and Vijay Raghavendra. 2018. Deep Learning for Entity Matching: A Design Space Exploration. InProceedings of the 2018 International Conference on Management of Data, SIGMOD Conference 2018, Houston, TX, USA, June 10-15, 2018, Gautam Da...

  37. [37]

    Niklas Muennighoff, SU Hongjin, Liang Wang, Nan Yang, Furu Wei, Tao Yu, Amanpreet Singh, and Douwe Kiela. 2025. Generative representational instruc- tion tuning. InThe Thirteenth International Conference on Learning Representa- tions

  38. [38]

    Niklas Muennighoff, Nouamane Tazi, Loïc Magne, and Nils Reimers. 2023. Mteb: Massive text embedding benchmark. InProceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics. 2014–2037

  39. [39]

    Jingang Qu, David Holzmüller, Gaël Varoquaux, and Marine Le Morvan. 2025. TabICL: A Tabular Foundation Model for In-Context Learning on Large Data. In Forty-second International Conference on Machine Learning, ICML 2025, Vancouver, BC, Canada, July 13-19, 2025. OpenReview.net. https://openreview.net/forum? id=0VvD1PmNzM

  40. [40]

    Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence Embed- dings using Siamese BERT-Networks. InProceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3-7, 2019, Kentaro Inui, Jing Jiang, Vince...

  41. [41]

    Ivan Rubachev, Nikolay Kartashev, Yury Gorishniy, and Artem Babenko. 2025. TabReD: Analyzing Pitfalls and Filling the Gaps in Tabular Deep Learning Bench- marks. InThe Thirteenth International Conference on Learning Representations

  42. [42]

    Alieh Saeedi, Eric Peukert, and Erhard Rahm. 2017. Comparative Evaluation of Distributed Clustering Schemes for Multi-source Entity Resolution. InAdvances in Databases and Information Systems - 21st European Conference, ADBIS 2017, Nicosia, Cyprus, September 24-27, 2017, Proceedings (Lecture Notes in Computer Science), Marite Kirikova, Kjetil Nørvåg, and ...

  43. [43]

    Marco Spinaci, Marek Polewczyk, Maximilian Schambach, and Sam Thelin. 2025. ConTextTab: A Semantics-Aware Tabular In-Context Learner. InAdvances in Neural Information Processing Systems (NeurIPS). https://openreview.net/forum? id=kGMRb4jbTP

  44. [44]

    Kavitha Srinivas, Julian Dolby, Ibrahim Abdelaziz, Oktie Hassanzadeh, Har- sha Kokel, Aamod Khatiwada, Tejaswini Pedapati, Subhajit Chaudhury, and Horst Samulowitz. 2023. LakeBench: Benchmarks for Data Discovery over Data Lakes.CoRRabs/2307.04217 (2023). https://doi.org/10.48550/ARXIV.2307.04217 arXiv:2307.04217

  45. [45]

    Yoshihiko Suhara, Jinfeng Li, Yuliang Li, Dan Zhang, Çagatay Demiralp, Chen Chen, and Wang-Chiew Tan. 2022. Annotating Columns with Pre-trained Lan- guage Models. InSIGMOD ’22: International Conference on Management of Data, Philadelphia, PA, USA, June 12 - 17, 2022, Zachary G. Ives, Angela Bonifati, and Amr El Abbadi (Eds.). ACM, 1493–1503. https://doi.o...

  46. [46]

    Yuan Sui, Mengyu Zhou, Mingjie Zhou, Shi Han, and Dongmei Zhang. 2024. Table meets llm: Can large language models understand structured table data? a benchmark and empirical study. InProceedings of the 17th ACM International Conference on Web Search and Data Mining. 645–654

  47. [47]

    Liane Vogel, Jan-Micha Bodensohn, and Carsten Binnig. 2024. WikiDBs: A Large-Scale Corpus Of Relational Databases From Wikidata. InThe Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track. https://openreview.net/pdf?id=abXaOcvujs

  48. [48]

    Denny Vrandečić and Markus Krötzsch. 2014. Wikidata: a free collaborative knowledgebase.Commun. ACM57, 10 (2014), 78–85

  49. [49]

    Wenhui Wang, Furu Wei, Li Dong, Hangbo Bao, Nan Yang, and Ming Zhou. 2020. MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre- Trained Transformers. InAdvances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, Hugo Larochelle,...

  50. [50]

    Jing Wu, Suiyao Chen, Qi Zhao, Renat Sergazinov, Chen Li, Shengjie Liu, Chongchao Zhao, Tianpei Xie, Hanqing Guo, Cheng Ji, Daniel Cociorva, and Hakan Brunzell. 2024. SwitchTab: Switched Autoencoders Are Effective Tabu- lar Learners. InThirty-Eighth AAAI Conference on Artificial Intelligence, AAAI 2024, Thirty-Sixth Conference on Innovative Applications o...

  51. [51]

    Junjie Xing, Yeye He, Mengyu Zhou, Haoyu Dong, Shi Han, Lingjiao Chen, Dongmei Zhang, Surajit Chaudhuri, and HV Jagadish. 2025. Mmtu: A mas- sive multi-task table understanding and reasoning benchmark.arXiv preprint arXiv:2506.05587(2025)

  52. [52]

    Pengcheng Yin, Graham Neubig, Wen-tau Yih, and Sebastian Riedel. 2020. TaBERT: Pretraining for Joint Understanding of Textual and Tabular Data. In Proceedings of the 58th Annual Meeting of the Association for Computational Lin- guistics, ACL 2020, Online, July 5-10, 2020, Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel R. Tetreault (Eds.). Associatio...

  53. [53]

    col1: val1 | col2: val2

    Tianshu Zhang, Xiang Yue, Yifei Li, and Huan Sun. 2024. TableLlama: Towards Open Large Generalist Models for Tables. InProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), NAACL 2024, Mexico City, Mexico, June 16-21, 2024, Kevin Duh, Helena G...

  54. [54]

    During pre-processing, categorical columns are converted into numerical codes via label encoding with missing values filled by -1 and NaNs set to 0

    during the forward pass. During pre-processing, categorical columns are converted into numerical codes via label encoding with missing values filled by -1 and NaNs set to 0. TabPFNv2.5 also has a constraint in terms of allowing a maximum of 10000 training samples and 100 columns, due to which some evaluations cannot be completed. Extracting Embeddings.For...