The Illusion of Generalization in Tabular Language Models

Aditya Gorla; Ratish Puduppully

arXiv: 2602.04031 · v2 · pith:MJL4FY3Pnew · submitted 2026-02-03 · cs.LG

keywords: tabular, classification, datasets, generalization, achieve, claimed, evaluation, findings

Tabular Language Models (TLMs) have been claimed to achieve strong generalization for tabular prediction. We conduct a systematic re-evaluation of Tabula-8B as a representative TLM, using 165 datasets from the UniPredict benchmark. Our investigation reveals three findings. First, binary and categorical classification achieve near-zero median lift over majority-class baselines, and strong aggregate performance is driven entirely by quartile classification tasks. Second, top-performing datasets exhibit pervasive contamination, including complete train-test overlap and task-level leakage that evades standard deduplication. Third, instruction tuning without tabular exposure recovers 92.2% of standard classification performance; on quartile classification, format familiarity closes 71.3% of the gap, with the residual attributable to contaminated datasets. These findings suggest that the claimed generalization likely reflects evaluation artifacts rather than learned tabular reasoning. We conclude with recommendations for strengthening TLM evaluation.
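Two quantities in the abstract, lift over a majority-class baseline and train-test overlap, can be illustrated with a minimal Python sketch. This is not the authors' evaluation code. It assumes that "lift" means model accuracy minus majority-class accuracy (the paper may define it differently), that overlap is checked by exact row matching, and all function names are hypothetical.

```python
from collections import Counter
from statistics import median


def majority_baseline_accuracy(train_labels, test_labels):
    """Accuracy of always predicting the most frequent training label."""
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for y in test_labels if y == majority_label)
    return hits / len(test_labels)


def lift_over_majority(model_accuracy, train_labels, test_labels):
    """Assumed definition: model accuracy minus majority-class accuracy."""
    return model_accuracy - majority_baseline_accuracy(train_labels, test_labels)


def median_lift(lifts):
    """Median lift across datasets; robust to a few very high-lift datasets."""
    return median(lifts)


def exact_overlap_rate(train_rows, test_rows):
    """Fraction of test rows that also appear verbatim in the training rows.

    Exact matching only: it would not detect the task-level leakage
    mentioned in the abstract, which evades standard deduplication.
    """
    seen = {tuple(row) for row in train_rows}
    hits = sum(1 for row in test_rows if tuple(row) in seen)
    return hits / len(test_rows)
```

A lift near zero means the model does no better than always guessing the most common class. Reporting the median across datasets, rather than the mean, keeps a handful of high-scoring datasets from dominating the aggregate.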

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

STRABLE: Benchmarking Tabular Machine Learning with Strings
cs.LG · 2026-05 · unverdicted · novelty 8.0
A new corpus of 108 mixed string-numeric tables shows that advanced tabular learners with basic string embeddings perform well on most real-world data, while large LLM encoders help on free-text heavy tables.

MulTaBench: Benchmarking Multimodal Tabular Learning with Text and Image
cs.LG · 2026-05 · unverdicted · novelty 7.0
MulTaBench is a new collection of 40 image-tabular and text-tabular datasets designed to test target-aware representation tuning in multimodal tabular models.

Towards Universal Tabular Embeddings: A Benchmark Across Data Tasks
cs.LG · 2026-04 · unverdicted · novelty 6.0
TEmBed benchmark shows that the best tabular embedding model depends on the specific task and the representation level (cell, row, column, or table).