pith. sign in

arxiv: 2605.24417 · v2 · pith:AKVAT3NLnew · submitted 2026-05-23 · 💻 cs.LG

LLMTabBench: Evaluating LLMs on Binary Tabular Classification From Zero to Few Shots

Pith reviewed 2026-06-30 14:22 UTC · model grok-4.3

classification 💻 cs.LG
keywords LLMstabular classificationzero-shotfew-shotbenchmarkin-context learningbinary classificationlow-data
0
0 comments X

The pith

Large language models can match or exceed few-shot performance on binary tabular classification using only zero-shot task descriptions, but additional examples sometimes degrade results due to conflicts with prior knowledge.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents LLMTabBench to test how LLMs handle binary classification tasks on tabular data with little or no labeled data. It finds that models often perform well from zero-shot prompts alone because their pre-trained knowledge fits the tasks, and that adding a few examples can sometimes lower accuracy when those examples clash with what the model already knows. Performance also falls once the data becomes complex enough that the examples stop helping. This setup matters for applications where collecting labels is expensive or slow, showing when LLMs might replace traditional few-shot methods.

Core claim

LLMs achieve competitive or superior accuracy on binary tabular classification in zero-shot settings compared to few-shot, with examples sometimes conflicting with prior knowledge to reduce performance, and a complexity threshold beyond which LLM effectiveness drops and examples become less useful.

What carries the argument

LLMTabBench benchmark evaluating LLM interactions between prior knowledge, task descriptions, and in-context examples across real-world and synthetic datasets of varying complexity.

If this is right

  • Zero-shot LLM use can suffice or outperform few-shot for certain tabular problems.
  • In-context examples must be chosen to align with rather than contradict model knowledge.
  • There exists a data complexity level where LLM tabular classification stops scaling with more shots.
  • The benchmark reveals limits of in-context learning for tabular data in low-data regimes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Prior knowledge in LLMs may serve a role similar to large-scale pretraining on synthetic data for tabular tasks.
  • Methods to mitigate knowledge conflicts could extend the usefulness of few-shot prompting.
  • The findings may generalize to multi-class or regression tasks on tabular data if similar patterns hold.
  • Testing on additional real-world domains could confirm the complexity threshold.

Load-bearing premise

The selected real-world and synthetic datasets with their task descriptions capture the typical interactions between LLM knowledge and examples in low-data tabular classification.

What would settle it

Observing consistent performance gains from additional examples without degradation across a broader range of tabular datasets or no clear drop at higher complexities would challenge the claims.

Figures

Figures reproduced from arXiv: 2605.24417 by Alina Kostromina, Aziz Temirkhanov, Daria Grushina, Dmitry Simakov, Kseniia Kuvshinova, Mile Mitrovic.

Figure 1
Figure 1. Figure 1: 0-shot LLM vs. 16-shot TabPFN: Average ROC-AUC [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Statistical Significance of LLM 0-shot ROC-AUC vs. Random Weights ROC-AUC [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: LLM performance on real datasets under four prior-knowledge configurations in zero-shot and 64-shot settings. [PITH_FULL_IMAGE:figures/full_fig_p009_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: LLM Average ROC-AUC on Real Datasets vs.TabPFN [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: LLM Average ROC-AUC on LLM-Generated Datasets [PITH_FULL_IMAGE:figures/full_fig_p010_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: LLM Average ROC-AUC on LLM-Generated Datasets [PITH_FULL_IMAGE:figures/full_fig_p010_6.png] view at source ↗
Figure 8
Figure 8. Figure 8: Per-domain comparison: GPT-4o-mini at 0-shot, [PITH_FULL_IMAGE:figures/full_fig_p020_8.png] view at source ↗
Figure 7
Figure 7. Figure 7: Per-domain mean ROC–AUC at 0-shot for the four [PITH_FULL_IMAGE:figures/full_fig_p020_7.png] view at source ↗
Figure 10
Figure 10. Figure 10: Prior-knowledge gradual increment on real [PITH_FULL_IMAGE:figures/full_fig_p021_10.png] view at source ↗
Figure 14
Figure 14. Figure 14: Same comparison as Figure 13 but at 16 shots. Few [PITH_FULL_IMAGE:figures/full_fig_p021_14.png] view at source ↗
Figure 15
Figure 15. Figure 15: Prior-knowledge gradual increment on real [PITH_FULL_IMAGE:figures/full_fig_p021_15.png] view at source ↗
Figure 16
Figure 16. Figure 16: Prior-knowledge gradual increment on real [PITH_FULL_IMAGE:figures/full_fig_p022_16.png] view at source ↗
Figure 17
Figure 17. Figure 17: Prior-knowledge gradual increment on real [PITH_FULL_IMAGE:figures/full_fig_p022_17.png] view at source ↗
Figure 18
Figure 18. Figure 18: Prior-knowledge gradual increment on real [PITH_FULL_IMAGE:figures/full_fig_p022_18.png] view at source ↗
Figure 22
Figure 22. Figure 22: 0-shot per-domain radar on the LLM-synthetic [PITH_FULL_IMAGE:figures/full_fig_p023_22.png] view at source ↗
Figure 25
Figure 25. Figure 25: Prior-knowledge gradual increment on the LLM [PITH_FULL_IMAGE:figures/full_fig_p023_25.png] view at source ↗
Figure 26
Figure 26. Figure 26: Prior-knowledge gradual increment on LLM [PITH_FULL_IMAGE:figures/full_fig_p023_26.png] view at source ↗
Figure 27
Figure 27. Figure 27: Prior-knowledge gradual increment on LLM [PITH_FULL_IMAGE:figures/full_fig_p024_27.png] view at source ↗
Figure 28
Figure 28. Figure 28: Prior-knowledge gradual increment on LLM [PITH_FULL_IMAGE:figures/full_fig_p024_28.png] view at source ↗
Figure 29
Figure 29. Figure 29: Prior-knowledge gradual increment on LLM [PITH_FULL_IMAGE:figures/full_fig_p024_29.png] view at source ↗
Figure 30
Figure 30. Figure 30: Prior-knowledge gradual increment on LLM [PITH_FULL_IMAGE:figures/full_fig_p024_30.png] view at source ↗
read the original abstract

Supervised classification on tabular data remains a central machine learning task, but its dependence on large labeled datasets limits its applicability in data-scarce settings. Few-shot methods such as TabPFN achieve strong performance through large-scale synthetic pretraining, yet still require labeled context examples. Large Language Models (LLMs) offer a more flexible alternative through zero- and few-shot in-context learning from task descriptions, but their behavior on tabular data remains inconsistent. We introduce LLMTabBench, a benchmark for evaluating LLMs on tabular classification under low-data conditions. The benchmark studies how LLM prior knowledge interacts with task descriptions and few-shot examples, and how performance changes with increasing data complexity across real-world and controlled synthetic datasets. We find that LLMs can be highly competitive in zero-shot settings, sometimes outperforming models given few-shot examples. However, additional examples may conflict with prior knowledge, thereby degrading performance. We also observe a complexity threshold at which LLM performance declines and few-shot examples become less useful. These results clarify key limits of in-context learning for tabular data and inform the deployment of LLMs in low-data regimes.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces LLMTabBench, a benchmark for evaluating LLMs on binary tabular classification in zero- to few-shot regimes. It studies interactions between LLM prior knowledge, task descriptions, and in-context examples on real-world and synthetic datasets, claiming that LLMs are often competitive (sometimes superior) in zero-shot settings, that adding examples can degrade performance by conflicting with priors, and that a complexity threshold exists beyond which few-shot examples become less useful.

Significance. If the empirical patterns hold after addressing controls, the benchmark and findings would usefully document limits of in-context learning on tabular data and inform when zero-shot LLM use is preferable to few-shot in low-data settings. The inclusion of both real and controlled synthetic data is a positive design choice for isolating complexity effects.

major comments (2)
  1. [Abstract and §4] Abstract and §4 (results on few-shot degradation): the mechanistic claim that performance drops occur because 'additional examples may conflict with prior knowledge' is not supported by controls that isolate this from confounds such as prompt length, token budget, or example noise. No label-flip or consistency-controlled experiments (holding features and length fixed while varying label agreement with priors) are described.
  2. [§3] §3 (dataset and task construction): the representativeness assumption underlying generalization of the complexity-threshold and prior-knowledge interaction findings is not justified. No analysis shows that the chosen real-world and synthetic datasets span the relevant distribution of tabular problems or that task descriptions are typical of deployment scenarios.
minor comments (2)
  1. [§2] Notation for zero-shot vs. few-shot prompting variants should be defined once in §2 and used consistently in all tables and figures.
  2. [Figures in §4] Figure captions for performance-vs-complexity plots should explicitly state the number of runs, error-bar definition, and whether statistical significance tests were applied.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We respond to each major point below and indicate the revisions we will make to the manuscript.

read point-by-point responses
  1. Referee: [Abstract and §4] the mechanistic claim that performance drops occur because 'additional examples may conflict with prior knowledge' is not supported by controls that isolate this from confounds such as prompt length, token budget, or example noise. No label-flip or consistency-controlled experiments (holding features and length fixed while varying label agreement with priors) are described.

    Authors: We agree that the manuscript interprets the observed degradation in few-shot performance as potentially arising from conflicts with model priors, but does so on the basis of patterns across real and synthetic datasets rather than through experiments that hold prompt length and features fixed while varying label agreement. No label-flip or consistency-controlled ablations are present. We will revise the abstract and §4 to present the prior-conflict account as one plausible interpretation supported by the data, while explicitly acknowledging alternative explanations (including length and noise effects) and noting the absence of isolating controls. A short paragraph will be added to the discussion outlining how such controls could be implemented in follow-up work. revision: yes

  2. Referee: [§3] the representativeness assumption underlying generalization of the complexity-threshold and prior-knowledge interaction findings is not justified. No analysis shows that the chosen real-world and synthetic datasets span the relevant distribution of tabular problems or that task descriptions are typical of deployment scenarios.

    Authors: The datasets were selected to cover multiple real-world domains together with synthetic constructions that systematically vary complexity; however, the manuscript contains no formal analysis demonstrating that the collection is representative of the broader space of tabular classification tasks or that the task descriptions mirror typical deployment phrasing. We will revise §3 to clarify the selection rationale and will add a dedicated limitations subsection that discusses the scope of the benchmark, the generalizability of the reported thresholds and interactions, and the need for wider validation across additional tabular distributions and prompt styles. revision: yes

Circularity Check

0 steps flagged

Empirical benchmark evaluation with no derivations or self-referential structure

full rationale

This paper introduces LLMTabBench as an empirical evaluation framework for LLMs on binary tabular classification in zero- and few-shot regimes. The central claims consist of observed performance patterns across real-world and synthetic datasets, with no equations, parameter fitting presented as prediction, uniqueness theorems, or ansatzes. All load-bearing statements are direct experimental outcomes rather than reductions to prior self-citations or input definitions. The work is therefore self-contained against external benchmarks and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Empirical benchmark paper with no mathematical model, free parameters, axioms, or postulated entities.

pith-pipeline@v0.9.1-grok · 5752 in / 1188 out tokens · 64915 ms · 2026-06-30T14:22:29.919482+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

36 extracted references · 14 canonical work pages · 7 internal anchors

  1. [1]

    Sercan Ö Arik and Tomas Pfister. 2021. Tabnet: Attentive interpretable tabular learning. InProceedings of the AAAI conference on artificial intelligence, Vol. 35. 6679–6687

  2. [2]

    Jonathan Bac, Evgeny M Mirkes, Alexander N Gorban, Ivan Tyukin, and Andrei Zinovyev. 2021. Scikit-dimension: a python package for intrinsic dimension estimation.Entropy23, 10 (2021), 1368

  3. [3]

    Bernd Bischl, Giuseppe Casalicchio, Matthias Feurer, Pieter Gijsbers, Frank Hut- ter, Michel Lang, Rafael Gomes Mantovani, Jan N van Rijn, and Joaquin Van- schoren. 2021. Openml: A benchmarking layer on top of openml to quickly create, download, and share systematic benchmarks.NeurIPS–Track on Datasets and Benchmarks(2021)

  4. [4]

    Kevin M Carter, Raviv Raich, and Alfred O Hero III. 2009. On local intrinsic dimension estimation and its applications.IEEE Transactions on Signal Processing 58, 2 (2009), 650–663

  5. [5]

    Jintai Chen, Kuanlun Liao, Yao Wan, Danny Z Chen, and Jian Wu. 2022. Danets: Deep abstract networks for tabular data classification and regression. InProceed- ings of the AAAI Conference on Artificial Intelligence, Vol. 36. 3930–3938

  6. [6]

    Tianqi Chen. 2016. XGBoost: A Scalable Tree Boosting System.Cornell University (2016)

  7. [7]

    Zerui Cheng, Jiashuo Liu, Jianzhu Yao, Pramod Viswanath, Ge Zhang, and Wen- hao Huang. 2026. TabularMath: Evaluating Computational Extrapolation in Tabular Learning via Program-Verified Synthesis.arXiv preprint arXiv:2602.02523 (2026)

  8. [8]

    Elizabeth R DeLong, David M DeLong, and Daniel L Clarke-Pearson. 1988. Com- paring the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach.Biometrics(1988), 837–845

  9. [9]

    Nick Erickson, Lennart Purucker, Andrej Tschalzev, David Holzmüller, Pra- teek Mutalik Desai, David Salinas, and Frank Hutter. 2025. Tabarena: A living benchmark for machine learning on tabular data.arXiv preprint arXiv:2506.16791 (2025)

  10. [10]

    Elena Facco, Maria d’Errico, Alex Rodriguez, and Alessandro Laio. 2017. Estimat- ing the intrinsic dimension of datasets by a minimal neighborhood information. Scientific reports7, 1 (2017), 12140

  11. [11]

    Mingyu Fan, Nannan Gu, Hong Qiao, and Bo Zhang. 2010. Intrinsic dimension estimation of data by principal component analysis.arXiv preprint arXiv:1002.2050 (2010)

  12. [12]

    Ronald A Fisher. 1922. On the interpretation of 𝜒 2 from contingency tables, and the calculation of P.Journal of the royal statistical society85, 1 (1922), 87–94

  13. [13]

    Ronald Aylmer Fisher. 1970. Statistical methods for research workers. InBreak- throughs in statistics: Methodology and distribution. Springer, 66–70

  14. [14]

    Keinosuke Fukunaga and David R Olsen. 1971. An algorithm for finding intrinsic dimensionality of data.IEEE Transactions on computers100, 2 (1971), 176–183

  15. [15]

    Josh Gardner, Juan C Perdomo, and Ludwig Schmidt. [n. d.]. Large scale trans- fer learning for tabular data via language modeling, 2024.URL https://arxiv. org/abs/2406.12031([n. d.])

  16. [16]

    Yury Gorishniy, Akim Kotelnikov, and Artem Babenko. [n. d.]. TabM: Advancing tabular deep learning with parameter-efficient ensembling, 2025.URL https://arxiv. org/abs/2410.24210([n. d.])

  17. [17]

    Aditya Gorla and Ratish Puduppully. 2026. The Illusion of Generalization: Re- examining Tabular Language Model Evaluation.arXiv preprint arXiv:2602.04031 (2026)

  18. [18]

    Léo Grinsztajn, Klemens Flöge, Oscar Key, Felix Birkel, Philipp Jund, Brendan Roof, Benjamin Jäger, Dominik Safaric, Simone Alessi, Adrian Hayler, et al. 2025. Tabpfn-2.5: Advancing the state of the art in tabular foundation models.arXiv preprint arXiv:2511.08667(2025)

  19. [19]

    Stefan Hegselmann, Alejandro Buendia, Hunter Lang, Monica Agrawal, Xiaoyi Jiang, and David Sontag. 2023. Tabllm: Few-shot classification of tabular data with large language models. InInternational conference on artificial intelligence and statistics. PMLR, 5549–5581

  20. [20]

    Noah Hollmann, Samuel Müller, Katharina Eggensperger, and Frank Hutter. 2022. Tabpfn: A transformer that solves small tabular classification problems in a second.arXiv preprint arXiv:2207.01848(2022)

  21. [21]

    Xin Huang, Ashish Khetan, Milan Cvitkovic, and Zohar Karnin. 2012. Tabtrans- former: tabular data modeling using contextual embeddings (2020).arXiv preprint arXiv:2012.06678(2012)

  22. [22]

    Sukriti Jaitly, Tanay Shah, Ashish Shugani, and Razik Singh Grewal. 2023. To- wards better serialization of tabular data for few-shot classification with large language models.arXiv preprint arXiv:2312.12464(2023)

  23. [23]

    Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tie-Yan Liu. 2017. Lightgbm: A highly efficient gradient boosting decision tree.Advances in neural information processing systems30 (2017)

  24. [24]

    Elizaveta Levina and Peter Bickel. 2004. Maximum likelihood estimation of intrinsic dimension.Advances in neural information processing systems17 (2004)

  25. [25]

    Ruixue Liu, Shaozu Yuan, Aijun Dai, Lei Shen, Tiangang Zhu, Meng Chen, and Xiaodong He. 2022. Few-Shot Table Understanding: A Benchmark Dataset and Pre-Training Baseline. InProceedings of the 29th International Conference on Computational Linguistics, Nicoletta Calzolari, Chu-Ren Huang, Hansaem Kim, James Pustejovsky, Leo Wanner, Key-Sun Choi, Pum-Mo Ryu,...

  26. [26]

    Hariharan Manikandan, Yiding Jiang, and J Zico Kolter. 2023. Language models are weak learners.Advances in Neural Information Processing Systems36 (2023), 50907–50931

  27. [27]

    Duncan McElfresh, Sujay Khandagale, Jonathan Valverde, Vishak Prasad C, Ganesh Ramakrishnan, Micah Goldblum, and Colin White. 2023. When do neural nets outperform boosted trees on tabular data?Advances in Neural Information Processing Systems36 (2023), 76336–76369

  28. [28]

    Evgeny M Mirkes, Jeza Allohibi, and Alexander Gorban. 2020. Fractional norms and quasinorms do not help to overcome the curse of dimensionality.Entropy 22, 10 (2020), 1105

  29. [29]

    Liudmila Prokhorenkova, Gleb Gusev, Aleksandr Vorobev, Anna Veronika Doro- gush, and Andrey Gulin. 2018. CatBoost: unbiased boosting with categorical features.Advances in neural information processing systems31 (2018)

  30. [30]

    Jingang QU, David Holzmüller, Gaël Varoquaux, and Marine Le Morvan. 2025. TabICL: A Tabular Foundation Model for In-Context Learning on Large Data. In Forty-second International Conference on Machine Learning. https://openreview .net/forum?id=0VvD1PmNzM

  31. [31]

    Dylan Slack and Sameer Singh. 2023. Tablet: Learning from instructions for tabular data.arXiv preprint arXiv:2304.13188(2023)

  32. [32]

    Samuel A Stouffer, Edward A Suchman, Leland C DeVinney, Shirley A Star, and Robin M Williams Jr. 1949. The american soldier: Adjustment during army life.(studies in social psychology in world war ii), vol. 1. (1949)

  33. [33]

    Yuan Sui, Mengyu Zhou, Mingjie Zhou, Shi Han, and Dongmei Zhang. 2023. Gpt4table: Can large language models understand structured table data? a bench- mark and empirical study.arXiv preprint ArXiv:2305.13062(2023)

  34. [34]

    Dongjie Wang, Yanyong Huang, Wangyang Ying, Haoyue Bai, Nanxu Gong, Xinyuan Wang, Sixun Dong, Tao Zhe, Kunpeng Liu, Meng Xiao, et al. 2025. To- wards data-centric ai: A comprehensive survey of traditional, reinforcement, and generative approaches for tabular data transformation.arXiv preprint arXiv:2501.10555(2025)

  35. [35]

    Xianjie Wu, Jian Yang, Linzheng Chai, Ge Zhang, Jiaheng Liu, Xeron Du, Di Liang, Daixin Shu, Xianfu Cheng, Tianzhen Sun, et al. 2025. Tablebench: A com- prehensive and complex benchmark for table question answering. InProceedings of the AAAI Conference on Artificial Intelligence, Vol. 39. 25497–25506

  36. [36]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. 2025. Qwen3 technical report.arXiv preprint arXiv:2505.09388(2025). A Datasets We use four dataset suites:(i)acore real-world suiteof 24 binary- classification tasks spanning eight knowledge domains (Section A.1); (ii)anew suiteof ...