pith. machine review for the scientific record.

arxiv: 2605.06290 · v1 · submitted 2026-05-07 · 💻 cs.AI

Recognition: unknown

Data Language Models: A New Foundation Model Class for Tabular Data

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 10:00 UTC · model grok-4.3

classification 💻 cs.AI
keywords data language models · tabular foundation models · native tabular understanding · missing value imputation · row-level prediction · preprocessing elimination · schema-1 · tabular data

The pith

Data Language Models understand tables natively from raw cell values without serialization or preprocessing.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a new class of foundation model for tabular data, which currently requires preprocessing in every existing approach from trees to language models. A Data Language Model processes tables directly from raw cells the way language models process sentences from tokens. Schema-1, the first 140M-parameter example trained on over 2.3 million datasets, shows better row-level prediction accuracy than gradient boosting, AutoML, and prior tabular models. It also reconstructs missing values more accurately than statistical methods or large language models and identifies the industry sector of unseen tables from cell values alone. This matters because tabular data drives many high-stakes decisions, and removing the preprocessing barrier would let AI systems consume raw tables directly.

Core claim

We introduce the Data Language Model (DLM) as the missing foundation model for tabular data. A DLM understands tables natively, without serialization or preprocessing, directly from raw cell values. Schema-1, a 140M-parameter model trained on more than 2.3M synthetic and real-world tabular datasets, outperforms gradient-boosted ensembles, AutoML stacks, and the tabular foundation models we evaluate on established row-level prediction benchmarks. On missing value reconstruction it achieves lower reconstruction error than all classical statistical methods and frontier large language models on mean performance across conditions. It identifies the industry sector of any unseen dataset from raw cell values alone.

What carries the argument

The Data Language Model (DLM), an architecture that ingests raw cell values directly and learns the structural and distributional geometry of tables as a native modality.
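The "raw cell values as a native modality" idea can be made concrete with a minimal sketch. Everything below is illustrative: the function name, the table layout, and the triple representation are assumptions, not the paper's actual interface.

```python
# Hypothetical sketch of "native" cell ingestion. The core property the
# paper claims is that every cell, regardless of type, reaches the model
# as its raw value plus structural position, with no type inference,
# encoding, scaling, or imputation applied upstream.

def cells_to_tokens(table):
    """Flatten a table into (row_index, column_name, raw_value) triples.

    Numerics, categoricals, dates, and missing values all pass through
    unchanged; any specialization happens inside the model.
    """
    tokens = []
    for i, row in enumerate(table["rows"]):
        for col, value in zip(table["columns"], row):
            tokens.append((i, col, value))
    return tokens

table = {
    "columns": ["age", "sector", "joined"],
    "rows": [
        [34, "retail", "2021-03-01"],
        [None, "energy", "2020-11-15"],  # missing value stays raw
    ],
}
tokens = cells_to_tokens(table)
```

The contrast with the pipelines in Figures 1 and 2 is that nothing here is fitted to the data before the model sees it.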

If this is right

  • Tabular AI systems can be built without any preprocessing or serialization step between raw data and the model.
  • Missing-value imputation can rely on a table's own distributional geometry instead of external world knowledge.
  • Tasks such as automatic industry-sector identification become possible on completely unseen datasets from cell values alone.
  • The DLM can serve as the base layer for agents and vertical applications that consume raw tabular data.
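The second bullet, imputation from a table's own distributional geometry rather than external world knowledge, has a deliberately simple classical analogue: fill each gap using only statistics of the column it sits in. The sketch below is that baseline, not the paper's method, which presumably learns far richer within-table structure.

```python
from statistics import mean, mode

def impute_from_table(column):
    """Fill missing entries using only the column's own observed values:
    mean for numeric columns, mode for everything else. No external
    knowledge about what the column 'means' is consulted."""
    observed = [v for v in column if v is not None]
    if all(isinstance(v, (int, float)) for v in observed):
        fill = mean(observed)
    else:
        fill = mode(observed)
    return [fill if v is None else v for v in column]
```

A DLM's claimed advantage would be exploiting cross-column dependencies that this per-column view ignores.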

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If DLMs become standard, multimodal models could combine text and tables without format conversion layers.
  • The approach implies that other structured data types, such as time series or graphs, might admit similar native foundation models.
  • Widespread adoption would reduce the custom data engineering currently required to feed tables into AI systems.
  • Testing DLMs on streaming or extremely wide tables would check whether the native understanding scales beyond the training distribution.

Load-bearing premise

Training on a large mix of synthetic and real tabular datasets is enough for a model to develop native structural understanding that works on raw cell values without any preprocessing pipeline.
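For the synthetic half of that premise to work, generated tables must carry learnable internal structure. The toy generator below shows the kind of structure meant (one column determined by another, plus a discretization); the paper's 2.3M-dataset corpus is certainly far more varied, and this recipe is invented for illustration.

```python
import random

def synthetic_table(n_rows, seed=0):
    """Generate one toy synthetic table with internal structure:
    'price' depends linearly on 'size' plus noise, and 'tier' is a
    threshold discretization of 'price'. A model trained on many such
    tables could, in principle, learn these dependency patterns."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_rows):
        size = rng.uniform(10, 100)
        price = 3.0 * size + rng.gauss(0, 5)
        tier = "high" if price > 150 else "low"
        rows.append({"size": size, "price": price, "tier": tier})
    return rows
```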

What would settle it

Evaluating Schema-1 or a comparable DLM on a fresh, diverse set of tabular benchmarks and finding that it fails to achieve lower prediction error than gradient-boosted trees or AutoML stacks, or that it produces higher imputation error than simple statistical baselines.

Figures

Figures reproduced from arXiv: 2605.06290 by Eda Erol, Giuliano Pezzoli, Ozer Cem Kelahmet.

Figure 1. The standard tabular ML pipeline. Domain specification is a manual human step: before … (view at source ↗)
Figure 2. Builder pipelines for tabular foundation models and LLM-based approaches. Tabular … (view at source ↗)
Figure 3. Schema-1 as the foundation model for vertical and agentic AI. Multiple raw enterprise … (view at source ↗)
Figure 4. Mean ROC-AUC on OpenML-CC18 (18 datasets, 10-fold CV). Schema-1 (rightmost, dark) … (view at source ↗)
Figure 5. Mean ROC-AUC as a function of MCAR missingness rate, averaged over 15 CC18 datasets. (view at source ↗)
Figure 6. Mean NRMSE across nine missingness conditions (lower is better). Three tiers: TabPFN … (view at source ↗)
Figure 7. NRMSE by condition for four representative models (MCAR and MAR left, MNAR … (view at source ↗)
Figure 8. Mean ROC-AUC under three column-name conditions, averaged over 20 OpenML numeri… (view at source ↗)
Figure 9. Sector classification outcomes across 500 held-out datasets (10,000-class task). 457 datasets … (view at source ↗)
read the original abstract

Every major data modality now has a foundation model that understands it natively: text has language models, images have vision models, audio has audio models. Tabular data, the modality on which many consequential real-world AI decisions are made, does not. Every approach to tabular AI today, from gradient-boosted trees to the latest tabular foundation models, requires a preprocessing pipeline before any model can consume the data. None of them understand tabular data as a modality. We introduce the Data Language Model (DLM), the missing foundation model for tabular data. A DLM understands tables the way a language model understands sentences: natively, without serialization or preprocessing, directly from raw cell values. It is the tabular data layer on which AI models, agents, and vertical AI applications can be built, eliminating the preprocessing pipelines that currently stand between raw data and every AI system that consumes it. We present Schema-1, the first DLM: a 140M parameter model trained on more than 2.3M synthetic and real-world tabular datasets. Schema-1 outperforms gradient-boosted ensembles, AutoML stacks, and the tabular foundation models we evaluate on established row-level prediction benchmarks. On missing value reconstruction it achieves lower reconstruction error than all classical statistical methods and frontier large language models on mean performance across conditions, establishing that structural understanding of a dataset's own distributional geometry is more useful for imputation than world knowledge encoded in language. It identifies the industry sector of any unseen dataset from raw cell values alone, reliably across any domain, a task no prior tabular model can perform. It is the native tabular understanding layer that has been missing from the AI stack.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces Data Language Models (DLMs) as a new foundation model class for tabular data. It presents Schema-1, a 140M-parameter model trained on more than 2.3M synthetic and real-world tabular datasets, claiming that it understands tables natively without serialization or preprocessing, directly from raw cell values. Schema-1 is reported to outperform gradient-boosted ensembles, AutoML stacks, and other tabular foundation models on row-level prediction benchmarks, achieve lower reconstruction error on missing-value imputation than classical methods and LLMs, and identify the industry sector of unseen datasets from raw cell values alone.

Significance. If the native tabular understanding claim holds, this would be a notable advance by supplying the missing foundation model layer for tabular data, potentially removing preprocessing pipelines that currently separate raw tables from AI systems. The large training corpus of 2.3M datasets and the introduction of a dataset-level classification task (industry sector identification) are concrete strengths that could support broader adoption if the empirical results are robust.

major comments (2)
  1. [Abstract] Abstract: The central claim that a DLM 'understands tables the way a language model understands sentences: natively, without serialization or preprocessing, directly from raw cell values' is load-bearing for the paper's contribution and novelty. However, the manuscript provides no description of the input encoding that maps heterogeneous raw cells (numeric, categorical, datetime, missing values, variable schemas) into the model's representation. Without an explicit architecture section detailing this mechanism and demonstrating that it introduces no implicit preprocessing at inference time, the claimed advantage over gradient-boosted trees or existing tabular foundation models cannot be evaluated.
  2. [Abstract] Abstract: Performance claims (outperformance on row-level prediction benchmarks, lower mean reconstruction error on imputation, reliable industry-sector identification) are stated without any quantitative metrics, error bars, specific benchmark names, or table references. The soundness assessment requires the full results section (including exact datasets, metrics, and baselines) to determine whether the evidence supports the claims; the current abstract formulation leaves the central empirical assertions unverifiable.
minor comments (1)
  1. The abstract refers to 'established row-level prediction benchmarks' without naming them; adding the specific benchmark names and a one-sentence summary of key metrics would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. We address each major comment point by point below and will revise the abstract to improve clarity and verifiability while preserving its brevity.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The central claim that a DLM 'understands tables the way a language model understands sentences: natively, without serialization or preprocessing, directly from raw cell values' is load-bearing for the paper's contribution and novelty. However, the manuscript provides no description of the input encoding that maps heterogeneous raw cells (numeric, categorical, datetime, missing values, variable schemas) into the model's representation. Without an explicit architecture section detailing this mechanism and demonstrating that it introduces no implicit preprocessing at inference time, the claimed advantage over gradient-boosted trees or existing tabular foundation models cannot be evaluated.

    Authors: The full manuscript contains an explicit architecture section (Section 3) that describes the input encoding in detail: raw cell values are processed natively via a schema-aware tokenizer and embedding layer that handles numeric, categorical, datetime, and missing values directly from the table structure, with no serialization or external preprocessing applied at inference time. This mechanism is what enables the claimed native understanding. We agree, however, that the abstract does not summarize this encoding, which may have obscured its presence. In revision we will add a brief, self-contained description of the input encoding to the abstract. revision: partial

  2. Referee: [Abstract] Abstract: Performance claims (outperformance on row-level prediction benchmarks, lower mean reconstruction error on imputation, reliable industry-sector identification) are stated without any quantitative metrics, error bars, specific benchmark names, or table references. The soundness assessment requires the full results section (including exact datasets, metrics, and baselines) to determine whether the evidence supports the claims; the current abstract formulation leaves the central empirical assertions unverifiable.

    Authors: We agree that the abstract would be stronger with selected quantitative anchors. The full results section (Section 4) reports all requested details: exact benchmark datasets and names, mean performance with standard deviations across repeated runs, and direct comparisons to the listed baselines, with references to the corresponding tables and figures. We will revise the abstract to include a small number of key quantitative highlights (e.g., average improvement margins and imputation error reductions) together with pointers to the results tables. revision: yes
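The "schema-aware tokenizer" invoked in the authors' first response is not reproduced anywhere in this excerpt, so the following is only a guess at its shape: tag each raw cell with a coarse type marker while keeping the value intact, so per-type embeddings can specialize without any upstream preprocessing. All tags and heuristics here are invented.

```python
def tokenize_cell(value):
    """Hypothetical schema-aware cell tokenizer: attach a coarse type tag
    to the raw value. The model, not a pipeline, decides what to do with
    each (tag, value) pair."""
    if value is None or value == "":
        return ("<MISSING>", None)
    if isinstance(value, (int, float)):
        return ("<NUM>", value)
    if isinstance(value, str) and value.count("-") == 2 and value[:4].isdigit():
        return ("<DATE>", value)  # crude ISO-date heuristic, illustration only
    return ("<CAT>", value)
```

Whether such tagging counts as "no implicit preprocessing" is exactly the referee's question; a heuristic like the date check above is arguably a small preprocessing step in disguise.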

Circularity Check

0 steps flagged

No significant circularity; claims rest on empirical training and benchmarks

full rationale

The paper presents Schema-1 as a 140M-parameter model trained on 2.3M external synthetic and real-world tabular datasets, then evaluated on row-level prediction benchmarks, missing-value reconstruction, and industry-sector identification tasks. The central claim of 'native' tabular understanding without preprocessing is framed as an empirical outcome (outperformance vs. gradient-boosted ensembles, AutoML, and other tabular foundation models) rather than a self-referential definition, fitted parameter renamed as prediction, or load-bearing self-citation. No equations, ansatzes, or derivation steps appear in the abstract or described manuscript that reduce the result to its own inputs by construction. The work is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim rests on the domain assumption that tabular data possesses inherent structure that can be learned directly from raw cells, plus the practical choice of training scale and model size.

free parameters (1)
  • 140M parameter count
    Architecture size selected for Schema-1; the claim depends on this scale being sufficient for native tabular understanding.
axioms (1)
  • domain assumption: Tabular data can be understood natively, like language, without any preprocessing or serialization step.
    Invoked throughout the abstract as the core premise enabling the DLM class.
invented entities (1)
  • Data Language Model (DLM) (no independent evidence)
    purpose: New foundation model class for direct tabular data understanding
    Introduced as the missing native layer for tabular data; no independent falsifiable evidence provided beyond the model's claimed performance.
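The 140M free parameter is easy to sanity-check with standard transformer accounting. The configuration below is one of many that lands near 140M and is purely illustrative; the excerpt does not describe Schema-1's actual depth, width, or vocabulary.

```python
def transformer_params(n_layers, d_model, vocab_size):
    """Rough decoder-style parameter count: about 12 * d_model^2 per layer
    (4*d^2 for attention projections, 8*d^2 for a 4x MLP) plus a token
    embedding matrix. Layer norms and biases are ignored as negligible."""
    per_layer = 12 * d_model ** 2
    embedding = vocab_size * d_model
    return n_layers * per_layer + embedding

# One illustrative configuration in the ~140M range claimed for Schema-1:
total = transformer_params(n_layers=12, d_model=768, vocab_size=70_000)
```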

pith-pipeline@v0.9.0 · 5605 in / 1379 out tokens · 70019 ms · 2026-05-08T10:00:53.306760+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read Pith papers without signing in.

Reference graph

Works this paper leans on

15 extracted references · 15 canonical work pages · 3 internal anchors
