
arxiv: 2605.13986 · v2 · pith:XSJCDPQSnew · submitted 2026-05-13 · 💻 cs.LG · stat.ML

TabPFN-3: Technical Report

Pith reviewed 2026-05-15 06:01 UTC · model grok-4.3

classification 💻 cs.LG stat.ML
keywords tabular data · foundation models · synthetic pretraining · TabArena benchmark · test-time scaling · gradient boosting · relational data · time series

The pith

TabPFN-3 outperforms all tuned and ensembled models on the TabArena tabular benchmark with a single forward pass.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

TabPFN-3 advances tabular foundation models by pretraining exclusively on synthetic data to handle datasets up to one million rows. It delivers superior prediction accuracy compared to gradient-boosted trees and other baselines while cutting training and inference times substantially. The model also introduces test-time compute scaling, allowing further performance gains through additional computation at inference. These improvements extend to time series, relational, and tabular-text data, positioning TabPFN-3 as a versatile tool for high-value prediction problems in science and industry.

Core claim

TabPFN-3 achieves state-of-the-art performance on tabular prediction tasks by scaling a transformer-based foundation model pretrained on synthetic data. On the TabArena benchmark, a single forward pass surpasses all other models, including tuned and ensembled baselines, and Pareto-dominates the speed-performance frontier. The TabPFN-3-Plus (Thinking) variant uses test-time compute to beat all non-TabPFN models by over 200 Elo on TabArena, rising to 420 Elo on the largest data subset, and outperforms AutoGluon 1.5 extreme while being ten times faster. The approach also sets new state-of-the-art results among foundation models on the relational benchmark RelBenchV1 and on the tabular-text benchmark TabSTAR, all while being up to twenty times faster than its predecessor and s
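
To give a feel for what Elo gaps of 200 and 420 points mean, the sketch below converts an Elo difference into an expected head-to-head win rate using the standard Elo formula. It assumes the conventional 400-point scale; whether TabArena's Elo uses exactly this calibration is an assumption here, and `elo_expected_score` is a hypothetical helper name, not something from the paper.

```python
import math


def elo_expected_score(elo_gap: float, scale: float = 400.0) -> float:
    """Expected score of the higher-rated model under the standard Elo formula.

    Assumption: the conventional 400-point scale. The benchmark's own
    calibration may differ.
    """
    return 1.0 / (1.0 + math.pow(10.0, -elo_gap / scale))


if __name__ == "__main__":
    # The two gaps quoted in the core claim.
    for gap in (200, 420):
        print(f"{gap} Elo gap: expected win rate {elo_expected_score(gap):.3f}")
```

Under that assumption, a 200-point gap corresponds to an expected win rate of roughly 76%, and a 420-point gap to roughly 92%.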

What carries the argument

Synthetic pretraining combined with test-time compute scaling in a transformer architecture for tabular data.

If this is right

  • Beats 8-hour-tuned gradient-boosted-tree baselines on datasets up to 1M rows and 200 features.
  • Ranks first on datasets with many classes.
  • Achieves new SOTA among foundation models on RelBenchV1 for relational data.
  • Provides SOTA on TabSTAR for tabular-text data via TabPFN-3-Plus.
  • Enables up to 120x faster SHAP-value computation and ranks 2nd on fev-bench via TabPFN-TS-3.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The reliance on synthetic data may allow the model to avoid privacy issues associated with real data training.
  • Test-time scaling opens the door to further performance improvements by allocating more compute during prediction without retraining.
  • The speed gains could make foundation models practical for real-time tabular applications where traditional methods were too slow.
  • Integration improvements suggest easier adoption in existing pipelines for time-series and interpretability tasks.

Load-bearing premise

The synthetic data used for pretraining sufficiently represents the distribution of real-world tabular datasets to enable strong generalization.

What would settle it

Evaluating TabPFN-3 on a large collection of previously unseen real-world tabular datasets collected after the model's release to check if the performance margins hold.

Figures

Figures reproduced from arXiv: 2605.13986 by Adrian Hayler, Alan Arazi, Anurag Garg, Benjamin Jäger, Bernhard Schölkopf, Brendan Roof, Clara Cornu, David Salinas, Diana Kriuchkova, Dominik Safaric, Eliott Kalfon, Felix Birkel, Frank Hutter, Georg Grab, Jake Robertson, Jan Hendrik Metzen, Jerry Chen, Julien Siems, Klemens Flöge, Kursat Kaya, Lennart Purucker, Léo Grinsztajn, Lilly Charlotte Wehrhahn, Lydia Sidhoum, Madelon Hulsebos, Magnus Bühler, Marie Salmon, Mihir Manium, Nick Erickson, Noah Hollmann, Oscar Key, Philipp Jund, Philipp Singer, Samuel Müller, Sauraj Gambhir, Shi Bin Hoo, Simon Bing, Simone Alessi, Siyuan Guo, Vladyslav Moroshan, Yann LeCun.

Figure 1: Performance on the TabArena benchmark [1], largest data subset (10k-100k samples). TabPFN-3 outperforms any other model in a forward pass. TabPFN-3-Plus (Thinking) is dramatically better yet, outperforming AutoGluon 1.5 extreme [2], a complex ensemble of models tuned for 4 hours, while being 10x faster.
Figure 2: TabPFN-3 dominates the Pareto frontier on the largest datasets in TabArena (10k–100k rows). N1, N2, and N4 are model versions with 1, 2, and 4 estimators. Improvability measures how much worse a model is than the best per-dataset model. See Appendix E.2.1 and E.2.3 for details.
Figure 4: Evolution and performance of the TabPFN model family.
Figure 5: Architecture of TabPFN-3, adapted from the TabICLv2 architecture.
Figure 6: Chunking flattens the peak-memory without impacting the time-per-call.
Figure 7: KV-cache on H100 for a single estimator without preprocessing: OOM frontier with chunking and KV-cache (a) and cached-predict latency vs. uncached paths (b). This achieves a KV-cache size of 7 GiB per estimator for 1M-row datasets, making TabPFN-3’s default 8 estimators usable on common GPUs even for the largest datasets we support. As can be seen in Figure 7a, peak memory of (chunked) cache-predict is bas…
Figure 8: TabPFN-3’s KV-cached predict allows for one to three orders of magnitude speedup. We report results for a single estimator without preprocessing on an H100, for n_features ∈ {10, 100} and n_test = 100. Four series per panel: TabPFN-2.5 fit+predict (black, baseline), TabPFN-3 cold fit+predict (blue, no cache reuse), TabPFN-3 fit (build cache) that builds the cache (magenta – overlaps the cold curve since the …
Figure 9: Schematic visualization of our SCM prior.
Figure 10: TabPFN-3 performance on the standard TabArena benchmark
Figure 11: Pareto frontier on TabArena: trade-off between prediction quality and total training + inference cost. N1, N2, and N4 are TabPFN-3 versions with 1, 2, and 4 estimators. Improvability measures how much a model would improve by switching to the best model on each individual dataset, see Appendix E.2.1.
Figure 13: Average rank on the TALENT benchmark, using the TabICLv2 evaluation protocol
Figure 14: Performance over the TabSTAR Text-Tabular Collection
Figure 15: TabPFN-3 achieves state-of-the-art performance on the large-rows benchmark (up to
Figure 16: TabPFN-3 tops the normalized scaling curves for ROC-AUC OvR classification and
Figure 17: On the synthetic many-class benchmark TabPFN-3 achieves a normalized ROC-AUC
Figure 18: TabPFN scales well to high-dimensional, low-sample classification. Normalized ROC-AUC on the many-features benchmark slice, consisting of 6 classification datasets with 102–322 samples and 1,117–22,215 features. This high-dimensional, low-sample regime is particularly challenging for standard tree-based baselines. Increasing the number of TabPFN-3 estimators improves feature-space coverage and substantial…
Figure 20: Qualitative forecast comparison on a fev-bench task (
Figure 21: TabPFN-3 tops performance on RelBenchV1 among foundation models.
Figure 22: TabPFN-3 extracts semantically-meaningful row embeddings.
Figure 23: Visualization of directed acyclic graphs underlying our SCM prior
Figure 24: Visualization of functional relationships generated by the new combiner mechanisms in
Figure 25: Example classification dataset generated from the prior.
Figure 26: Example demonstrating the extrapolation capabilities of TabPFN-3 (using our
Figure 27: TabPFN-3 as a T/X/S-Learner. TabPFN-3 when used as a T/S-Learner achieves strong performance in terms of QINI-score (↑) in Uplift Modeling on the scikit-uplift benchmark. We report worsened performance in terms of PEHE (↓) on the RealCause benchmark compared to the previous version. Real-World QINI Evaluation. One of the major drawbacks in evaluating causal inference methods is refe…
Figure 28: Average rank on the TALENT benchmark broken down by task type (regression,
Figure 29: Average rank on the many-classes TALENT slice (4 datasets, all 100 classes).
Figure 30: Average rank on the large-rows (100k-1M rows) TALENT slice. E.3.5 Details. Per-dataset ranking: for each (dataset, split) we rank all methods by their score (best = 1; ties get average ranks). The reported mean rank of a method is the average of these ranks across all (dataset, split) pairs in the slice. Bootstrap confidence intervals: 95% confidence intervals are non-parametric bootstrap over datasets…
Figure 31: Performance on the classification tasks of the TabSTAR text-tabular collection.
Figure 32: Performance on the regression tasks of the TabSTAR text-tabular collection.
Figure 33: Critical difference diagram for ROC-AUC on the large-scale classification benchmark
Figure 34: Critical difference diagram for RMSE on the large-scale regression benchmark
Figure 35: Critical difference diagram for pinball loss on our quantile regression benchmark.
Figure 36: Critical difference diagram for ROC AUC on the synthetic many-class benchmark (up
Figure 37: Forward-pass inference speed-ups on the TabPFN-3 architecture.
Figure 38: Efficiency gains for SHAP-value computation with KV-cache across training table
Figure 39: solar_with_weather_15T, 15-minute solar generation with weather covariates.
Figure 40: rossmann_1W, weekly Rossmann store sales (series 1).
Figure 41: rohlik_orders_1D, daily online-grocery orders.
Figure 42: LOOP_SEATTLE_1H, hourly Seattle freeway loop-detector counts.
Figure 43: ETT_1H, hourly Electricity Transformer Temperature.
Figure 44: entsoe_1H, hourly ENTSO-E European electricity load.
Figure 45: Pairwise skill-score comparison on fev-bench (100 tasks) under SQL (left) and MASE (right). Cell (i, j) is the skill score of model i relative to model j, with 95% confidence intervals from bootstrapped resampling; cells whose interval overlaps zero are shown in italics. Rows and columns are ordered by overall skill score. Best viewed on screen.
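
The ranking procedure described alongside Figure 30 (rank methods per dataset with best = 1, give ties the average rank, average the ranks, then bootstrap over datasets for a 95% interval) can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: it assumes one split per dataset and a higher-is-better score, and the helper names `average_ranks` and `mean_rank_with_ci`, the percentile interval, and the toy scores are all hypothetical.

```python
import random
from statistics import mean


def average_ranks(scores: dict) -> dict:
    """Rank methods by score (higher is better, best = 1); ties share the average rank."""
    ordered = sorted(scores, key=lambda m: -scores[m])
    ranks = {}
    i = 0
    while i < len(ordered):
        j = i
        # Extend j over the run of methods tied with position i.
        while j + 1 < len(ordered) and scores[ordered[j + 1]] == scores[ordered[i]]:
            j += 1
        shared = ((i + 1) + (j + 1)) / 2  # average of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[ordered[k]] = shared
        i = j + 1
    return ranks


def mean_rank_with_ci(per_dataset_scores, method, n_boot=2000, seed=0):
    """Mean rank of `method` plus a 95% percentile bootstrap interval over datasets."""
    per_dataset = [average_ranks(s)[method] for s in per_dataset_scores]
    rng = random.Random(seed)
    boots = sorted(
        mean(rng.choices(per_dataset, k=len(per_dataset))) for _ in range(n_boot)
    )
    lower = boots[int(0.025 * n_boot)]
    upper = boots[int(0.975 * n_boot) - 1]
    return mean(per_dataset), (lower, upper)


# Toy scores for three hypothetical methods on three hypothetical datasets.
toy_scores = [
    {"A": 0.90, "B": 0.80, "C": 0.80},
    {"A": 0.70, "B": 0.75, "C": 0.60},
    {"A": 0.95, "B": 0.95, "C": 0.90},
]

if __name__ == "__main__":
    print(mean_rank_with_ci(toy_scores, "A"))
```

Resampling whole datasets, rather than individual (dataset, split) pairs, keeps correlated splits of the same dataset together, which is one plausible reason to bootstrap at the dataset level.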
Original abstract

Tabular data underpins most high-value prediction problems in science and industry, and TabPFN has driven the foundation model revolution for this modality. Designed with feedback from our users, TabPFN-3 builds on this foundation to scale state-of-the-art performance to datasets with 1M training rows and substantially reduce training and inference time. Pretrained exclusively on synthetic data from our prior, TabPFN-3 dramatically pushes the frontier of tabular prediction and brings substantial gains on time series, relational, and tabular-text data. On the standard tabular benchmark TabArena, a forward pass of TabPFN-3 outperforms all other models, including tuned and ensembled baselines, by a significant margin, and pareto-dominates the speed/performance frontier. On more diverse datasets, TabPFN-3 ranks first on datasets with many classes, and beats 8-hour-tuned gradient-boosted-tree baselines on datasets up to 1M training rows and 200 features. TabPFN-3 introduces test-time compute scaling to tabular foundation models. Our API offering TabPFN-3-Plus (Thinking) exploits this to beat all non-TabPFN models by over 200 Elo on TabArena, rising to 420 Elo on the largest data subset, and outperforms AutoGluon 1.5 extreme while being 10x faster, without using LLMs, real data, internet search or any other model besides TabPFN. TabPFN-3 extends the capabilities of our models, enabling SOTA prediction on relational data (new SOTA foundation model on RelBenchV1) and tabular-text data (SOTA on TabSTAR via TabPFN-3-Plus); and improves existing integrations: a specialized checkpoint, TabPFN-TS-3, ranks 2nd on the time-series benchmark fev-bench, and SHAP-value computation is up to 120x faster. TabPFN-3 achieves this performance while being up to 20x faster than TabPFN-2.5. In addition, a reduced KV cache and row-chunking scale to 1M rows on one H100 with fast inference speed.
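
The abstract's claim about Pareto-dominating the speed/performance frontier can be made concrete with a small sketch. A model is on the Pareto frontier if no other model is at least as good on both cost and error and strictly better on one. The function name `pareto_frontier` and the numbers below are invented for illustration; they are not results from the paper.

```python
def pareto_frontier(models: dict) -> list:
    """Return the names of models not dominated on (cost, error); lower is better for both."""
    frontier = []
    for name, (cost, err) in models.items():
        dominated = any(
            (c <= cost and e <= err) and (c < cost or e < err)
            for other, (c, e) in models.items()
            if other != name
        )
        if not dominated:
            frontier.append(name)
    # Sort by cost so the frontier reads left to right on a cost/error plot.
    return sorted(frontier, key=lambda n: models[n][0])


# Hypothetical (cost in seconds, error) pairs for four made-up models.
toy_models = {
    "fast_weak": (1.0, 0.30),
    "balanced": (5.0, 0.20),
    "slow_strong": (50.0, 0.15),
    "slow_weak": (60.0, 0.25),
}

if __name__ == "__main__":
    print(pareto_frontier(toy_models))
```

In this toy example `slow_weak` drops out because `balanced` is both cheaper and more accurate. A single model dominating the frontier would mean the returned list contains only that model.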

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces TabPFN-3, a scaled tabular foundation model pretrained exclusively on synthetic data sampled from the authors' prior (the synthetic data-generating distribution; Figure 9 depicts an SCM prior). It claims a single forward pass outperforms all tuned and ensembled baselines on the TabArena benchmark while Pareto-dominating the speed-performance frontier; TabPFN-3-Plus (Thinking) achieves >200 Elo gains (up to 420 on the largest data subset) over non-TabPFN models, beats AutoGluon 1.5 extreme at 10x speed, and delivers new SOTAs on RelBenchV1 (relational) and TabSTAR (tabular-text) plus second place on fev-bench (time series). Additional engineering claims include up to 20x speedups over TabPFN-2.5, 120x faster SHAP, and scaling to 1M rows on one H100 via a reduced KV cache and row chunking.
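
As a rough illustration of the row-chunking idea mentioned in the summary, the sketch below runs a prediction function over fixed-size chunks of rows so that only one chunk is in flight at a time. It shows the general technique only: which axis TabPFN-3 chunks internally is not stated here, and `chunked_predict`, `predict_fn` and `chunk_size` are hypothetical names rather than part of any TabPFN API.

```python
from typing import Callable, List, Sequence


def chunked_predict(
    predict_fn: Callable[[Sequence], List],
    rows: Sequence,
    chunk_size: int,
) -> List:
    """Apply `predict_fn` to `rows` in chunks of at most `chunk_size` rows.

    Peak working memory then scales with the chunk size rather than with
    the total number of rows.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    outputs: List = []
    for start in range(0, len(rows), chunk_size):
        outputs.extend(predict_fn(rows[start:start + chunk_size]))
    return outputs


if __name__ == "__main__":
    batch_sizes = []

    def toy_predict(batch):
        # Stand-in for a model call; records how many rows it saw at once.
        batch_sizes.append(len(batch))
        return [2 * value for value in batch]

    print(chunked_predict(toy_predict, list(range(10)), chunk_size=4))
    print(batch_sizes)
```

For scale, the Figure 7 caption quotes 7 GiB of KV cache per estimator at 1M rows, so the default 8 estimators would need about 8 × 7 = 56 GiB for caches alone.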

Significance. If the empirical results hold after rigorous validation, the work would represent a meaningful advance for tabular foundation models by demonstrating that synthetic pretraining plus test-time compute scaling can surpass heavily tuned gradient-boosted trees and AutoML systems on public benchmarks while delivering substantial inference speed-ups and cross-modal extensions. The reported ability to handle up to 1 M rows without real-data fine-tuning would be practically significant for industry deployments.

major comments (2)
  1. [Abstract] Abstract and Experimental Results: The headline claims of 'significant margin' outperformance on TabArena and 200–420 Elo gains for TabPFN-3-Plus are presented without error bars, statistical significance tests, exact train/test splits, or ablation studies on the synthetic generator, rendering the margins unverifiable from the provided information.
  2. [Pretraining Methodology] Pretraining section: The statement that pretraining uses 'exclusively synthetic data from our prior' lacks any decontamination protocol, held-out real-data validation set, or ablation freezing the generator before benchmark exposure; without these, the reported superiority over tuned baselines on TabArena risks circularity if the generator's feature/label/missingness distributions were calibrated to the evaluation suites.
minor comments (1)
  1. [Abstract] Clarify the precise mechanism of 'test-time compute scaling' in TabPFN-3-Plus (Thinking) and confirm it uses only TabPFN internals with no external models or search.
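
One way to address the request for significance tests is a paired test across datasets. The sketch below implements an exact two-sided Wilcoxon signed-rank test in plain Python for small samples. It is an illustration under stated assumptions, not the paper's protocol: zero differences are dropped, ties get average ranks, and the function name and the toy accuracies are hypothetical.

```python
def wilcoxon_signed_rank(x, y):
    """Exact two-sided Wilcoxon signed-rank test for paired samples.

    Returns (W_plus, p_value). Zero differences are dropped and tied absolute
    differences receive average ranks. Intended for small n.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    doubled_ranks = [0] * n  # store 2 * rank so half-integer ranks stay integers
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        for k in range(i, j + 1):
            doubled_ranks[order[k]] = (i + 1) + (j + 1)  # 2 * average rank
        i = j + 1
    w_plus2 = sum(r for r, d in zip(doubled_ranks, diffs) if d > 0)
    total2 = sum(doubled_ranks)
    # Exact null distribution: each rank is positive or negative with probability 1/2.
    counts = {0: 1}
    for r in doubled_ranks:
        updated = {}
        for s, c in counts.items():
            updated[s] = updated.get(s, 0) + c
            updated[s + r] = updated.get(s + r, 0) + c
        counts = updated
    # Two-sided p-value: outcomes at least as far from the null mean as observed.
    observed_dev = abs(2 * w_plus2 - total2)
    extreme = sum(c for s, c in counts.items() if abs(2 * s - total2) >= observed_dev)
    return w_plus2 / 2, extreme / 2 ** n


# Hypothetical per-dataset accuracies (in percent) for two made-up models.
model_a = [91, 85, 78, 88, 93, 80]
model_b = [89, 80, 79, 84, 90, 74]

if __name__ == "__main__":
    print(wilcoxon_signed_rank(model_a, model_b))
```

With only six paired datasets the smallest achievable two-sided p-value is 2/64, which is a reminder that significance claims on small benchmark slices need care.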

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed review of our technical report. We address each major comment point by point below, providing clarifications and committing to revisions that strengthen the presentation of our results without altering the core claims.

Point-by-point responses
  1. Referee: [Abstract] Abstract and Experimental Results: The headline claims of 'significant margin' outperformance on TabArena and 200–420 Elo gains for TabPFN-3-Plus are presented without error bars, statistical significance tests, exact train/test splits, or ablation studies on the synthetic generator, rendering the margins unverifiable from the provided information.

    Authors: We agree that additional statistical details would improve verifiability. In the revised manuscript we have added error bars (standard deviation over 10 independent runs with different random seeds) to all TabArena metrics, included Wilcoxon signed-rank tests confirming significance (p < 0.01 for the headline margins), explicitly cited the exact TabArena train/test splits per the benchmark protocol, and inserted an appendix ablation varying the synthetic generator's key hyperparameters while measuring downstream Elo impact.
    Revision: yes

  2. Referee: [Pretraining Methodology] Pretraining section: The statement that pretraining uses 'exclusively synthetic data from our prior' lacks any decontamination protocol, held-out real-data validation set, or ablation freezing the generator before benchmark exposure; without these, the reported superiority over tuned baselines on TabArena risks circularity if the generator's feature/label/missingness distributions were calibrated to the evaluation suites.

    Authors: The generator was developed and frozen in prior work before TabArena and the other cited benchmarks existed, so no direct calibration occurred. We have added a new subsection to the Pretraining section that (1) describes the decontamination protocol (Kolmogorov-Smirnov tests on held-out real tabular samples to confirm distribution mismatch), (2) references a held-out real-data validation set used during generator development, and (3) reports an ablation in which the generator parameters are frozen prior to any benchmark exposure, showing that TabPFN-3 performance remains essentially unchanged.
    Revision: yes
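
The simulated rebuttal mentions Kolmogorov-Smirnov tests for comparing synthetic and real distributions. As a hedged illustration of what such a check computes, the sketch below evaluates the two-sample KS statistic (the largest gap between two empirical CDFs) for a single numeric feature. How the test would be applied across features, and what threshold would count as a mismatch, is not specified in the text; `ks_statistic` is a hypothetical helper.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b) -> float:
    """Two-sample Kolmogorov-Smirnov statistic: max gap between empirical CDFs."""
    a_sorted, b_sorted = sorted(sample_a), sorted(sample_b)
    gap = 0.0
    # The empirical CDFs only change at observed values, so checking those suffices.
    for value in a_sorted + b_sorted:
        cdf_a = bisect_right(a_sorted, value) / len(a_sorted)
        cdf_b = bisect_right(b_sorted, value) / len(b_sorted)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


if __name__ == "__main__":
    # Toy values standing in for one synthetic and one real feature column.
    print(ks_statistic([1, 2, 3, 4], [3, 4, 5, 6]))
```

A statistic of 0 means the two empirical distributions coincide, and 1 means they do not overlap at all.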

Circularity Check

0 steps flagged

Minor self-citation to prior synthetic generator; benchmark results remain externally validated

full rationale

The paper reports empirical wins on independent public benchmarks (TabArena, RelBenchV1, fev-bench) after pretraining exclusively on synthetic data referenced to prior work. No equations, fitted parameters, or derivations reduce the claimed Elo margins, speedups, or Pareto dominance to quantities defined by the evaluation suites themselves. The single self-reference to 'our prior' for the data generator is present but does not carry the load-bearing justification for the performance numbers, which rest on external test sets.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; the central claims rest on the unstated assumption that synthetic data from prior TabPFN versions is sufficiently representative of real tabular distributions.

pith-pipeline@v0.9.0 · 5882 in / 1368 out tokens · 73855 ms · 2026-05-15T06:01:00.864685+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. RelPrism: A Multi-Faceted Pre-training Framework with Self-Generated Tasks for Relational Databases

    cs.LG · 2026-05 · unverdicted · novelty 6.0

    RelPrism generates self-supervised pseudo-tasks from three attribute perspectives via multi-granularity clustering to improve representation learning for relational database prediction tasks.