Is TabPFN the Silver Bullet for Insurance Pricing?

Bruno Deprez; Tim Verdonck; Wouter Verbeke

arxiv: 2605.22892 · v2 · pith:MPNCMKI6new · submitted 2026-05-21 · 💱 q-fin.RM · cs.LG

Is TabPFN the Silver Bullet for Insurance Pricing?

Bruno Deprez , Wouter Verbeke , Tim Verdonck This is my paper

Pith reviewed 2026-05-25 02:39 UTC · model grok-4.3

classification 💱 q-fin.RM cs.LG

keywords TabPFNinsurance pricingGLMXGBoosttabular foundation modelsmotor insuranceclaim frequencyin-context learning

0 comments

The pith

TabPFN does not consistently outperform GLM and XGBoost for motor insurance pricing and shows longer inference times.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper benchmarks TabPFN, a tabular foundation model that performs inference through in-context learning after pre-training on synthetic data, against GLM and XGBoost for claim frequency and severity modeling on two public MTPL datasets. It reports that TabPFN delivers no consistent accuracy gains, requires substantially longer inference times, and varies with the size of the in-context training set. A reader would therefore see current tabular foundation models as promising mainly for data-scarce cases but not yet ready to replace standard actuarial pricing methods.

Core claim

When applied to two public MTPL datasets via in-context learning, TabPFN does not consistently beat GLM or XGBoost in predictive performance for non-life insurance pricing, while incurring substantially longer inference times and exhibiting sensitivity to the size of the in-context training set. The authors conclude that tabular foundation models in their present form do not offer a viable replacement for established actuarial methods such as generalised linear models.

What carries the argument

TabPFN's in-context learning on pre-trained synthetic tabular data, which enables task inference without dataset-specific fitting or hyperparameter tuning.

Load-bearing premise

The two public MTPL datasets and the chosen in-context learning protocol are representative enough to support general conclusions about TabPFN for non-life insurance pricing.

What would settle it

TabPFN showing consistent outperformance over both GLM and XGBoost on several additional, diverse insurance datasets while matching their inference times would falsify the central claim.

Figures

Figures reproduced from arXiv: 2605.22892 by Bruno Deprez, Tim Verdonck, Wouter Verbeke.

**Figure 2.** Figure 2: The RMSE and gamma deviance across the five folds for the two severity datasets. [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗

**Figure 3.** Figure 3: Training and inference time of the models. 5k and 50k for readability reasons, but this does not change the [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗

read the original abstract

Modelling claim frequency and severity for non-life insurance pricing predominantly relies on generalised linear models, with gradient-boosted machines as the leading machine learning alternative. Tabular foundation models (TFMs) present a fundamentally different inference paradigm. By pre-training on large collections of synthetic datasets, TFMs enable inference on new data through in-context learning, without any dataset-specific fitting or hyperparameter tuning. This paper presents a first empirical evaluation of TabPFN for motor insurance pricing, benchmarking it against GLM and XGBoost on two publicly available MTPL datasets. Our results show that TabPFN does not consistently outperform established baselines, exhibits substantially longer inference times, and is sensitive to the size of the in-context training set. While tabular foundation models represent a promising direction, particularly in data-scarce settings, their current performance does not offer a viable replacement for established actuarial methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

TabPFN does not beat GLM or XGBoost on the two public MTPL datasets and runs slower, but the claim that it is not a viable replacement for non-life pricing rests on narrow evidence.

read the letter

The main things to know are that this is the first direct benchmark of TabPFN against GLM and XGBoost for motor insurance pricing, and the results are negative on both accuracy and speed. TabPFN does not consistently win, inference takes substantially longer, and performance varies with the size of the in-context set. The paper states this plainly and notes that tabular foundation models may still have value in low-data regimes, but current versions are not ready to replace established methods.

Referee Report

3 major / 2 minor

Summary. The paper presents the first empirical evaluation of TabPFN, a tabular foundation model using in-context learning, for motor third-party liability (MTPL) insurance pricing. It benchmarks TabPFN against GLM and XGBoost on two publicly available datasets, reporting that TabPFN does not consistently outperform the baselines, exhibits substantially longer inference times, and is sensitive to the size of the in-context training set. The authors conclude that while TFMs are promising in data-scarce settings, their current formulation does not offer a viable replacement for established actuarial methods.

Significance. If the empirical results are robust, the work supplies a useful negative benchmark on public data that tempers enthusiasm for immediate adoption of current tabular foundation models in non-life pricing. The explicit comparison of inference cost and context-size sensitivity, together with the use of held-out public datasets, provides concrete guidance for future TFM development in actuarial applications.

major comments (3)

[Abstract and §5 (Conclusion)] The generalization that TabPFN 'does not offer a viable replacement for established actuarial methods' rests on results from only two MTPL datasets under a single in-context protocol. The manuscript should either restrict the claim to the tested setting or supply additional evidence (e.g., results on other lines of business or alternative protocols) that these cases are representative; otherwise the central negative conclusion is under-supported.
[§4 (Experiments)] §4 (Experiments): the benchmark tables and figures report point estimates without error bars, standard deviations across random seeds, or statistical significance tests. Given the paper's own finding that performance is sensitive to in-context training-set size, the absence of variability measures makes it impossible to judge whether observed differences versus GLM and XGBoost are reliable.
[§3 (Methodology)] §3 (Methodology): the in-context learning protocol is described at a high level but omits concrete details on feature encoding (especially exposure and categorical insurance variables), the precise construction of the in-context set, and any preprocessing steps. These choices are load-bearing for reproducing the reported sensitivity and inference-time results.

minor comments (2)

[Tables 1–3] Table captions and axis labels should explicitly state the performance metric (e.g., Poisson deviance, Gini coefficient) and the exact context size used for each TabPFN entry.
[Abstract and §4] The abstract states 'substantially longer inference times' without quantifying the factor or hardware; a short table or sentence in §4 giving wall-clock times per prediction would improve clarity.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major comment point by point below, indicating the revisions we will make to the manuscript.

read point-by-point responses

Referee: [Abstract and §5 (Conclusion)] The generalization that TabPFN 'does not offer a viable replacement for established actuarial methods' rests on results from only two MTPL datasets under a single in-context protocol. The manuscript should either restrict the claim to the tested setting or supply additional evidence (e.g., results on other lines of business or alternative protocols) that these cases are representative; otherwise the central negative conclusion is under-supported.

Authors: We agree that the central conclusion is based solely on the two MTPL datasets and the single in-context protocol examined. We will revise the abstract and §5 to explicitly restrict the claim to these tested settings and datasets, while noting that broader applicability to other lines of business would require additional evaluation. We are not in a position to supply results on further datasets or protocols in the current revision. revision: partial
Referee: [§4 (Experiments)] §4 (Experiments): the benchmark tables and figures report point estimates without error bars, standard deviations across random seeds, or statistical significance tests. Given the paper's own finding that performance is sensitive to in-context training-set size, the absence of variability measures makes it impossible to judge whether observed differences versus GLM and XGBoost are reliable.

Authors: We accept this criticism. In the revised version we will report standard deviations across multiple random seeds for the in-context sets in all tables and figures, and we will add statistical significance tests comparing TabPFN to the baselines to allow readers to assess the reliability of the differences. revision: yes
Referee: [§3 (Methodology)] §3 (Methodology): the in-context learning protocol is described at a high level but omits concrete details on feature encoding (especially exposure and categorical insurance variables), the precise construction of the in-context set, and any preprocessing steps. These choices are load-bearing for reproducing the reported sensitivity and inference-time results.

Authors: We will expand the methodology section to include the missing details: the exact encoding of exposure and categorical variables, the sampling procedure and size selection for the in-context set, and all preprocessing steps applied to the two datasets. revision: yes

Circularity Check

0 steps flagged

No circularity: pure empirical benchmarking on held-out data

full rationale

The paper performs a direct empirical comparison of TabPFN against GLM and XGBoost baselines on two fixed public MTPL datasets using a stated in-context protocol. No derivations, parameter fits renamed as predictions, self-citations invoked as uniqueness theorems, or ansatzes are present in the central claims. Results are reported from held-out evaluation without any reduction of outputs to inputs by construction, rendering the analysis self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Empirical benchmark study; no mathematical derivations, no free parameters fitted to produce the headline result, no new entities postulated, and no domain axioms beyond standard assumptions of supervised learning on tabular data.

pith-pipeline@v0.9.0 · 5676 in / 1033 out tokens · 27260 ms · 2026-05-25T02:39:00.629503+00:00 · methodology

Is TabPFN the Silver Bullet for Insurance Pricing?

Core claim

What carries the argument

Load-bearing premise

What would settle it

discussion (0)