arxiv: 2605.04363 · v1 · submitted 2026-05-06 · 💻 cs.LG · cs.AI

Recognition: 3 theorem links

· Lean Theorem

Mitigating Label Shift in Tabular In-Context Learning via Test-Time Posterior Adjustment

Seunghan Lee , Jaehoon Lee , Jun Seo , Sungdong Yoo , Minjae Kim , Tae Yoon Lim , Dongwan Kang , Hwanil Choi

show 2 more authors

Soonyoung Lee Wonbin Ahn

Authors on Pith no claims yet

Pith reviewed 2026-05-08 18:26 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords label shifttabular datain-context learningTabPFNposterior adjustmenttest-time adaptationfoundation modelsclassification

0 comments

The pith

DistPFN rescales TabPFN output probabilities at test time to counteract label shift by downweighting the training class prior.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

TabPFN models for tabular data tend to overfit to the class distribution seen in their context examples, leading to poor performance when the test data has a different label distribution. The paper introduces DistPFN, a method that adjusts the model's predicted probabilities after inference by reducing the effect of the observed training prior and optionally applying an adaptive temperature. This requires no changes to the model architecture or any retraining. Experiments across more than 250 OpenML datasets show clear gains under label shift while leaving accuracy unchanged when no shift is present.

Core claim

TabPFN overfits to the majority class in the training context under label shift. DistPFN corrects this by rescaling the predicted class probabilities to downweight the training prior and emphasize the model's own posterior, with an optional temperature-scaled variant that adapts the adjustment strength to the observed discrepancy between prior and posterior. The adjustment occurs purely at test time with no architectural modifications or additional training.

What carries the argument

DistPFN posterior adjustment, which rescales predicted probabilities by downweighting the training prior (and optionally applies adaptive temperature scaling based on prior-posterior discrepancy).

Load-bearing premise

Rescaling the output probabilities by downweighting the training prior will recover a better posterior under label shift without introducing new errors or needing the true test prior.

What would settle it

Measure accuracy on a set of label-shifted datasets using the adjusted probabilities versus the unadjusted TabPFN outputs and versus an oracle model that has access to the true test label distribution.

Figures

Figures reproduced from arXiv: 2605.04363 by Dongwan Kang, Hwanil Choi, Jaehoon Lee, Jun Seo, Minjae Kim, Seunghan Lee, Soonyoung Lee, Sungdong Yoo, Tae Yoon Lim, Wonbin Ahn.

**Figure 1.** Figure 1: Majority-class bias. TabPFN suffers from majorityclass bias, resulting in poor recall for the minority class. 2020; Gorishniy et al., 2021), have emerged as strong alternatives by capturing complex feature interactions. Among these methods, TabPFN (Hollmann et al., 2023) introduces in-context learning (ICL) to tabular classification by pretraining on synthetic datasets and producing predictions for tes… view at source ↗

**Figure 3.** Figure 3: Robustness to shift with DistPFN. tion datasets to evaluate the effectiveness of DistPFN in both standard and label-shifted classification scenarios. As shown in view at source ↗

**Figure 2.** Figure 2: Direct utilization of train-set in TabPFN-based models. v2 (Hollmann et al., 2025) exhibits a majority-class bias, making incorrect predictions even when trained and tested on the same dataset (CostaMadre1 (Bischl et al., 2017)). To this end, we propose DistPFN, a simple yet effective test-time adaptation method that improves the robustness of TabPFN-based models to label shift, without modifying the archi… view at source ↗

**Figure 4.** Figure 4: Overall framework of DistPFN. (a) TabPFN exhibits a majority-class bias under label shift, predicting test instances toward the majority class in the training dataset. (b) DistPFN mitigates this bias via a simple test-time adaptation method that rescales the predicted class probabilities for each test instance. 4.1. TabPFN In tabular in-context learning, TabPFN predicts the label of a test instance xj by c… view at source ↗

**Figure 5.** Figure 5: illustrates the label distributions of the training/test datasets after oversampling, with respect to the shift strength (β) and the class distribution of the entire dataset. Higher β assigns higher sampling probabilities to rare classes, inducing a stronger shift, whereas β = 0 corresponds to uniform sampling. Note that β = 0 does not imply the absence of label shift, as the dataset itself may be class-i… view at source ↗

**Figure 6.** Figure 6: Class balance ratio of 253 OpenML datasets. 5. Experiments 5.1. Experimental Setup Task and metrics. We evaluate our methods on tabular classification tasks both with and without label shift. For the evaluation metrics, we employ accuracy (Acc.) and average rank (Rank), following the previous works (Hollmann et al., 2023). We additionally report expected calibration error (ECE) and precision to provide eva… view at source ↗

**Figure 9.** Figure 9: Comparison with oracle. Stronger Shift → Higher Improvement view at source ↗

**Figure 10.** Figure 10: Per-dataset improvement. The figure shows the accuracy improvement for each dataset under varying β values with DistPFN-T applied, shown against the KL-divergence between the train and test label distributions of the original dataset. + EME + BBE + DistPFN + DistPFN-T LoCalPFN TabICL TabPFN-v2 LoCalPFN TabICL TabPFN-v2 -1.5 -1.1 +0.0 +0.0 -0.8 -0.4 +0.0 +0.0 -1.7 -1.3 +0.0 +0.0 +1.2 +1.6 +1.7 +2.0 +2.4 +2… view at source ↗

read the original abstract

TabPFN has recently gained attention as a foundation model for tabular datasets, achieving strong performance by leveraging in-context learning on synthetic data. However, we find that TabPFN is vulnerable to label shift, often overfitting to the majority class in the training dataset. To address this limitation, we propose DistPFN, the first test-time posterior adjustment method designed for tabular foundation models. DistPFN rescales predicted class probabilities by downweighting the influence of the training prior (i.e., the class distribution of the context) and emphasizing the contribution of the model's predicted posterior, without architectural modification or additional training. We further introduce DistPFN-T, which incorporates temperature scaling to adaptively control the adjustment strength based on the discrepancy between prior and posterior. We evaluate our methods on over 250 OpenML datasets, demonstrating substantial improvements for various TabPFN-based models in classification tasks under label shift, while maintaining strong performance in standard settings without label shift. Code is available at this repository: https://github.com/seunghan96/DistPFN.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

DistPFN is a lightweight test-time rescaling for TabPFN that shows empirical gains under label shift on hundreds of datasets, but the adjustment rule lacks a derivation from the standard label-shift Bayes correction and reads as heuristic.

read the letter

The core contribution is a test-time adjustment that downweights the class distribution seen in the in-context examples and optionally scales by an adaptive temperature based on prior-posterior mismatch. They call the basic version DistPFN and the temperature one DistPFN-T. The abstract reports that this improves TabPFN-based models on classification under label shift across more than 250 OpenML datasets while preserving performance when no shift is present, all without retraining or architecture changes. Code is released, which is useful for anyone who wants to try it directly.

Referee Report

2 major / 2 minor

Summary. The manuscript claims that TabPFN and similar tabular in-context learning models overfit to the majority class under label shift. It introduces DistPFN, a test-time posterior adjustment that rescales predicted class probabilities by downweighting the class distribution of the in-context examples (training prior) while emphasizing the model's posterior, plus the variant DistPFN-T that adds adaptive temperature scaling based on prior-posterior discrepancy. The methods require no architectural changes or retraining and are reported to yield substantial gains on classification tasks across more than 250 OpenML datasets under label shift while preserving performance in the absence of shift.

Significance. If the adjustment reliably approximates label-shift correction, the work would provide a lightweight, training-free robustness tool for tabular foundation models. The scale of the evaluation (>250 datasets) and public code release are positive features that would support adoption if the central mechanism is shown to be sound.

major comments (2)

[§3.1] §3.1 (DistPFN adjustment formula): The rescaling is presented as downweighting the training prior without an explicit derivation from the label-shift Bayes rule p(y|x) ∝ p(x|y) * p_test(y)/p_train(y). It is not shown whether the particular form recovers (or approximates) the correct test posterior when the true test prior is unknown, or whether it remains a heuristic that can distort probabilities even without shift.
[§4.2] §4.2 (DistPFN-T temperature adaptation): The adaptive temperature is described as depending on the discrepancy between prior and posterior, but the manuscript does not clarify whether this involves any post-hoc fitting on evaluation data or remains strictly test-time; this affects whether the method stays parameter-free as claimed.

minor comments (2)

[§4.1] §4.1: The precise procedure used to induce label shift on the OpenML datasets (e.g., how test class proportions are chosen and whether they are known at adjustment time) should be stated explicitly for reproducibility.
[Table 1] Table 1 and Figure 3: Error bars and statistical significance tests for the reported accuracy/F1 improvements are not described; adding them would strengthen the empirical claims.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We appreciate the opportunity to address the concerns regarding the theoretical grounding of DistPFN and the test-time nature of DistPFN-T. We provide point-by-point responses below and have revised the manuscript accordingly to improve clarity.

read point-by-point responses

Referee: [§3.1] §3.1 (DistPFN adjustment formula): The rescaling is presented as downweighting the training prior without an explicit derivation from the label-shift Bayes rule p(y|x) ∝ p(x|y) * p_test(y)/p_train(y). It is not shown whether the particular form recovers (or approximates) the correct test posterior when the true test prior is unknown, or whether it remains a heuristic that can distort probabilities even without shift.

Authors: We agree that an explicit derivation strengthens the presentation. DistPFN is motivated by the label-shift correction: the model provides an estimate of p(x|y) p_train(y), so rescaling the output probabilities by the inverse of the context class distribution (p_train(y)) approximates multiplication by p_test(y)/p_train(y) when p_test(y) is unknown. The resulting normalized probabilities therefore emphasize the model's learned posterior while reducing the training prior's influence. We acknowledge this is an approximation rather than an exact recovery of the test posterior. In the revised manuscript we have added a short derivation in §3.1 that starts from the label-shift Bayes rule, states the approximation explicitly, and discusses conditions under which the adjustment remains beneficial. We also include a brief analysis showing that, when there is no shift, the adjustment does not materially distort probabilities (consistent with the empirical results that performance is preserved on unshifted data). revision: yes
Referee: [§4.2] §4.2 (DistPFN-T temperature adaptation): The adaptive temperature is described as depending on the discrepancy between prior and posterior, but the manuscript does not clarify whether this involves any post-hoc fitting on evaluation data or remains strictly test-time; this affects whether the method stays parameter-free as claimed.

Authors: We apologize for the ambiguity. The temperature in DistPFN-T is computed entirely at test time for each query point: the discrepancy is measured between the empirical class distribution of the in-context examples and the model's own posterior probabilities on that point; the temperature is then set proportionally to this discrepancy. No parameters are optimized or fitted on any evaluation, validation, or held-out data. The procedure uses only quantities already available during inference. We have revised §4.2 to state this explicitly, provide the exact temperature formula, and reaffirm that the method remains strictly test-time and parameter-free. revision: yes

Circularity Check

0 steps flagged

No circularity: DistPFN is an explicitly defined test-time heuristic with external evaluation

full rationale

The paper introduces DistPFN as a proposed rescaling procedure that operates directly on the model's output posterior and the observed class distribution in the in-context examples. No derivation chain is claimed that reduces a 'prediction' or 'first-principles result' back to the same fitted quantities by construction. Evaluation occurs on held-out OpenML datasets under controlled label-shift conditions, and the central performance claims rest on these external benchmarks rather than on any self-referential fit or self-citation load-bearing step. The method is therefore self-contained as an empirical adjustment technique.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Review performed on abstract only; the method rests on the assumption that a simple rescaling of probabilities suffices to correct label-shift bias in in-context tabular models.

axioms (2)

domain assumption TabPFN overfits to the class distribution present in the in-context examples
Stated as the central observed limitation.
ad hoc to paper Downweighting the training prior while preserving the model's posterior improves calibration under label shift
Core design choice of DistPFN.

pith-pipeline@v0.9.0 · 5518 in / 1277 out tokens · 20379 ms · 2026-05-08T18:26:01.741372+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation (J-cost uniqueness) washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

DistPFN rescales predicted class probabilities by downweighting the influence of the training prior ... and emphasizing the contribution of the model's predicted posterior
IndisputableMonolith/Foundation/AlphaCoordinateFixation (calibration of cost via derivatives) alpha_pin_under_high_calibration unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

τ = CE(p̂_TabPFN(y), p_train(y)) ... temperature-scaled softmax

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

16 extracted references · 7 canonical work pages · 1 internal anchor

[1]

Regularized

Azizzadenesheli, K., Liu, A., Yang, F., and Anandkumar, A. Regularized learning for domain adaptation under label shifts.arXiv preprint arXiv:1903.09734,

work page arXiv 1903
[2]

arXiv preprint arXiv:1708.03731 , year=

Bischl, B., Casalicchio, G., Feurer, M., Gijsbers, P., Hutter, F., Lang, M., Mantovani, R. G., van Rijn, J. N., and Vanschoren, J. Openml benchmarking suites.arXiv preprint arXiv:1708.03731,

work page arXiv
[3]

Tabm: Advancing tabular deep learning with parameter-efficient ensembling.ICLR, 2024a

10 Mitigating Label Shift in Tabular In-Context Learning via Test-Time Posterior Adjustment Gorishniy, Y ., Kotelnikov, A., and Babenko, A. Tabm: Advancing tabular deep learning with parameter-efficient ensembling.ICLR, 2024a. Gorishniy, Y ., Rubachev, I., Kartashev, N., Shlenskii, D., Kotelnikov, A., and Babenko, A. Tabr: Tabular deep learning meets near...

2023
[4]

TabTransformer: Tabular Data Modeling Using Contextual Embeddings

Huang, X., Khetan, A., Cvitkovic, M., and Karnin, Z. Tab- transformer: Tabular data modeling using contextual em- beddings.arXiv preprint arXiv:2012.06678,

work page internal anchor Pith review arXiv 2012
[5]

Y ., and Yang, E

Kim, C., Kim, T., Woo, S., Yang, J. Y ., and Yang, E. Adapt- able: Test-time adaptation for tabular data via shift-aware uncertainty calibrator and label distribution handler.arXiv preprint arXiv:2407.10784,

work page arXiv
[6]

Long-tail learning via logit adjustment.arXiv preprint arXiv:2007.07314, 2020

Menon, A. K., Jayasumana, S., Rawat, A. S., Jain, H., Veit, A., and Kumar, S. Long-tail learning via logit adjustment. arXiv preprint arXiv:2007.07314,

work page arXiv 2007
[7]

Saint: Improved neural networks for tabular data via row attention and contrastive pre-training.arXiv preprint arXiv:2106.01342, 2021

Somepalli, G., Goldblum, M., Schwarzschild, A., Bruss, C. B., and Goldstein, T. Saint: Improved neural networks for tabular data via row attention and contrastive pre- training.arXiv preprint arXiv:2106.01342,

work page arXiv
[8]

Thielmann, A. F. and Samiee, S. On the efficiency of nlp- inspired methods for tabular deep learning.arXiv preprint arXiv:2411.17207,

work page arXiv
[9]

Dataset.We evaluate on 250+ tabular datasets from OpenML (Bischl et al., 2017)

For inference, we load the pretrained weights from TabPFN-v23 available on Hugging Face. Dataset.We evaluate on 250+ tabular datasets from OpenML (Bischl et al., 2017). The dataset list is retrieved from the benchmark configuration provided in this repository4, which is built on top of the official TabPFN evaluation setup. Dataset statistics are summarize...

2017
[10]

• DL (non-foundation) models (5):FT-Transformer (Gorishniy et al., 2021), TabM (Gorishniy et al., 2024a), Tabu- laRNN (Thielmann & Samiee, 2024), MambaTab (Ahamed & Cheng, 2024), RealMLP (Holzm¨uller et al.,

2021
[11]

• DL (foundation) models based on ICL (3):TabPFN-v2 (Hollmann et al., 2025), LoCalPFN (Thomas et al., 2024), TabICL (Qu et al.,

2025
[12]

Details of each method are provided below. C.1. Machine Learning (ML) Models • Logistic Regression (LR)(Cox, 1958): A simple linear model commonly used for binary and multiclass classification tasks in tabular data. • Support Vector Machine (SVM)(Cortes & Vapnik, 1995): A kernel-based classifier that aims to find the optimal decision boundary with maximum...

1958
[13]

Bayesian view that replaces the mismatched prior with a self-consistent estimate from model predictions (Section E.2). E.1. Relation to Label Shift Correction The label shift setting assumes that the conditional distributionp(x|y)remains invariant while the marginal priors differ: ptrain(y)̸=p test(y), p(x|y)is fixed. Under this assumption, the Bayes-opti...

2002
[14]

We conduct random search over these spaces and tune the models on validation datasets that are kept separate from the final test splits

The search spaces are manually designed to cover commonly used ranges for each model class, including both optimization-related parameters and regularization or structural options. We conduct random search over these spaces and tune the models on validation datasets that are kept separate from the final test splits. The details of the hyperparameter searc...

2017
[15]

Table L.1 presents the results, showing that our method is effective without requiring estimation of the test prior

and Black-box Estimation (BBE) (Lipton et al., 2018). Table L.1 presents the results, showing that our method is effective without requiring estimation of the test prior. Figure L.1.Comparison with other label shift methods. Methods w/o shift Shift strength (β) 0.0 0.1 0.5 1.0 2.0 5.0 Avg. LoCalPFN 0.816 0.794 0.793 0.788 0.778 0.753 0.719 0.771 + EME 0.8...

2018
[16]

the average prediction acrossmultiple instances. As shown in Table M.1, both choices consistently improve TabPFN-v2 (Hollmann et al., 2025), averaged across sixβs forw/ shift, demonstrating robustness to the choice of distribution source. Table M.1.Predicted distributions: Single vs. Multiple.The proposed methods consistently improves TabPFN-v2 regardless...

2025