pith. machine review for the scientific record. sign in

arxiv: 2605.04363 · v1 · submitted 2026-05-06 · 💻 cs.LG · cs.AI

Recognition: 3 theorem links

· Lean Theorem

Mitigating Label Shift in Tabular In-Context Learning via Test-Time Posterior Adjustment

Authors on Pith no claims yet

Pith reviewed 2026-05-08 18:26 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords label shifttabular datain-context learningTabPFNposterior adjustmenttest-time adaptationfoundation modelsclassification
0
0 comments X

The pith

DistPFN rescales TabPFN output probabilities at test time to counteract label shift by downweighting the training class prior.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

TabPFN models for tabular data tend to overfit to the class distribution seen in their context examples, leading to poor performance when the test data has a different label distribution. The paper introduces DistPFN, a method that adjusts the model's predicted probabilities after inference by reducing the effect of the observed training prior and optionally applying an adaptive temperature. This requires no changes to the model architecture or any retraining. Experiments across more than 250 OpenML datasets show clear gains under label shift while leaving accuracy unchanged when no shift is present.

Core claim

TabPFN overfits to the majority class in the training context under label shift. DistPFN corrects this by rescaling the predicted class probabilities to downweight the training prior and emphasize the model's own posterior, with an optional temperature-scaled variant that adapts the adjustment strength to the observed discrepancy between prior and posterior. The adjustment occurs purely at test time with no architectural modifications or additional training.

What carries the argument

DistPFN posterior adjustment, which rescales predicted probabilities by downweighting the training prior (and optionally applies adaptive temperature scaling based on prior-posterior discrepancy).

Load-bearing premise

Rescaling the output probabilities by downweighting the training prior will recover a better posterior under label shift without introducing new errors or needing the true test prior.

What would settle it

Measure accuracy on a set of label-shifted datasets using the adjusted probabilities versus the unadjusted TabPFN outputs and versus an oracle model that has access to the true test label distribution.

Figures

Figures reproduced from arXiv: 2605.04363 by Dongwan Kang, Hwanil Choi, Jaehoon Lee, Jun Seo, Minjae Kim, Seunghan Lee, Soonyoung Lee, Sungdong Yoo, Tae Yoon Lim, Wonbin Ahn.

Figure 1
Figure 1. Figure 1: Majority-class bias. TabPFN suffers from majority￾class bias, resulting in poor recall for the minority class. 2020; Gorishniy et al., 2021), have emerged as strong alter￾natives by capturing complex feature interactions. Among these methods, TabPFN (Hollmann et al., 2023) introduces in-context learning (ICL) to tabular classifi￾cation by pretraining on synthetic datasets and produc￾ing predictions for tes… view at source ↗
Figure 3
Figure 3. Figure 3: Robustness to shift with DistPFN. tion datasets to evaluate the effectiveness of DistPFN in both standard and label-shifted classification scenarios. As shown in view at source ↗
Figure 2
Figure 2. Figure 2: Direct utilization of train-set in TabPFN-based models. v2 (Hollmann et al., 2025) exhibits a majority-class bias, making incorrect predictions even when trained and tested on the same dataset (CostaMadre1 (Bischl et al., 2017)). To this end, we propose DistPFN, a simple yet effective test-time adaptation method that improves the robustness of TabPFN-based models to label shift, without modifying the archi… view at source ↗
Figure 4
Figure 4. Figure 4: Overall framework of DistPFN. (a) TabPFN exhibits a majority-class bias under label shift, predicting test instances toward the majority class in the training dataset. (b) DistPFN mitigates this bias via a simple test-time adaptation method that rescales the predicted class probabilities for each test instance. 4.1. TabPFN In tabular in-context learning, TabPFN predicts the label of a test instance xj by c… view at source ↗
Figure 5
Figure 5. Figure 5: illustrates the label distributions of the training/test datasets after oversampling, with respect to the shift strength (β) and the class distribution of the entire dataset. Higher β assigns higher sampling probabilities to rare classes, induc￾ing a stronger shift, whereas β = 0 corresponds to uniform sampling. Note that β = 0 does not imply the absence of label shift, as the dataset itself may be class-i… view at source ↗
Figure 6
Figure 6. Figure 6: Class balance ratio of 253 OpenML datasets. 5. Experiments 5.1. Experimental Setup Task and metrics. We evaluate our methods on tabular classification tasks both with and without label shift. For the evaluation metrics, we employ accuracy (Acc.) and average rank (Rank), following the previous works (Hollmann et al., 2023). We additionally report expected calibration error (ECE) and precision to provide eva… view at source ↗
Figure 8
Figure 8. Figure 8 view at source ↗
Figure 9
Figure 9. Figure 9: Comparison with oracle. Stronger Shift → Higher Improvement view at source ↗
Figure 10
Figure 10. Figure 10: Per-dataset improvement. The figure shows the accuracy improvement for each dataset under varying β values with DistPFN-T applied, shown against the KL-divergence between the train and test label distributions of the original dataset. + EME + BBE + DistPFN + DistPFN-T LoCalPFN TabICL TabPFN-v2 LoCalPFN TabICL TabPFN-v2 -1.5 -1.1 +0.0 +0.0 -0.8 -0.4 +0.0 +0.0 -1.7 -1.3 +0.0 +0.0 +1.2 +1.6 +1.7 +2.0 +2.4 +2… view at source ↗
read the original abstract

TabPFN has recently gained attention as a foundation model for tabular datasets, achieving strong performance by leveraging in-context learning on synthetic data. However, we find that TabPFN is vulnerable to label shift, often overfitting to the majority class in the training dataset. To address this limitation, we propose DistPFN, the first test-time posterior adjustment method designed for tabular foundation models. DistPFN rescales predicted class probabilities by downweighting the influence of the training prior (i.e., the class distribution of the context) and emphasizing the contribution of the model's predicted posterior, without architectural modification or additional training. We further introduce DistPFN-T, which incorporates temperature scaling to adaptively control the adjustment strength based on the discrepancy between prior and posterior. We evaluate our methods on over 250 OpenML datasets, demonstrating substantial improvements for various TabPFN-based models in classification tasks under label shift, while maintaining strong performance in standard settings without label shift. Code is available at this repository: https://github.com/seunghan96/DistPFN.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript claims that TabPFN and similar tabular in-context learning models overfit to the majority class under label shift. It introduces DistPFN, a test-time posterior adjustment that rescales predicted class probabilities by downweighting the class distribution of the in-context examples (training prior) while emphasizing the model's posterior, plus the variant DistPFN-T that adds adaptive temperature scaling based on prior-posterior discrepancy. The methods require no architectural changes or retraining and are reported to yield substantial gains on classification tasks across more than 250 OpenML datasets under label shift while preserving performance in the absence of shift.

Significance. If the adjustment reliably approximates label-shift correction, the work would provide a lightweight, training-free robustness tool for tabular foundation models. The scale of the evaluation (>250 datasets) and public code release are positive features that would support adoption if the central mechanism is shown to be sound.

major comments (2)
  1. [§3.1] §3.1 (DistPFN adjustment formula): The rescaling is presented as downweighting the training prior without an explicit derivation from the label-shift Bayes rule p(y|x) ∝ p(x|y) * p_test(y)/p_train(y). It is not shown whether the particular form recovers (or approximates) the correct test posterior when the true test prior is unknown, or whether it remains a heuristic that can distort probabilities even without shift.
  2. [§4.2] §4.2 (DistPFN-T temperature adaptation): The adaptive temperature is described as depending on the discrepancy between prior and posterior, but the manuscript does not clarify whether this involves any post-hoc fitting on evaluation data or remains strictly test-time; this affects whether the method stays parameter-free as claimed.
minor comments (2)
  1. [§4.1] §4.1: The precise procedure used to induce label shift on the OpenML datasets (e.g., how test class proportions are chosen and whether they are known at adjustment time) should be stated explicitly for reproducibility.
  2. [Table 1] Table 1 and Figure 3: Error bars and statistical significance tests for the reported accuracy/F1 improvements are not described; adding them would strengthen the empirical claims.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We appreciate the opportunity to address the concerns regarding the theoretical grounding of DistPFN and the test-time nature of DistPFN-T. We provide point-by-point responses below and have revised the manuscript accordingly to improve clarity.

read point-by-point responses
  1. Referee: [§3.1] §3.1 (DistPFN adjustment formula): The rescaling is presented as downweighting the training prior without an explicit derivation from the label-shift Bayes rule p(y|x) ∝ p(x|y) * p_test(y)/p_train(y). It is not shown whether the particular form recovers (or approximates) the correct test posterior when the true test prior is unknown, or whether it remains a heuristic that can distort probabilities even without shift.

    Authors: We agree that an explicit derivation strengthens the presentation. DistPFN is motivated by the label-shift correction: the model provides an estimate of p(x|y) p_train(y), so rescaling the output probabilities by the inverse of the context class distribution (p_train(y)) approximates multiplication by p_test(y)/p_train(y) when p_test(y) is unknown. The resulting normalized probabilities therefore emphasize the model's learned posterior while reducing the training prior's influence. We acknowledge this is an approximation rather than an exact recovery of the test posterior. In the revised manuscript we have added a short derivation in §3.1 that starts from the label-shift Bayes rule, states the approximation explicitly, and discusses conditions under which the adjustment remains beneficial. We also include a brief analysis showing that, when there is no shift, the adjustment does not materially distort probabilities (consistent with the empirical results that performance is preserved on unshifted data). revision: yes

  2. Referee: [§4.2] §4.2 (DistPFN-T temperature adaptation): The adaptive temperature is described as depending on the discrepancy between prior and posterior, but the manuscript does not clarify whether this involves any post-hoc fitting on evaluation data or remains strictly test-time; this affects whether the method stays parameter-free as claimed.

    Authors: We apologize for the ambiguity. The temperature in DistPFN-T is computed entirely at test time for each query point: the discrepancy is measured between the empirical class distribution of the in-context examples and the model's own posterior probabilities on that point; the temperature is then set proportionally to this discrepancy. No parameters are optimized or fitted on any evaluation, validation, or held-out data. The procedure uses only quantities already available during inference. We have revised §4.2 to state this explicitly, provide the exact temperature formula, and reaffirm that the method remains strictly test-time and parameter-free. revision: yes

Circularity Check

0 steps flagged

No circularity: DistPFN is an explicitly defined test-time heuristic with external evaluation

full rationale

The paper introduces DistPFN as a proposed rescaling procedure that operates directly on the model's output posterior and the observed class distribution in the in-context examples. No derivation chain is claimed that reduces a 'prediction' or 'first-principles result' back to the same fitted quantities by construction. Evaluation occurs on held-out OpenML datasets under controlled label-shift conditions, and the central performance claims rest on these external benchmarks rather than on any self-referential fit or self-citation load-bearing step. The method is therefore self-contained as an empirical adjustment technique.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Review performed on abstract only; the method rests on the assumption that a simple rescaling of probabilities suffices to correct label-shift bias in in-context tabular models.

axioms (2)
  • domain assumption TabPFN overfits to the class distribution present in the in-context examples
    Stated as the central observed limitation.
  • ad hoc to paper Downweighting the training prior while preserving the model's posterior improves calibration under label shift
    Core design choice of DistPFN.

pith-pipeline@v0.9.0 · 5518 in / 1277 out tokens · 20379 ms · 2026-05-08T18:26:01.741372+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

16 extracted references · 7 canonical work pages · 1 internal anchor

  1. [1]

    Regularized

    Azizzadenesheli, K., Liu, A., Yang, F., and Anandkumar, A. Regularized learning for domain adaptation under label shifts.arXiv preprint arXiv:1903.09734,

  2. [2]

    arXiv preprint arXiv:1708.03731 , year=

    Bischl, B., Casalicchio, G., Feurer, M., Gijsbers, P., Hutter, F., Lang, M., Mantovani, R. G., van Rijn, J. N., and Vanschoren, J. Openml benchmarking suites.arXiv preprint arXiv:1708.03731,

  3. [3]

    Tabm: Advancing tabular deep learning with parameter-efficient ensembling.ICLR, 2024a

    10 Mitigating Label Shift in Tabular In-Context Learning via Test-Time Posterior Adjustment Gorishniy, Y ., Kotelnikov, A., and Babenko, A. Tabm: Advancing tabular deep learning with parameter-efficient ensembling.ICLR, 2024a. Gorishniy, Y ., Rubachev, I., Kartashev, N., Shlenskii, D., Kotelnikov, A., and Babenko, A. Tabr: Tabular deep learning meets near...

  4. [4]

    TabTransformer: Tabular Data Modeling Using Contextual Embeddings

    Huang, X., Khetan, A., Cvitkovic, M., and Karnin, Z. Tab- transformer: Tabular data modeling using contextual em- beddings.arXiv preprint arXiv:2012.06678,

  5. [5]

    Y ., and Yang, E

    Kim, C., Kim, T., Woo, S., Yang, J. Y ., and Yang, E. Adapt- able: Test-time adaptation for tabular data via shift-aware uncertainty calibrator and label distribution handler.arXiv preprint arXiv:2407.10784,

  6. [6]

    Long-tail learning via logit adjustment.arXiv preprint arXiv:2007.07314, 2020

    Menon, A. K., Jayasumana, S., Rawat, A. S., Jain, H., Veit, A., and Kumar, S. Long-tail learning via logit adjustment. arXiv preprint arXiv:2007.07314,

  7. [7]

    Saint: Improved neural networks for tabular data via row attention and contrastive pre-training.arXiv preprint arXiv:2106.01342, 2021

    Somepalli, G., Goldblum, M., Schwarzschild, A., Bruss, C. B., and Goldstein, T. Saint: Improved neural networks for tabular data via row attention and contrastive pre- training.arXiv preprint arXiv:2106.01342,

  8. [8]

    Thielmann, A. F. and Samiee, S. On the efficiency of nlp- inspired methods for tabular deep learning.arXiv preprint arXiv:2411.17207,

  9. [9]

    Dataset.We evaluate on 250+ tabular datasets from OpenML (Bischl et al., 2017)

    For inference, we load the pretrained weights from TabPFN-v23 available on Hugging Face. Dataset.We evaluate on 250+ tabular datasets from OpenML (Bischl et al., 2017). The dataset list is retrieved from the benchmark configuration provided in this repository4, which is built on top of the official TabPFN evaluation setup. Dataset statistics are summarize...

  10. [10]

    • DL (non-foundation) models (5):FT-Transformer (Gorishniy et al., 2021), TabM (Gorishniy et al., 2024a), Tabu- laRNN (Thielmann & Samiee, 2024), MambaTab (Ahamed & Cheng, 2024), RealMLP (Holzm¨uller et al.,

  11. [11]

    • DL (foundation) models based on ICL (3):TabPFN-v2 (Hollmann et al., 2025), LoCalPFN (Thomas et al., 2024), TabICL (Qu et al.,

  12. [12]

    Details of each method are provided below. C.1. Machine Learning (ML) Models • Logistic Regression (LR)(Cox, 1958): A simple linear model commonly used for binary and multiclass classification tasks in tabular data. • Support Vector Machine (SVM)(Cortes & Vapnik, 1995): A kernel-based classifier that aims to find the optimal decision boundary with maximum...

  13. [13]

    Bayesian view that replaces the mismatched prior with a self-consistent estimate from model predictions (Section E.2). E.1. Relation to Label Shift Correction The label shift setting assumes that the conditional distributionp(x|y)remains invariant while the marginal priors differ: ptrain(y)̸=p test(y), p(x|y)is fixed. Under this assumption, the Bayes-opti...

  14. [14]

    We conduct random search over these spaces and tune the models on validation datasets that are kept separate from the final test splits

    The search spaces are manually designed to cover commonly used ranges for each model class, including both optimization-related parameters and regularization or structural options. We conduct random search over these spaces and tune the models on validation datasets that are kept separate from the final test splits. The details of the hyperparameter searc...

  15. [15]

    Table L.1 presents the results, showing that our method is effective without requiring estimation of the test prior

    and Black-box Estimation (BBE) (Lipton et al., 2018). Table L.1 presents the results, showing that our method is effective without requiring estimation of the test prior. Figure L.1.Comparison with other label shift methods. Methods w/o shift Shift strength (β) 0.0 0.1 0.5 1.0 2.0 5.0 Avg. LoCalPFN 0.816 0.794 0.793 0.788 0.778 0.753 0.719 0.771 + EME 0.8...

  16. [16]

    the average prediction acrossmultiple instances. As shown in Table M.1, both choices consistently improve TabPFN-v2 (Hollmann et al., 2025), averaged across sixβs forw/ shift, demonstrating robustness to the choice of distribution source. Table M.1.Predicted distributions: Single vs. Multiple.The proposed methods consistently improves TabPFN-v2 regardless...