Semi-Supervised Hypothesis Testing by Betting on Predictions

Elad Tolochinsky; Yaniv Romano; Yaniv Tenzer

arxiv: 2605.28533 · v1 · pith:E45AYXLZnew · submitted 2026-05-27 · 💻 cs.LG

Pith reviewed 2026-06-29 13:51 UTC · model grok-4.3

classification 💻 cs.LG

keywords: semi-supervised hypothesis testing · sequential testing · e-statistics · label shift · concept shift · betting framework · prediction-powered inference

The pith

An e-statistic from predictions on unlabeled data creates anytime-valid sequential hypothesis tests under label or concept shift.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a framework for sequential hypothesis testing that incorporates predictions made on abundant unlabeled feature data. It defines an e-statistic whose properties ensure the resulting test controls error rates at any stopping time provided the data follows label shift or concept shift. The approach demonstrates non-trivial power for binary classification tasks and preserves validity even when the predictor is weak or uncorrelated with the target. This matters for settings like model evaluation where labeled data is scarce but unlabeled examples and a black-box predictor are available.

Core claim

We introduce an e-statistic and use it to construct a sequential test. Under standard distributional assumptions -- label shift or concept shift -- we establish that the test is anytime valid. Furthermore, we show that for binary data, the e-statistic has non-trivial power. Crucially, our approach retains these properties even when the underlying predictions are inaccurate.

What carries the argument

An e-statistic constructed by betting on predictions of Y from X on unlabeled samples.

If this is right

The sequential test controls type I error at any time under the stated assumptions.
It has non-trivial power against alternatives for binary data.
It outperforms baseline methods and prediction-powered inference in simulations and LLM evaluation tasks.
These advantages remain with limited unlabeled data and low-accuracy predictions.
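
To make the first point concrete, the following is a minimal Python sketch of a generic testing-by-betting procedure on labeled binary outcomes only. It is not the paper's e-statistic, whose functional form this page does not give. The function name `betting_test`, the fixed bet size `lam`, and the payoff `1 + lam * (y - theta_null)` are illustrative assumptions. The sketch tests the one-sided null that the mean of Y is at most `theta_null` and rejects as soon as the wealth reaches `1 / alpha`.

```python
def betting_test(ys, theta_null, alpha=0.05, lam=0.5):
    """Generic one-sided betting test (illustrative, not the paper's method).

    ys         : iterable of binary outcomes in {0, 1}
    theta_null : hypothesized upper bound on the mean of Y, in (0, 1]
    alpha      : nominal type I error level
    lam        : fixed bet size; must keep every payoff nonnegative
    Returns the first step at which the null is rejected, or None.
    """
    if not 0.0 < theta_null <= 1.0:
        raise ValueError("theta_null must lie in (0, 1]")
    if not 0.0 <= lam <= 1.0 / theta_null:
        raise ValueError("lam must lie in [0, 1/theta_null]")

    wealth = 1.0               # start with one unit of wealth
    threshold = 1.0 / alpha    # rejection threshold for the wealth
    for t, y in enumerate(ys, start=1):
        # Payoff has expectation at most 1 when E[Y] <= theta_null and lam >= 0.
        wealth *= 1.0 + lam * (y - theta_null)
        if wealth >= threshold:
            return t
    return None
```

For example, `betting_test([1] * 20, theta_null=0.5)` stops once the wealth, which grows by a factor of 1.25 per step on a run of ones, first reaches 20. A fixed `lam` keeps the sketch short; adaptive bet sizes are common in practice but are outside what this page describes.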

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The betting construction could be adapted to test other functionals of the conditional distribution beyond the basic hypotheses considered.
Similar use of predictions might increase power in other semi-supervised testing problems that currently rely only on labeled samples.
Applying the method to datasets with independently verified label shift would provide a direct check on the reported power gains.

Load-bearing premise

The data distribution satisfies either label shift or concept shift.

What would settle it

A simulation or real-data case in which the sequential test exceeds its rejection threshold under the null hypothesis at a rate higher than the nominal alpha, while label shift or concept shift holds.
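
A check of this kind can be sketched as a small Monte Carlo experiment. The Python below is a hedged illustration, assuming a plain Bernoulli null and a generic fixed-bet wealth process rather than the paper's e-statistic or its shift settings. The function name `false_rejection_rate` and all default parameters are hypothetical. It estimates how often the wealth crosses `1 / alpha` when the null is true.

```python
import random


def false_rejection_rate(theta_null=0.5, alpha=0.05, lam=0.5,
                         steps=200, runs=2000, seed=0):
    """Estimate the rate of false rejections under a Bernoulli(theta_null) null.

    Illustrative only: the wealth process is a generic betting martingale,
    not the e-statistic defined in the paper.
    """
    rng = random.Random(seed)      # local generator for reproducibility
    threshold = 1.0 / alpha
    rejections = 0
    for _ in range(runs):
        wealth = 1.0
        for _ in range(steps):
            # Draw a binary outcome from the null distribution.
            y = 1.0 if rng.random() < theta_null else 0.0
            wealth *= 1.0 + lam * (y - theta_null)
            if wealth >= threshold:
                rejections += 1    # count at most one rejection per run
                break
    return rejections / runs


if __name__ == "__main__":
    print("estimated false rejection rate:", false_rejection_rate())
```

An estimate well above `alpha`, beyond Monte Carlo error, would point to a validity problem. For the actual method, the same loop would need data generated under label shift or concept shift and the paper's own e-statistic in place of the generic payoff.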

Figures

Figures reproduced from arXiv: 2605.28533 by Elad Tolochinsky, Yaniv Romano, Yaniv Tenzer.

**Figure 1.** Simulation results for the label shift setting. In all settings, the signal strength of the labeled data is fixed and the signal strength of the unlabeled data varies between plots: (a) weak (N = 30, low correlation), (b) medium (N = 135, low correlation), (c) medium-high (N = 30, high correlation), (d) strong (N = 135, high correlation). [Plot legend: Baselines with Ours; Baselines w/o Ours; Only Ours; $e^{Y}_{\mathrm{LR}}$; $e^{Y|X}_{\mathrm{LR}}$]

**Figure 2.** Simulation results for the concept shift setting. … appropriate baselines are $e^{Y}_{\mathrm{LR}}$ and $e^{Y|X}_{\mathrm{LR}}$, thus we plot $\mathrm{conv}(e^{Y}_{\mathrm{LR}}, e^{Y|X}_{\mathrm{LR}}, e_{\mathrm{PPI}})$ and $\mathrm{conv}(e^{Y}_{\mathrm{LR}}, e^{Y|X}_{\mathrm{LR}}, e_{\mathrm{PPI}}, \breve{e}_t)$. Additionally, since we assume we can freely sample unlabeled data from the null, and in this setting $P_X$ is fixed between the null and the alternative, we only consider large $N$, concretely $N = 135$.

**Figure 3.** Power as a function of step for online testing of whether a new LLM improves on an existing LLM. We use math benchmarks to evaluate the models. In all settings Y is the accuracy of the model and X varies between experiments: (a) X is the identifier of the math question, (b) X is the accuracy according to an LLM judge and we assume access to few unlabeled samples, and (c) where X is the accur…

**Figure 4.** Power as a function of step for online testing of an increase in the procurement of private health insurance. … with n = 10 and N ∈ {10, 60}. We simulate each test for 800 steps and repeat the same evaluation protocol as in Section 5. We refer the reader to Section G.1 for a description of the parameters we used. The results appear in Figures 3b and 3c. When only a few unlabeled points are available (Fi…

**Figure 5.** Comparing PPI to only using labeled data for the label shift setting of Section 5.4. … E.3. One-Sided PPI. The PPI e-process, as we defined in (9), is a two-sided test. However, our setting is one-sided, and while it is valid to use a two-sided test for a one-sided setup, it places the two-sided test at a disadvantage in terms of power. To simulate a fair comparison, we modify the PPI e-process to be aware of…

**Figure 6.** Comparing the imputed process to PPI and one-sided PPI. We simulate a small and large number of unlabeled points per batch (N = 35 and N = 135, respectively), and high correlation (0.3) and low correlation (0.7) settings. … to this process as the one-sided PPI e-process. To validate our method against the new process, we repeat the simulation setup of Section 5.4, but now we use the one-sided PPI e-process i…

**Figure 7.** Power curve of each baseline used in the simulations from Section 5.4. … F.2. Concept Shift: Simulation Setup. We set $\theta^{\mathrm{null}}_{Y|X=0} = 0.4$, $\theta^{\mathrm{null}}_{Y|X=1} = 0.7$, and $\theta^{\mathrm{null}}_{X} = 0.5$ for the low correlation setting and $\theta^{\mathrm{null}}_{Y|X=0} = 0.2$, $\theta^{\mathrm{null}}_{Y|X=1} = 0.85$, and $\theta^{\mathrm{null}}_{X} = 0.5$ for the high correlation setting. To sample from a distribution from the alternative we use: $\theta_{Y|X=0} = \theta^{\mathrm{null}}_{Y|X=0} + 0.02$ and $\theta_{Y|X=1}$…

**Figure 8.** Power curve of each baseline used in the simulations from Section 5.5. [Plot residue: Power vs. Step; series M = 2, M = 5, M = 32, M = 128]

**Figure 9.** The effect of M on the power of the imputed e-process in the label shift setting from Section 5.4. [Plot residue: Power vs. Step; legend: Baselines with Ours, Baselines w/o Ours, Only Ours, $e^{Y}_{\mathrm{LR}}$, $e^{Y|X}_{\mathrm{LR}}$; panels (a) N = 135, low correlation, (b) N = 135, high correlation]

**Figure 10.** Power curves for testing an alternative hypothesis which is not right-sided, i.e., $\theta_{Y|X=0} > \theta^{\mathrm{null}}_{Y|X=0}$ and $\theta_{Y|X=1} < \theta^{\mathrm{null}}_{Y|X=1}$.

**Figure 11.** False rejection rate for real-world experiments for α = 0.05. … G.2. Data. GSM8K, which contains 87,900 grade school math problems. MATH (Hendrycks et al., 2021), which is a dataset of 12,500 challenging math problems, and AQUA-RAT (Algebra Question Answering with Rationales) (Ling et al., 2017), which contains about 100,000 algebra questions. G.3. Concept Shift. To sample from the null distribution we use: θ n…

Original abstract

We introduce a testing-by-betting framework that leverages predictions on unlabeled data to enhance the power of sequential hypothesis testing. Given limited samples from the joint distribution of $(X,Y)$, and additional unlabeled samples from the marginal of $X$, we ask how unlabeled data can be used to hypothesize about the distribution of $Y$, and the conditional distribution of $Y\mid X$. We introduce an e-statistic and use it to construct a sequential test. Under standard distributional assumptions -- label shift or concept shift -- we establish that the test is anytime valid. Furthermore, we show that for binary data, the e-statistic has non-trivial power. Crucially, our approach retains these properties even when the underlying predictions are inaccurate. Through simulations and applications to large language models evaluation, we demonstrate power gains over baseline approaches, including prediction-powered inference. These gains persist even with relatively limited unlabeled data and when predictions have low accuracy due to weak correlation between $X$ and $Y$.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper builds an e-statistic from predictions on unlabeled X to run anytime-valid sequential tests under label or concept shift, with claimed power even for weak predictors.

The main takeaway is a betting framework that folds predictions on extra unlabeled samples into sequential hypothesis testing while preserving anytime validity under standard shift assumptions. The e-statistic is constructed so that validity does not require the predictor to be accurate, only that the distributional shift conditions hold; power is shown non-trivial for binary outcomes.

The work integrates e-values with semi-supervised ideas in a way that avoids the usual dependence on predictor quality for the validity guarantee. Simulations and the LLM evaluation example show power improvements over prediction-powered inference baselines, and these hold with modest amounts of unlabeled data. That is a practical plus for settings where labels are expensive.

The assumptions are the usual ones (label shift or concept shift), and the robustness claim to inaccurate predictions is the part that stands out. The citation pattern tracks the relevant e-statistics and prediction-powered lines without obvious gaps.

A soft spot is the power result being stated only for binary data; how much the unlabeled samples actually help when X and Y are weakly correlated is shown empirically but not bounded tightly in general. The abstract gives no explicit form for the e-statistic, so the exact supermartingale construction and any hidden constants would need checking in the full derivations. No load-bearing circularity appears.

This is for people working on sequential testing or efficient evaluation in ML. It deserves a serious referee because the anytime-validity-plus-robustness angle is a clean extension worth verifying, even if the gains are incremental in some regimes.

Referee Report

0 major / 3 minor

Summary. The paper introduces a testing-by-betting framework for semi-supervised sequential hypothesis testing. Given limited labeled samples from the joint (X,Y) distribution and additional unlabeled samples from the marginal of X, it constructs an e-statistic from predictions on the unlabeled X's. Under label shift or concept shift, the resulting e-process is shown to be a supermartingale (anytime valid). For binary Y the e-statistic is claimed to have non-trivial power even when the predictor is arbitrarily inaccurate. Simulations and an LLM-evaluation application demonstrate power gains over baselines including prediction-powered inference, with the gains persisting under limited unlabeled data and low-accuracy predictions.

Significance. If the validity and power claims hold, the work would extend the betting/e-process literature to a practically relevant semi-supervised regime while preserving the key robustness property that validity does not require predictor accuracy. The ability to obtain power improvements from unlabeled data even when X and Y are only weakly correlated is a potentially useful contribution for sequential testing in machine-learning evaluation and other label-scarce domains.

minor comments (3)

The abstract and introduction state that the e-statistic is constructed from predictions on unlabeled X samples, but the precise functional form (how the prediction is turned into the betting payoff) is not visible in the provided abstract; a short explicit definition or reference to the relevant equation in §3 would improve readability.
The power claim is stated for binary data; it would be helpful to clarify in the main text whether the non-trivial power result extends to the multi-class or continuous-Y settings mentioned in the broader framework, or whether it is deliberately restricted.
The experimental section compares against prediction-powered inference; a brief discussion of why the betting construction yields gains even when the underlying predictor is weak would strengthen the narrative (currently only shown empirically).

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment of our work, the recognition of its potential contributions to semi-supervised sequential testing, and the recommendation for minor revision. No specific major comments were provided in the report.

Circularity Check

0 steps flagged

No significant circularity detected

The paper constructs an e-statistic from predictions on unlabeled X samples and proves the resulting e-process is a supermartingale (hence the sequential test is anytime valid) under the external distributional assumptions of label shift or concept shift. These assumptions are standard and independent of the paper's fitted quantities or prior self-citations. No equation reduces a claimed prediction to a fitted parameter by construction, no uniqueness theorem is imported from the authors' prior work, and the non-trivial power result for binary Y is stated to hold even for inaccurate predictors. The derivation chain is therefore self-contained against external benchmarks rather than circular.
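
The step from "supermartingale" to "anytime valid" rests on Ville's inequality, which can be stated as follows. This is a standard result given here as a reminder; the symbols $E_t$, $\mathcal{F}_t$, and $\tau$ are notation assumed for this sketch and are not taken from the paper.

```latex
% Assumption: (E_t)_{t \ge 0} is a nonnegative process with E_0 = 1 such that,
% under every distribution P in the null,
%   \mathbb{E}_P[E_t \mid \mathcal{F}_{t-1}] \le E_{t-1} \quad \text{for all } t \ge 1.
\[
  P\Bigl(\exists\, t \ge 1 : E_t \ge \tfrac{1}{\alpha}\Bigr) \;\le\; \alpha
  \qquad \text{for every } \alpha \in (0, 1).
\]
% Consequence: the rule "reject at \tau = \inf\{t \ge 1 : E_t \ge 1/\alpha\}"
% has type I error at most \alpha, however long the test is monitored.
```

Because the bound holds simultaneously over all times, the analyst may inspect the wealth after every observation and stop at any point without inflating the error rate.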

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; all claims rest on the unelaborated distributional assumptions of label shift or concept shift.

Reference graph

Works this paper leans on

11 extracted references · 6 canonical work pages · 4 internal anchors

[1]

PPI++: Efficient Prediction-Powered Inference

Angelopoulos, A. N., Bates, S., Fannjiang, C., Jordan, M. I., and Zrnic, T. Prediction-powered inference. Science, 382(6671):669–674, 2023a. Angelopoulos, A. N., Duchi, J. C., and Zrnic, T. PPI++: Efficient prediction-powered inference. arXiv preprint arXiv:2311.01453, 2023b. Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, ...
[2]

Prediction-powered e-values

Csillag, D., Struchiner, C. J., and Goedert, G. T. Prediction-powered e-values. arXiv preprint arXiv:2502.04294,
[3]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

URL https://arxiv.org/abs/2501.12948. Ding, F., Hardt, M., Miller, J., and Schmidt, L. Retiring adult: New datasets for fair machine learning. Advances in Neural Information Processing Systems, 34:6478–6490,
[4]

The Llama 3 Herd of Models

Grattafiori, A., Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Vaughan, A., et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783,
[5]

TabPFN: A transformer that solves small tabular classification problems in a second

Hollmann, N., Müller, S., Eggensperger, K., and Hutter, F. TabPFN: A transformer that solves small tabular classification problems in a second. In The Eleventh International Conference on Learning Representations, ICLR 2023,

2023
[6]

Étude critique de la notion de collectif

Ville, J. Étude critique de la notion de collectif. Gauthier-Villars, Paris, 1939. Monographies des Probabilités. Calcul des Probabilités et ses Applications,

1939
[7]

Waudby-Smith, I., Sandoval, R., and Jordan, M. I. Universal log-optimality for general classes of e-processes and sequential hypothesis tests. arXiv preprint arXiv:2504.02818,
[8]

Qwen2.5 Technical Report

Yang, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Li, C., Liu, D., Huang, F., Wei, H., Lin, H., Yang, J., Tu, J., Zhang, J., Yang, J., Yang, J., Zhou, J., Lin, J., Dang, K., Lu, K., Bao, K., Yang, K., Yu, L., Li, M., Xue, M., Zhang, P., Zhu, Q., Men, R., Lin, R., Li, T., Xia, T., Ren, X., Ren, X., Fan, Y., Su, Y., Zhang, Y., Wan, Y., Liu, Y. ...
[9]

Background A.1

A. Background. A.1. First-order stochastic dominance (FOSD). The notion of stochastic order, and FOSD in particular, is at the core of this work. Formally, let $X$ and $Y$ be real-valued random variables with cumulative distribution functions $F_X(t) = P(X \le t)$ and $F_Y(t) = P(Y \le t)$. We say that $Y$ first-o...

2007
[10]

…, given a sequence $(X_1, Y_1, \tilde{X}_1), \ldots, (X_t, Y_t, \tilde{X}_t)$ of labeled and unlabeled data, we define the following betting procedure: at each step $t$, we bet $\lambda_t$ of our wealth against the null and receive the payoff $e_{\mathrm{PPI}}[X_t, Y_t, \tilde{X}_t] = 1 + \lambda_t (w_t - \theta^{\mathrm{null}}_{Y})$ (9), where $\lambda_t \in [-1/2, 1/2]$ and $w_t$ is the PPI estimator for the mean of $Y$: $w_t = Y_t + \epsilon_t \bigl( \tfrac{1}{N} \sum_{X \in \tilde{X}_t}$ ...

2023
[11]

… = 0.9. [Plot residue: Power vs. Step for $e^{Y}_{\mathrm{LR}}$, $e^{X}_{\mathrm{LR}}$, and PPI; panels (a) N = 30, low correlation, (b) N = 135, low correlation, (c) N = 30, high correlation, (d) N = 135, hig...]

2021
