pith. machine review for the scientific record.

arxiv: 2605.11638 · v1 · submitted 2026-05-12 · 📊 stat.ML · cs.LG

Recognition: 1 theorem link · Lean Theorem

Learning U-Statistics with Active Inference

Changliang Zou, Liuhua Peng, Xiaoning Wang, Yuyang Huo

Pith reviewed 2026-05-13 01:12 UTC · model grok-4.3

classification 📊 stat.ML cs.LG
keywords U-statistics · active inference · inverse probability weighting · active sampling · estimation efficiency · label budget · empirical risk minimization

The pith

Active sampling guided by machine learning predictions reduces variance in U-statistic estimates while preserving valid inference under a fixed label budget.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops an active inference framework for U-statistics that selectively queries the most informative labels to improve estimation efficiency when labeling is costly. It constructs this around an augmented inverse probability weighting U-statistic that folds in both the sampling decisions and auxiliary machine learning predictions. A sympathetic reader would care because U-statistics underpin many nonparametric procedures, and being able to reach the same precision with fewer labels makes those procedures feasible in settings where data acquisition is expensive. The authors derive the sampling probabilities that minimize asymptotic variance and supply practical rules, then show how the same idea carries over to U-statistic-based empirical risk minimization. Experiments on real data confirm efficiency gains while coverage remains on target.
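For orientation, the object being estimated more cheaply is the classical order-two U-statistic. The display below is a standard textbook rendering with assumed notation ($Z_i = (X_i, Y_i)$, label budget $n_b$), not the paper's own, and the budget constraint is stated only informally.

\[
U_n \;=\; \binom{n}{2}^{-1} \sum_{1 \le i < j \le n} h(Z_i, Z_j),
\]

where $h$ is a symmetric kernel; $h(z_i, z_j) = |y_i - y_j|$ underlies the Gini mean difference, and sign- and concordance-based kernels give the Wilcoxon signed-rank and Kendall's $\tau$ statistics used in the experiments. In the active setting, only units with inclusion indicator $I_i = 1$ have $Y_i$ revealed, subject to an expected budget $\mathbb{E}\bigl[\sum_i I_i\bigr] \le n_b$, while machine learning predictions $\hat{\mu}(X_i)$ are available for every unit.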

Core claim

We develop an active inference framework for U-statistics that selectively queries informative labels to improve estimation efficiency under a fixed labeling budget, while preserving valid statistical inference. Our approach is built on the augmented inverse probability weighting U-statistic, which is designed to incorporate the sampling rule and machine learning predictions. We characterize the optimal sampling rule that minimizes its variance and design practical sampling strategies. We further extend the framework to U-statistic-based empirical risk minimization.

What carries the argument

The augmented inverse probability weighting (AIPW) U-statistic, which reweights each labeled kernel term by the inverse of its selection probability and adds a prediction-based correction term so that the active sampling does not bias the estimate.
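A minimal numerical sketch of that construction, under the simplest design the rebuttal below also invokes: independent Bernoulli (Poisson-type) sampling with per-unit probabilities pi, a symmetric pairwise kernel h, and predictions y_hat used to impute unlabeled units. The function name, the exact augmentation, and the toy data are illustrative assumptions, not the paper's code or notation.

    import numpy as np

    def aipw_u_statistic(y, y_hat, labeled, pi, h):
        # Hedged sketch of an augmented-IPW order-two U-statistic.
        # y       : true labels (entries are only read where labeled[i] is True)
        # y_hat   : ML predictions for every unit
        # labeled : inclusion indicators I_i drawn as independent Bernoulli(pi[i])
        # pi      : the per-unit sampling probabilities that generated `labeled`
        # h       : symmetric pairwise kernel h(a, b)
        # The augmentation below is one standard way to make each pair term
        # conditionally unbiased under independent sampling; the paper's exact
        # correction may differ.
        n = len(y_hat)
        total, n_pairs = 0.0, 0
        for i in range(n):
            for j in range(i + 1, n):
                base = h(y_hat[i], y_hat[j])               # prediction-only term
                term = base
                if labeled[i]:                             # correct the i-margin
                    term += (h(y[i], y_hat[j]) - base) / pi[i]
                if labeled[j]:                             # correct the j-margin
                    term += (h(y_hat[i], y[j]) - base) / pi[j]
                if labeled[i] and labeled[j]:              # joint correction
                    term += (h(y[i], y[j]) - h(y[i], y_hat[j])
                             - h(y_hat[i], y[j]) + base) / (pi[i] * pi[j])
                total += term
                n_pairs += 1
        return total / n_pairs

    # toy check with the Gini-mean-difference kernel |a - b| and a uniform budget
    rng = np.random.default_rng(0)
    n, budget = 500, 100
    y = rng.lognormal(size=n)
    y_hat = y * np.exp(0.2 * rng.normal(size=n))   # imperfect "ML" predictions
    pi = np.full(n, budget / n)                    # uniform probabilities, for illustration
    labeled = rng.random(n) < pi
    full = np.abs(y[:, None] - y[None, :])[np.triu_indices(n, 1)].mean()
    print(aipw_u_statistic(y, y_hat, labeled, pi, lambda a, b: abs(a - b)), full)

Averaged over many draws of `labeled`, the AIPW value should track the fully labeled U-statistic `full` while using only about `budget` labels; the paper's contribution lies in choosing non-uniform `pi` to shrink its variance.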

If this is right

  • Under the optimal sampling rule, the estimator achieves strictly lower asymptotic variance than passive uniform sampling at any fixed labeling budget.
  • The estimator remains asymptotically normal, so standard confidence intervals and tests stay valid.
  • The same active framework applies directly to U-statistic empirical risk minimization, improving model training efficiency under label constraints.
  • Practical sampling strategies derived from the optimal rule deliver measurable efficiency gains on real data while maintaining coverage.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If machine learning predictors continue to improve, the efficiency advantage of this active framework would grow without changing the core procedure.
  • The same weighting idea could be adapted to other semiparametric estimators that depend on pairwise or higher-order kernels.
  • Sequential or online versions of the sampling rule might further reduce the total labels needed in streaming settings.

Load-bearing premise

The machine learning predictions used to guide sampling are accurate enough not to introduce bias, and the augmented IPW U-statistic correctly incorporates the active sampling rule without invalidating the inference properties.

What would settle it

If experiments with deliberately inaccurate machine learning predictions showed the variance reduction disappearing, or confidence-interval coverage falling below the nominal level, the practical value of the framework would be undermined.
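A bare-bones version of that stress test, again under the assumptions of the earlier sketch (independent Bernoulli sampling, the Gini-mean-difference kernel, and an illustrative augmentation). It only inspects Monte Carlo bias and spread; building the confidence intervals themselves requires the paper's variance formula, which is not reproduced here.

    import numpy as np

    def aipw_gmd(y, y_hat, labeled, pi):
        # Vectorized AIPW estimate of the Gini mean difference E|Y1 - Y2|,
        # using the same illustrative augmentation as the sketch above.
        n = len(y)
        y_obs = np.where(labeled, y, 0.0)   # unlabeled entries always carry weight 0
        w = labeled / pi                     # inverse-probability weights
        K = lambda a, b: np.abs(a[:, None] - b[None, :])
        base = K(y_hat, y_hat)
        term = (base
                + w[:, None] * (K(y_obs, y_hat) - base)
                + w[None, :] * (K(y_hat, y_obs) - base)
                + (w[:, None] * w[None, :]) * (K(y_obs, y_obs) - K(y_obs, y_hat)
                                               - K(y_hat, y_obs) + base))
        return term[np.triu_indices(n, 1)].mean()

    def monte_carlo(pred_noise, reps=200, n=400, budget=120, seed=1):
        # pred_noise controls how badly the "ML" predictions are corrupted
        rng = np.random.default_rng(seed)
        ests = []
        for _ in range(reps):
            y = rng.lognormal(size=n)
            y_hat = y * np.exp(pred_noise * rng.normal(size=n))
            pi = np.full(n, budget / n)
            labeled = rng.random(n) < pi
            ests.append(aipw_gmd(y, y_hat, labeled, pi))
        ests = np.asarray(ests)
        return ests.mean(), ests.std()

    print("decent predictions   :", monte_carlo(pred_noise=0.1))
    print("corrupted predictions:", monte_carlo(pred_noise=2.0))

If the augmentation is correct, the mean should stay essentially unchanged across the two runs while the spread widens under corrupted predictions; a drifting mean, or nominal coverage lost once intervals are built on top, would be the refuting outcome described above.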

Figures

Figures reproduced from arXiv: 2605.11638 by Changliang Zou, Liuhua Peng, Xiaoning Wang, Yuyang Huo.

Figure 1. Income (ACS) dataset. Estimation of the Gini index as a measure of population income inequality. Left: 90% confidence intervals for the Gini index estimate at budgets n_b ∈ {800, 4800, 8800, 12000, 16000}. Middle: empirical coverage of the intervals (dashed line denotes the target coverage level of 90%). Right: effective sample size n_eff, with shaded ±1 standard-deviation bands.

Figure 2. Perioperative dataset (VitalDB). Estimation of the Wilcoxon signed-rank test statistic to assess whether hemoglobin exhibits a systematic shift at anesthesia induction. Left: 90% confidence intervals for the estimate at budgets n_b ∈ {76, 136, 196, 244, 304}. Middle: empirical coverage of the intervals. Right: effective sample size n_eff, with shaded ±1 standard-deviation bands.

Figure 3. Political bias dataset. Estimation of Kendall's τ to detect whether LLM-based labels distort the underlying ordinal structure and thereby bias downstream audits. Left: 90% confidence intervals for the estimate at budgets n_b ∈ {19, 114, 209, 285, 380}. Middle: empirical coverage of the intervals. Right: effective sample size n_eff, with shaded ±1 standard-deviation bands.

Figure 4. Savings in sample budget due to active inference. Reduction in the sample size required to achieve the same confidence-interval width across the applications shown in Figures 1–3.

Figure 5. Estimation of the Gini index. Left: empirical coverage of the intervals. Right: effective sample size n_eff, with shaded ±1 standard-deviation bands.

Figure 6. Average MSE under different choices of sample budget n_b.

Figure 7. Left: average MSE for the U-estimator under different choices of sample budget n_b. Right: n_eff for the U-estimator under different choices of sample budget n_b.
read the original abstract

$U$-statistics play a central role in statistical inference. In many modern applications, however, acquiring the labels required for $U$-statistics is costly. Motivated by recent advances in active inference, we develop an active inference framework for $U$-statistics that selectively queries informative labels to improve estimation efficiency under a fixed labeling budget, while preserving valid statistical inference. Our approach is built on the augmented inverse probability weighting $U$-statistic, which is designed to incorporate the sampling rule and machine learning predictions. We characterize the optimal sampling rule that minimizes its variance and design practical sampling strategies. We further extend the framework to $U$-statistic-based empirical risk minimization. Experiments on real datasets demonstrate substantial gains in estimation efficiency over baseline methods, while maintaining target coverage.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 0 minor

Summary. The manuscript develops an active inference framework for U-statistics that selectively queries labels using machine learning predictions to improve estimation efficiency under a fixed labeling budget. The estimator is built on an augmented inverse probability weighting U-statistic designed to incorporate the sampling rule and predictions while preserving valid statistical inference; the manuscript characterizes the optimal sampling rule that minimizes variance, extends the approach to U-statistic-based empirical risk minimization, and reports experimental gains in efficiency with maintained coverage on real datasets.

Significance. If the unbiasedness and variance characterization hold under active sampling, the framework could meaningfully advance efficient nonparametric inference in label-scarce regimes by bridging active learning with U-statistic theory. The explicit characterization of the optimal sampling rule and the extension to empirical risk minimization would be notable contributions if rigorously established.

major comments (1)
  1. The central claim that the augmented IPW U-statistic preserves unbiasedness and valid inference under the active per-instance sampling rule (guided by ML predictions) is load-bearing for the entire contribution. Because U-statistics are defined over pairs, the pair inclusion probability is the product of individual probabilities only under independent sampling; any dependence induced by the global active rule or by correlation between the predictor and labels could invalidate the simple per-instance augmentation. The manuscript should provide an explicit derivation (in the section introducing the augmented IPW U-statistic) showing that the augmentation term cancels the bias for all pairwise terms and that the variance formula remains correct under the stated sampling probabilities.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their careful reading of the manuscript and for highlighting the importance of rigorously establishing the unbiasedness properties of the augmented IPW U-statistic. We address the major comment in detail below and will incorporate the requested clarification into the revised version.

read point-by-point responses
  1. Referee: The central claim that the augmented IPW U-statistic preserves unbiasedness and valid inference under the active per-instance sampling rule (guided by ML predictions) is load-bearing for the entire contribution. Because U-statistics are defined over pairs, the pair inclusion probability is the product of individual probabilities only under independent sampling; any dependence induced by the global active rule or by correlation between the predictor and labels could invalidate the simple per-instance augmentation. The manuscript should provide an explicit derivation (in the section introducing the augmented IPW U-statistic) showing that the augmentation term cancels the bias for all pairwise terms and that the variance formula remains correct under the stated sampling probabilities.

    Authors: We agree that an explicit derivation is essential for establishing the validity of the framework. In the revised manuscript we will insert a dedicated subsection immediately following the definition of the augmented IPW U-statistic. The derivation proceeds as follows: the active sampling rule is implemented via independent Bernoulli draws with success probabilities π_i that depend only on the (fixed) machine-learning predictions for each instance; this is realized through a Poisson sampling scheme that respects the overall labeling budget in expectation while preserving independence of the inclusion indicators I_i. Consequently, P(I_i = 1, I_j = 1) = π_i π_j exactly. For any pair term, the estimator contribution is I_i I_j h(Y_i, Y_j) / (π_i π_j) augmented by correction terms that replace missing observations with the known predictions (scaled by the appropriate inclusion probabilities). Taking the conditional expectation given the predictions shows that each augmentation term exactly cancels the bias introduced by the missing indicators, yielding E[augmented pair term | predictions] = E[h(Y_i, Y_j) | predictions]. Unconditioning then recovers the unconditional expectation. The same conditioning argument shows that the variance formula, which already accounts for the second-moment structure of the indicators, remains valid; the correlation between the predictor and the labels is absorbed into the conditional expectations and does not affect the unbiasedness. We will also add a short remark clarifying that the Poisson approximation introduces only a negligible dependence that vanishes asymptotically under standard budget scaling. revision: yes
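A compressed version of that conditional-expectation argument, written here for an order-two kernel under independent Bernoulli draws; the notation and the particular augmentation are illustrative (they match the sketches above), not necessarily the manuscript's exact construction.

\[
\widehat h_{ij} \;=\; h(\hat\mu_i,\hat\mu_j)
+ \frac{I_i}{\pi_i}\bigl(h(Y_i,\hat\mu_j)-h(\hat\mu_i,\hat\mu_j)\bigr)
+ \frac{I_j}{\pi_j}\bigl(h(\hat\mu_i,Y_j)-h(\hat\mu_i,\hat\mu_j)\bigr)
+ \frac{I_i I_j}{\pi_i\pi_j}\bigl(h(Y_i,Y_j)-h(Y_i,\hat\mu_j)-h(\hat\mu_i,Y_j)+h(\hat\mu_i,\hat\mu_j)\bigr).
\]

Because the indicators are independent given the data $\mathcal{D}$, $\mathbb{E}[I_i/\pi_i \mid \mathcal{D}] = 1$ and $\mathbb{E}[I_i I_j/(\pi_i\pi_j) \mid \mathcal{D}] = 1$, so the terms telescope and $\mathbb{E}[\widehat h_{ij} \mid \mathcal{D}] = h(Y_i, Y_j)$; unconditional unbiasedness of the averaged estimator follows from the tower property. The step that breaks without Poisson-type sampling is precisely $\mathbb{E}[I_i I_j] = \pi_i \pi_j$, which is the dependence the referee flags.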

Circularity Check

0 steps flagged

No circularity: framework derives optimal rule from variance expression without self-definition or fitted-input renaming

full rationale

The abstract and skeptic summary describe a standard construction: an augmented IPW U-statistic is defined to incorporate a given sampling rule and ML predictions, after which the optimal rule is characterized by minimizing the resulting variance expression. This is a conventional optimization step (minimize Var(estimator) w.r.t. sampling probabilities) rather than a reduction by construction. No quoted equation shows the optimal rule being presupposed in the estimator definition, no self-citation is invoked as a uniqueness theorem, and no parameter fitted on one subset is relabeled as a prediction on another. The derivation chain therefore remains self-contained with independent mathematical content.
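As a hedged illustration of why this is a routine optimization step, the generic template is a Neyman-type allocation: if the leading variance term has the form $\sum_i \sigma_i^2/\pi_i$ for some residual-scale quantities $\sigma_i$ (here a stand-in for whatever enters the Hoeffding projection of the AIPW U-statistic, not the paper's exact expression), then minimizing under an expected-budget constraint gives

\[
\min_{\pi}\ \sum_{i=1}^{n} \frac{\sigma_i^2}{\pi_i}
\quad \text{s.t.} \quad \sum_{i=1}^{n} \pi_i = n_b,\ 0 < \pi_i \le 1
\qquad \Longrightarrow \qquad
\pi_i^{\star} \propto \sigma_i,
\]

by Cauchy–Schwarz or a Lagrange multiplier, up to capping probabilities at one. The point of the audit above is that $\pi^{\star}$ is derived from the variance of an estimator already fixed in advance, so no circular dependence enters its definition.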

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only abstract available; no explicit free parameters, axioms, or invented entities can be identified or audited.

pith-pipeline@v0.9.0 · 5422 in / 1037 out tokens · 44919 ms · 2026-05-13T01:12:06.217612+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.
