pith. machine review for the scientific record.

arxiv: 2605.11638 · v1 · submitted 2026-05-12 · 📊 stat.ML · cs.LG

Recognition: 1 theorem link · Lean Theorem

Learning U-Statistics with Active Inference

Changliang Zou, Liuhua Peng, Xiaoning Wang, Yuyang Huo

Pith reviewed 2026-05-13 01:12 UTC · model grok-4.3

classification 📊 stat.ML cs.LG
keywords U-statistics · active inference · inverse probability weighting · active sampling · estimation efficiency · label budget · empirical risk minimization

The pith

Active sampling guided by machine learning predictions reduces variance in U-statistic estimates while preserving valid inference under a fixed label budget.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops an active inference framework for U-statistics that selectively queries the most informative labels to improve estimation efficiency when labeling is costly. It constructs this around an augmented inverse probability weighting U-statistic that folds in both the sampling decisions and auxiliary machine learning predictions. A sympathetic reader would care because U-statistics underpin many nonparametric procedures, and being able to reach the same precision with fewer labels makes those procedures feasible in settings where data acquisition is expensive. The authors derive the sampling probabilities that minimize asymptotic variance and supply practical rules, then show how the same idea carries over to U-statistic-based empirical risk minimization. Experiments on real data confirm efficiency gains while coverage remains on target.
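For orientation, the object being estimated more cheaply is the classical order-two U-statistic. The display below is a standard textbook rendering with assumed notation ($Z_i = (X_i, Y_i)$, label budget $n_b$), not the paper's own, and the budget constraint is stated only informally.

\[
U_n \;=\; \binom{n}{2}^{-1} \sum_{1 \le i < j \le n} h(Z_i, Z_j),
\]

where $h$ is a symmetric kernel; $h(z_i, z_j) = |y_i - y_j|$ underlies the Gini mean difference, and sign- and concordance-based kernels give the Wilcoxon signed-rank and Kendall's $\tau$ statistics used in the experiments. In the active setting, only units with inclusion indicator $I_i = 1$ have $Y_i$ revealed, subject to an expected budget $\mathbb{E}\bigl[\sum_i I_i\bigr] \le n_b$, while machine learning predictions $\hat{\mu}(X_i)$ are available for every unit.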

Core claim

We develop an active inference framework for U-statistics that selectively queries informative labels to improve estimation efficiency under a fixed labeling budget, while preserving valid statistical inference. Our approach is built on the augmented inverse probability weighting U-statistic, which is designed to incorporate the sampling rule and machine learning predictions. We characterize the optimal sampling rule that minimizes its variance and design practical sampling strategies. We further extend the framework to U-statistic-based empirical risk minimization.

What carries the argument

The augmented inverse probability weighting (AIPW) U-statistic, which reweights each labeled kernel term by the inverse of its selection probability and adds a prediction-based correction term so that the active sampling does not bias the estimate.
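A minimal numerical sketch of that construction, under the simplest design the rebuttal below also invokes: independent Bernoulli (Poisson-type) sampling with per-unit probabilities pi, a symmetric pairwise kernel h, and predictions y_hat used to impute unlabeled units. The function name, the exact augmentation, and the toy data are illustrative assumptions, not the paper's code or notation.

    import numpy as np

    def aipw_u_statistic(y, y_hat, labeled, pi, h):
        # Hedged sketch of an augmented-IPW order-two U-statistic.
        # y       : true labels (entries are only read where labeled[i] is True)
        # y_hat   : ML predictions for every unit
        # labeled : inclusion indicators I_i drawn as independent Bernoulli(pi[i])
        # pi      : the per-unit sampling probabilities that generated `labeled`
        # h       : symmetric pairwise kernel h(a, b)
        # The augmentation below is one standard way to make each pair term
        # conditionally unbiased under independent sampling; the paper's exact
        # correction may differ.
        n = len(y_hat)
        total, n_pairs = 0.0, 0
        for i in range(n):
            for j in range(i + 1, n):
                base = h(y_hat[i], y_hat[j])               # prediction-only term
                term = base
                if labeled[i]:                             # correct the i-margin
                    term += (h(y[i], y_hat[j]) - base) / pi[i]
                if labeled[j]:                             # correct the j-margin
                    term += (h(y_hat[i], y[j]) - base) / pi[j]
                if labeled[i] and labeled[j]:              # joint correction
                    term += (h(y[i], y[j]) - h(y[i], y_hat[j])
                             - h(y_hat[i], y[j]) + base) / (pi[i] * pi[j])
                total += term
                n_pairs += 1
        return total / n_pairs

    # toy check with the Gini-mean-difference kernel |a - b| and a uniform budget
    rng = np.random.default_rng(0)
    n, budget = 500, 100
    y = rng.lognormal(size=n)
    y_hat = y * np.exp(0.2 * rng.normal(size=n))   # imperfect "ML" predictions
    pi = np.full(n, budget / n)                    # uniform probabilities, for illustration
    labeled = rng.random(n) < pi
    full = np.abs(y[:, None] - y[None, :])[np.triu_indices(n, 1)].mean()
    print(aipw_u_statistic(y, y_hat, labeled, pi, lambda a, b: abs(a - b)), full)

Averaged over many draws of `labeled`, the AIPW value should track the fully labeled U-statistic `full` while using only about `budget` labels; the paper's contribution lies in choosing non-uniform `pi` to shrink its variance.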

If this is right

  • Under the optimal sampling rule, the estimator achieves strictly lower asymptotic variance than passive uniform sampling at any fixed labeling budget.
  • The estimator remains asymptotically normal, so standard confidence intervals and tests stay valid.
  • The same active framework applies directly to U-statistic empirical risk minimization, improving model training efficiency under label constraints.
  • Practical sampling strategies derived from the optimal rule deliver measurable efficiency gains on real data while maintaining coverage.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If machine learning predictors continue to improve, the efficiency advantage of this active framework would grow without changing the core procedure.
  • The same weighting idea could be adapted to other semiparametric estimators that depend on pairwise or higher-order kernels.
  • Sequential or online versions of the sampling rule might further reduce the total labels needed in streaming settings.

Load-bearing premise

The machine learning predictions used to guide sampling are accurate enough not to introduce bias, and the augmented IPW U-statistic correctly incorporates the active sampling rule without invalidating the inference properties.

What would settle it

If experiments with deliberately inaccurate machine learning predictions showed the variance reduction disappearing, or confidence-interval coverage falling below the nominal level, the practical value of the framework would be undermined.
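A bare-bones version of that stress test, again under the assumptions of the earlier sketch (independent Bernoulli sampling, the Gini-mean-difference kernel, and an illustrative augmentation). It only inspects Monte Carlo bias and spread; building the confidence intervals themselves requires the paper's variance formula, which is not reproduced here.

    import numpy as np

    def aipw_gmd(y, y_hat, labeled, pi):
        # Vectorized AIPW estimate of the Gini mean difference E|Y1 - Y2|,
        # using the same illustrative augmentation as the sketch above.
        n = len(y)
        y_obs = np.where(labeled, y, 0.0)   # unlabeled entries always carry weight 0
        w = labeled / pi                     # inverse-probability weights
        K = lambda a, b: np.abs(a[:, None] - b[None, :])
        base = K(y_hat, y_hat)
        term = (base
                + w[:, None] * (K(y_obs, y_hat) - base)
                + w[None, :] * (K(y_hat, y_obs) - base)
                + (w[:, None] * w[None, :]) * (K(y_obs, y_obs) - K(y_obs, y_hat)
                                               - K(y_hat, y_obs) + base))
        return term[np.triu_indices(n, 1)].mean()

    def monte_carlo(pred_noise, reps=200, n=400, budget=120, seed=1):
        # pred_noise controls how badly the "ML" predictions are corrupted
        rng = np.random.default_rng(seed)
        ests = []
        for _ in range(reps):
            y = rng.lognormal(size=n)
            y_hat = y * np.exp(pred_noise * rng.normal(size=n))
            pi = np.full(n, budget / n)
            labeled = rng.random(n) < pi
            ests.append(aipw_gmd(y, y_hat, labeled, pi))
        ests = np.asarray(ests)
        return ests.mean(), ests.std()

    print("decent predictions   :", monte_carlo(pred_noise=0.1))
    print("corrupted predictions:", monte_carlo(pred_noise=2.0))

If the augmentation is correct, the mean should stay essentially unchanged across the two runs while the spread widens under corrupted predictions; a drifting mean, or nominal coverage lost once intervals are built on top, would be the refuting outcome described above.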

Figures

Figures reproduced from arXiv: 2605.11638 by Changliang Zou, Liuhua Peng, Xiaoning Wang, Yuyang Huo.

Figure 1. Income (ACS) dataset. Estimation of the Gini index as a measure of population income inequality. Left: 90% confidence intervals for the Gini index estimate at budgets n_b ∈ {800, 4800, 8800, 12000, 16000}. Middle: empirical coverage of the intervals (dashed line denotes the target coverage level of 90%). Right: effective sample size n_eff, with shaded ±1 standard-deviation bands.

Figure 2. Perioperative dataset (VitalDB). Estimation of the Wilcoxon signed-rank test statistic to assess whether hemoglobin exhibits a systematic shift at anesthesia induction. Left: 90% confidence intervals for the estimate at budgets n_b ∈ {76, 136, 196, 244, 304}. Middle: empirical coverage of the intervals. Right: effective sample size n_eff, with shaded ±1 standard-deviation bands.

Figure 3. Political bias dataset. Estimation of Kendall's τ to detect whether LLM-based labels distort the underlying ordinal structure and thereby bias downstream audits. Left: 90% confidence intervals for the estimate at budgets n_b ∈ {19, 114, 209, 285, 380}. Middle: empirical coverage of the intervals. Right: effective sample size n_eff, with shaded ±1 standard-deviation bands.

Figure 4. Savings in sample budget due to active inference. Reduction in the sample size required to achieve the same confidence-interval width across the applications shown in Figures 1–3.

Figure 5. Estimation of the Gini index. Left: empirical coverage of the intervals. Right: effective sample size n_eff, with shaded ±1 standard-deviation bands.

Figure 6. Average MSE under different choices of sample budget n_b.

Figure 7. Left: average MSE for the U-estimator under different choices of sample budget n_b. Right: n_eff for the U-estimator under different choices of sample budget n_b.
read the original abstract

$U$-statistics play a central role in statistical inference. In many modern applications, however, acquiring the labels required for $U$-statistics is costly. Motivated by recent advances in active inference, we develop an active inference framework for $U$-statistics that selectively queries informative labels to improve estimation efficiency under a fixed labeling budget, while preserving valid statistical inference. Our approach is built on the augmented inverse probability weighting $U$-statistic, which is designed to incorporate the sampling rule and machine learning predictions. We characterize the optimal sampling rule that minimizes its variance and design practical sampling strategies. We further extend the framework to $U$-statistic-based empirical risk minimization. Experiments on real datasets demonstrate substantial gains in estimation efficiency over baseline methods, while maintaining target coverage.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 0 minor

Summary. The manuscript develops an active inference framework for U-statistics that selectively queries labels using machine learning predictions to improve estimation efficiency under a fixed labeling budget. The estimator is built on an augmented inverse probability weighting U-statistic designed to incorporate the sampling rule and predictions while preserving valid statistical inference; the manuscript characterizes the optimal sampling rule that minimizes variance, extends the approach to U-statistic-based empirical risk minimization, and reports experimental gains in efficiency with maintained coverage on real datasets.

Significance. If the unbiasedness and variance characterization hold under active sampling, the framework could meaningfully advance efficient nonparametric inference in label-scarce regimes by bridging active learning with U-statistic theory. The explicit characterization of the optimal sampling rule and the extension to empirical risk minimization would be notable contributions if rigorously established.

major comments (1)
  1. The central claim that the augmented IPW U-statistic preserves unbiasedness and valid inference under the active per-instance sampling rule (guided by ML predictions) is load-bearing for the entire contribution. Because U-statistics are defined over pairs, the pair inclusion probability is the product of individual probabilities only under independent sampling; any dependence induced by the global active rule or by correlation between the predictor and labels could invalidate the simple per-instance augmentation. The manuscript should provide an explicit derivation (in the section introducing the augmented IPW U-statistic) showing that the augmentation term cancels the bias for all pairwise terms and that the variance formula remains correct under the stated sampling probabilities.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their careful reading of the manuscript and for highlighting the importance of rigorously establishing the unbiasedness properties of the augmented IPW U-statistic. We address the major comment in detail below and will incorporate the requested clarification into the revised version.

read point-by-point responses
  1. Referee: The central claim that the augmented IPW U-statistic preserves unbiasedness and valid inference under the active per-instance sampling rule (guided by ML predictions) is load-bearing for the entire contribution. Because U-statistics are defined over pairs, the pair inclusion probability is the product of individual probabilities only under independent sampling; any dependence induced by the global active rule or by correlation between the predictor and labels could invalidate the simple per-instance augmentation. The manuscript should provide an explicit derivation (in the section introducing the augmented IPW U-statistic) showing that the augmentation term cancels the bias for all pairwise terms and that the variance formula remains correct under the stated sampling probabilities.

    Authors: We agree that an explicit derivation is essential for establishing the validity of the framework. In the revised manuscript we will insert a dedicated subsection immediately following the definition of the augmented IPW U-statistic. The derivation proceeds as follows: the active sampling rule is implemented via independent Bernoulli draws with success probabilities π_i that depend only on the (fixed) machine-learning predictions for each instance; this is realized through a Poisson sampling scheme that respects the overall labeling budget in expectation while preserving independence of the inclusion indicators I_i. Consequently, P(I_i = 1, I_j = 1) = π_i π_j exactly. For any pair term, the estimator contribution is I_i I_j h(Y_i, Y_j) / (π_i π_j) augmented by correction terms that replace missing observations with the known predictions (scaled by the appropriate inclusion probabilities). Taking the conditional expectation given the predictions shows that each augmentation term exactly cancels the bias introduced by the missing indicators, yielding E[augmented pair term | predictions] = E[h(Y_i, Y_j) | predictions]. Unconditioning then recovers the unconditional expectation. The same conditioning argument shows that the variance formula, which already accounts for the second-moment structure of the indicators, remains valid; the correlation between the predictor and the labels is absorbed into the conditional expectations and does not affect the unbiasedness. We will also add a short remark clarifying that the Poisson approximation introduces only a negligible dependence that vanishes asymptotically under standard budget scaling. revision: yes
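A compressed version of that conditional-expectation argument, written here for an order-two kernel under independent Bernoulli draws; the notation and the particular augmentation are illustrative (they match the sketches above), not necessarily the manuscript's exact construction.

\[
\widehat h_{ij} \;=\; h(\hat\mu_i,\hat\mu_j)
+ \frac{I_i}{\pi_i}\bigl(h(Y_i,\hat\mu_j)-h(\hat\mu_i,\hat\mu_j)\bigr)
+ \frac{I_j}{\pi_j}\bigl(h(\hat\mu_i,Y_j)-h(\hat\mu_i,\hat\mu_j)\bigr)
+ \frac{I_i I_j}{\pi_i\pi_j}\bigl(h(Y_i,Y_j)-h(Y_i,\hat\mu_j)-h(\hat\mu_i,Y_j)+h(\hat\mu_i,\hat\mu_j)\bigr).
\]

Because the indicators are independent given the data $\mathcal{D}$, $\mathbb{E}[I_i/\pi_i \mid \mathcal{D}] = 1$ and $\mathbb{E}[I_i I_j/(\pi_i\pi_j) \mid \mathcal{D}] = 1$, so the terms telescope and $\mathbb{E}[\widehat h_{ij} \mid \mathcal{D}] = h(Y_i, Y_j)$; unconditional unbiasedness of the averaged estimator follows from the tower property. The step that breaks without Poisson-type sampling is precisely $\mathbb{E}[I_i I_j] = \pi_i \pi_j$, which is the dependence the referee flags.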

Circularity Check

0 steps flagged

No circularity: framework derives optimal rule from variance expression without self-definition or fitted-input renaming

full rationale

The abstract and skeptic summary describe a standard construction: an augmented IPW U-statistic is defined to incorporate a given sampling rule and ML predictions, after which the optimal rule is characterized by minimizing the resulting variance expression. This is a conventional optimization step (minimize Var(estimator) w.r.t. sampling probabilities) rather than a reduction by construction. No quoted equation shows the optimal rule being presupposed in the estimator definition, no self-citation is invoked as a uniqueness theorem, and no parameter fitted on one subset is relabeled as a prediction on another. The derivation chain therefore remains self-contained with independent mathematical content.
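As a hedged illustration of why this is a routine optimization step, the generic template is a Neyman-type allocation: if the leading variance term has the form $\sum_i \sigma_i^2/\pi_i$ for some residual-scale quantities $\sigma_i$ (here a stand-in for whatever enters the Hoeffding projection of the AIPW U-statistic, not the paper's exact expression), then minimizing under an expected-budget constraint gives

\[
\min_{\pi}\ \sum_{i=1}^{n} \frac{\sigma_i^2}{\pi_i}
\quad \text{s.t.} \quad \sum_{i=1}^{n} \pi_i = n_b,\ 0 < \pi_i \le 1
\qquad \Longrightarrow \qquad
\pi_i^{\star} \propto \sigma_i,
\]

by Cauchy–Schwarz or a Lagrange multiplier, up to capping probabilities at one. The point of the audit above is that $\pi^{\star}$ is derived from the variance of an estimator already fixed in advance, so no circular dependence enters its definition.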

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only abstract available; no explicit free parameters, axioms, or invented entities can be identified or audited.

pith-pipeline@v0.9.0 · 5422 in / 1037 out tokens · 44919 ms · 2026-05-13T01:12:06.217612+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.
