Machine learning methods for finite population parameter estimation in survey sampling
1 Pith paper cites this work. Polarity classification is still indexing.
abstract
This pedagogical review examines the use of machine learning methods in finite-population inference for survey sampling, with an emphasis on design-based validity and statistical inference. While flexible prediction tools offer substantial gains in estimation accuracy, they also introduce important challenges, primarily due to the dependence between the fitted predictors and the sample. We focus on settings in which such predictions enter survey estimation through model-assisted estimation, item nonresponse imputation, and unit nonresponse adjustment. For model-assisted estimation and item nonresponse, we show how cross-fitting and Neyman-orthogonal estimating equations can adapt ideas from double/debiased machine learning to survey data, allowing the use of high-dimensional or nonparametric learners while preserving root-n consistency and asymptotic normality under suitable conditions. In contrast, for unit nonresponse, standard inverse-probability weighting remains outcome-agnostic and operationally attractive, but this same feature makes doubly robust and orthogonal constructions harder to deploy in official statistics. We also briefly discuss related developments in small area estimation and probability/nonprobability data integration. Overall, the paper highlights both the promise of machine learning and the fundamental inferential challenges it raises for survey practice.
fields
stat.ME (1)
years
2026 (1)
verdicts
UNVERDICTED (1)
representative citing papers
- Cross-Fitted Survey-Weighted TMLE with Design-Based Variance for Causal Machine Learning
Cluster-level cross-fitting restores valid coverage for survey-weighted TMLE with flexible learners under stratified multistage designs, while single-fit and internal cross-validation versions under-cover.