pith. machine review for the scientific record.

arxiv: 2605.05930 · v1 · submitted 2026-05-07 · 📊 stat.ME

Recognition: unknown

Toward design-based inference for data integration

Andrius Čiginas, Ieva Burakauskaitė, Jae Kwang Kim

Pith reviewed 2026-05-08 07:41 UTC · model grok-4.3

classification 📊 stat.ME
keywords design-based inference · non-probability sample · data integration · generalized regression estimator · finite population · sequential sampling · NMAR

The pith

Treating non-probability samples as certainty strata allows design-consistent finite population inference without any assumptions on their selection mechanism.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a sequential design-based approach to integrating non-probability samples with probability samples for finite population inference. In this framework, the non-probability sample is regarded as a fully observed certainty stratum, after which a probability sample is drawn from the remaining units. Two generalized regression estimators are proposed: one fitting the outcome model separately in the complementary stratum and one pooling both samples. Both estimators are design-consistent and have consistent variance estimators, with no requirements on the non-probability selection process, including cases where selection is not missing at random. This provides a robust alternative to methods that rely on modeling selection probabilities under unverifiable assumptions.

Core claim

The central discovery is that by using a two-phase sampling design where the non-probability sample constitutes the first phase with certainty, generalized regression estimators can be constructed that are consistent for the population total under the overall sampling design, irrespective of the mechanism that generated the non-probability sample.

What carries the argument

The sequential framework: the observed non-probability sample is treated as a certainty stratum, and a probability sample is then drawn from its complement, which enables design-based generalized regression estimation without selection modeling.
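
In symbols, the construction can be sketched as follows. The notation is assumed here for illustration and need not match the paper's: $U$ is the finite population, $A \subset U$ the observed non-probability sample, $B$ the probability sample drawn from $U \setminus A$ with known inclusion probabilities $\pi_i$, $x_i$ the auxiliary vector, and $\hat{\beta}$ a regression coefficient fitted from the sample data.

```latex
\hat{Y}_{\mathrm{GREG}}
  = \underbrace{\sum_{i \in A} y_i}_{\text{certainty stratum}}
  + \underbrace{\sum_{i \in U \setminus A} x_i^{\top}\hat{\beta}
  + \sum_{i \in B} \frac{y_i - x_i^{\top}\hat{\beta}}{\pi_i}}_{\text{regression estimator for the complement}},
\qquad \pi_i = \Pr(i \in B \mid A).
```

Under this reading, the separate version would fit $\hat{\beta}$ from $B$ alone and the pooled version from $A \cup B$; the selection mechanism behind $A$ never enters the formula.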

If this is right

  • Population parameters can be estimated consistently even under NMAR selection for the non-probability data.
  • Consistent variance estimation is available directly from the design.
  • A diagnostic test helps decide between separate and combined regression based on stratum homogeneity.
  • The non-probability sample can be used to optimize the second-stage sampling probabilities under a working model.
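
The last point can be illustrated with a minimal sketch. It assumes, as is standard for Isaki-Fuller-type designs, that second-stage inclusion probabilities are taken proportional to the square root of the working-model variance, and that this variance has already been estimated from the pilot sample. The function name `second_stage_probs`, the example variance values, and the capping rule are illustrative assumptions, not the paper's procedure.

```python
import numpy as np

def second_stage_probs(v, n):
    """Inclusion probabilities proportional to sqrt(v), summing to n, capped at 1.

    v : strictly positive working-model variances for the units in the
        complement (assumed given, e.g. estimated from the pilot sample).
    n : expected size of the second-stage probability sample.
    """
    size = np.sqrt(np.asarray(v, dtype=float))
    if n > size.size:
        raise ValueError("n cannot exceed the number of units")
    pi = np.zeros_like(size)
    free = np.ones(size.size, dtype=bool)        # units not yet fixed at probability 1
    while True:
        remaining = n - np.count_nonzero(~free)  # sample size left for the free units
        pi[free] = remaining * size[free] / size[free].sum()
        over = free & (pi > 1.0)
        if not over.any():
            break
        pi[over] = 1.0                           # these become take-all units
        free &= ~over
    return pi

# Hypothetical variances: the last unit is far more variable than the rest.
pi = second_stage_probs([1.0, 4.0, 9.0, 10000.0], n=2)
print(pi, pi.sum())
```

The iterative capping is one common way to keep every probability at or below 1 while preserving the expected sample size; other rules are possible.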

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • This method could be applied in official statistics to combine administrative data with targeted surveys.
  • It opens the door to efficiency gains by using large non-probability samples to guide where to allocate probability sampling resources.
  • Future work might extend the approach to more than two phases or incorporate multiple non-probability sources.

Load-bearing premise

After observing the non-probability sample, it must be feasible to draw a probability sample from the complementary part of the population.

What would settle it

Compare the estimator to the known population total in a simulation where the non-probability sample is selected based on the outcome variable and the probability sample from the complement is drawn with known inclusion probabilities; significant bias would falsify the consistency claim.
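
A minimal Monte Carlo sketch of such a check is given below. Everything in it is an illustrative assumption rather than the paper's simulation design: the synthetic population, the logistic outcome-dependent selection rule, simple random sampling without replacement from the complement, and a single auxiliary variable with the regression fitted on the probability sample only.

```python
import numpy as np

def simulate(n_reps=500, N=5000, n_b=300, seed=0):
    rng = np.random.default_rng(seed)
    # Synthetic finite population (assumed): one auxiliary x, outcome y.
    x = rng.uniform(1.0, 5.0, size=N)
    y = 2.0 + 3.0 * x + rng.normal(0.0, 2.0, size=N)
    X = np.column_stack([np.ones(N), x])
    Y = y.sum()                                   # known population total

    # NMAR selection (assumed): the chance of entering the non-probability
    # sample depends directly on the outcome y.
    p_sel = 1.0 / (1.0 + np.exp(-(y - y.mean()) / y.std()))

    re_seq, re_naive = [], []
    for _ in range(n_reps):
        in_a = rng.random(N) < 0.3 * p_sel        # non-probability sample A
        comp = np.flatnonzero(~in_a)              # complement of A
        b = rng.choice(comp, size=n_b, replace=False)  # SRS from the complement

        # Separate regression: fit on B only (equal weights under SRS).
        beta, *_ = np.linalg.lstsq(X[b], y[b], rcond=None)
        resid_b = y[b] - X[b] @ beta
        greg_comp = (X[comp] @ beta).sum() + comp.size / n_b * resid_b.sum()

        y_hat_seq = y[in_a].sum() + greg_comp     # certainty stratum + complement
        y_hat_naive = N * y[in_a].mean()          # ignores selection entirely

        re_seq.append(100.0 * (y_hat_seq - Y) / Y)
        re_naive.append(100.0 * (y_hat_naive - Y) / Y)
    return float(np.mean(re_seq)), float(np.mean(re_naive))

mean_re_seq, mean_re_naive = simulate()
print(f"sequential estimator, mean relative error: {mean_re_seq:.2f}%")
print(f"naive expansion of A, mean relative error: {mean_re_naive:.2f}%")
```

If the consistency claim holds, the first mean relative error should sit near zero while the naive expansion stays visibly biased; a persistent non-zero mean for the sequential estimator as `n_b` grows would count against the claim.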

Figures

Figures reproduced from arXiv: 2605.05930 by Andrius Čiginas, Ieva Burakauskaitė, Jae Kwang Kim.

Figure 1: Boxplots of relative errors RE = 100(Ŷ − Y)/Y (in percent) across Monte Carlo replications under the MAR mechanism. Outlier points are suppressed. For readability, the vertical axis limits are truncated at the 0.999 quantile of |RE| in the left panel and at the 0.99 quantile in the right panel (computed within each panel), so replications outside the displayed range are not shown.
Figure 2: Boxplots of relative errors RE = 100(Ŷ − Y)/Y (in percent) across Monte Carlo replications under the NMAR mechanism. Outlier points are suppressed. For readability, the vertical axis limits are truncated at the 0.999 quantile of |RE| in the left panel and at the 0.99 quantile in the right panel (computed within each panel), so replications outside the displayed range are not shown.
Figure 3: Heterogeneity between the administrative pilot sample …
Figure 4: Boxplots of relative errors RE = 100(Ŷ − Y)/Y (in percent) across Monte Carlo replications. Outlier points are suppressed. For readability, the vertical axis limits are truncated at the 0.999 quantile of |RE| in the left panel and at the 0.99 quantile in the right panel (computed within each panel), so replications outside the displayed range are not shown.
Figure 5: Heterogeneity between the pharmacy pilot sample …
Figure 6: Boxplots of relative errors RE = 100(Ŷ − Y)/Y (in percent) across Monte Carlo replications. Outlier points are suppressed. For readability, the vertical axis limits are truncated at the 0.999 quantile of |RE| in the left panel and at the 0.99 quantile in the right panel (computed within each panel), so replications outside the displayed range are not shown.

Original abstract

Integrating non-probability samples into finite-population inference typically requires modeling unknown selection probabilities under a missing-at-random (MAR) assumption that is difficult to verify. We propose a design-based alternative in which the non-probability sample is treated as a fully observed certainty stratum and a probability sample is drawn only from the complementary, previously unsampled units. Within this sequential framework, we develop two generalized regression estimators: one fitting the outcome model separately in the complementary stratum, the other pooling both samples; we make two distinct contributions. First, both estimators are design-consistent and admit consistent variance estimators with no assumption whatsoever on the non-probability selection mechanism, including under not-missing-at-random (NMAR) selection. Second, under a working superpopulation model that holds in both strata, the pilot non-probability sample can be used to construct second-stage inclusion probabilities that achieve Isaki-Fuller asymptotic optimality for the separate estimator; this optimality claim relies on assumptions strictly stronger than MAR, but its failure does not invalidate the consistency results above. A diagnostic test for coefficient homogeneity is proposed to guide the choice between the two estimators. Simulations confirm that the sequential estimators remain essentially unbiased under both MAR and NMAR, while propensity-adjusted competitors can be severely biased under NMAR. Two applications from Lithuanian official statistics illustrate that separate regression is preferable when the pilot stratum and its complement are strongly heterogeneous, whereas combined regression offers a modest efficiency gain when the two strata are similar.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 4 minor

Summary. The manuscript proposes a sequential design-based framework for finite-population inference with integrated non-probability samples. The non-probability sample is treated as a fully observed certainty stratum, after which a probability sample is drawn from the complementary units. Two generalized regression estimators are developed: one fitting the outcome model separately within the complement and one pooling both samples. Both are claimed to be design-consistent for the population total (with consistent variance estimators) under arbitrary non-probability selection mechanisms, including NMAR, with no modeling assumptions required on selection. Under a working superpopulation model, the pilot sample informs second-stage probabilities for asymptotic optimality of the separate estimator (conditional on stronger assumptions than MAR). A diagnostic test for coefficient homogeneity guides estimator choice. Simulations show near-unbiased performance under MAR and NMAR (unlike propensity methods under NMAR), and two Lithuanian official statistics applications illustrate practical use.

Significance. If the central claims hold, the work provides a valuable design-based route to data integration that avoids unverifiable MAR assumptions on unknown selection probabilities, a common practical barrier in official statistics. The design-consistency result, which follows from Horvitz-Thompson unbiasedness for the auxiliaries once the certainty stratum is fixed, is a clear strength and is supported by the simulation evidence of robustness under NMAR. The consistent variance estimators, optimality result (explicitly caveated), and homogeneity diagnostic add to the contribution's utility. The approach is grounded in finite-population principles and could meaningfully influence integration practices where a second-stage probability sample from the complement is feasible.

major comments (2)
  1. [Theoretical development of variance estimators] The design-consistency and variance-estimator consistency claims are load-bearing for the first contribution. Explicit derivations of the variance formulas (or at minimum the precise form of the variance estimator and the conditions for its consistency) should be provided in the theoretical section, as the abstract asserts consistency without assumptions on the non-probability mechanism but the auditability of this step is limited without the details.
  2. [Framework description and consistency proof] The sequential framework treats the non-probability sample as a certainty stratum with inclusion probability 1 and draws the probability sample only from the complement. While this is presented as a design choice, the manuscript should explicitly state the conditions under which such a second-stage sample can be drawn in practice and discuss any resulting limitations on applicability, because this underpins the Horvitz-Thompson property used for unbiasedness.
minor comments (4)
  1. [Abstract] The abstract states that 'we make two distinct contributions' but then folds the optimality result and diagnostic test into the narrative; rephrasing to list the contributions more crisply would improve readability.
  2. [Simulations] In the simulation section, report the exact finite-population size, sample sizes for both stages, and the precise mechanism used to generate the non-probability sample under NMAR (e.g., how the selection probabilities are constructed) to facilitate replication and verification of the unbiasedness results.
  3. [Applications] The applications section would benefit from a brief table or description of the auxiliary variables employed in the GREG calibration and the outcome of the homogeneity diagnostic test, so readers can see how the choice between separate and combined estimators was made.
  4. [Notation and definitions] Notation for the two estimators (separate vs. pooled) and for the inclusion probabilities should be introduced once and used consistently; a small notation table or clear definitions early in the methods would reduce ambiguity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive evaluation and the detailed suggestions for improving the manuscript. We address each major comment below and have revised the paper accordingly to enhance clarity and completeness.

  1. Referee: [Theoretical development of variance estimators] The design-consistency and variance-estimator consistency claims are load-bearing for the first contribution. Explicit derivations of the variance formulas (or at minimum the precise form of the variance estimator and the conditions for its consistency) should be provided in the theoretical section, as the abstract asserts consistency without assumptions on the non-probability mechanism but the auditability of this step is limited without the details.

    Authors: We agree that the explicit derivations strengthen the theoretical section. In the revised manuscript, we have added a dedicated subsection (Section 3.3) that provides the full derivations of the variance estimators for both the separate and pooled generalized regression estimators. Starting from the Horvitz-Thompson unbiasedness for the auxiliary totals (which holds once the certainty stratum is fixed), we derive the exact variance expressions under the sequential design and state the conditions for design-consistency of the variance estimators without any assumptions on the non-probability selection mechanism. These derivations are now in the main text rather than relying solely on the appendix. revision: yes

  2. Referee: [Framework description and consistency proof] The sequential framework treats the non-probability sample as a certainty stratum with inclusion probability 1 and draws the probability sample only from the complement. While this is presented as a design choice, the manuscript should explicitly state the conditions under which such a second-stage sample can be drawn in practice and discuss any resulting limitations on applicability, because this underpins the Horvitz-Thompson property used for unbiasedness.

    Authors: We have revised the framework description in Section 2 to explicitly state the practical conditions required: the availability of a sampling frame for the target finite population that permits identification of the non-probability sample units so they can be excluded from the second-stage draw. We now discuss the resulting limitations, including scenarios with incomplete frames, imperfect matching between the non-probability sample and the frame, or logistical constraints on drawing from the complement. These additions clarify when the Horvitz-Thompson property applies and the scope of applicability, while noting that the design-consistency results remain valid under the stated conditions. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain


The paper's central claims of design-consistency for both generalized regression estimators follow directly from the sequential two-stage sampling framework: non-probability units are assigned inclusion probability 1 as a certainty stratum, a probability sample is drawn from the complement, and GREG calibration ensures the design expectation equals the finite-population total via Horvitz-Thompson unbiasedness for auxiliaries. This holds conditionally on the observed stratum without reference to the non-probability selection mechanism or any fitted parameters derived from it. The optimality result for the separate estimator is explicitly conditional on a working superpopulation model and does not affect the consistency results. No self-definitional reductions, fitted-input predictions, or load-bearing self-citations appear in the load-bearing steps.
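
The key step can be written out as a short derivation. The notation is an assumption introduced here: $U$ is the population, $A$ the realized non-probability sample, $B$ the probability sample drawn from $U \setminus A$ with inclusion probabilities $\pi_i = \Pr(i \in B \mid A) > 0$, and $I_i$ the indicator that unit $i$ falls in $B$. The sketch covers the Horvitz-Thompson form only; regression-assisted versions are approximately, not exactly, unbiased in large samples.

```latex
\begin{aligned}
\hat{Y}_{\mathrm{HT}}
  &= \sum_{i \in A} y_i + \sum_{i \in U \setminus A} \frac{I_i\, y_i}{\pi_i}, \\
E\bigl(\hat{Y}_{\mathrm{HT}} \mid A\bigr)
  &= \sum_{i \in A} y_i + \sum_{i \in U \setminus A} \frac{E(I_i \mid A)}{\pi_i}\, y_i
   = \sum_{i \in A} y_i + \sum_{i \in U \setminus A} y_i
   = Y .
\end{aligned}
```

Because the identity holds for every realized $A$, it also holds after averaging over whatever mechanism produced $A$, which is why no selection model enters the argument.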

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The paper rests on standard finite-population sampling theory and the ability to draw a probability sample from the complement; no new entities are postulated and no free parameters are fitted to achieve the consistency claims.

axioms (2)
  • [standard math] Finite-population design-based inference framework
    Invoked throughout as the basis for consistency without superpopulation assumptions on selection.
  • [domain assumption] Ability to draw a probability sample from the previously unsampled units
    Central to the sequential design; stated as a practical design choice.

pith-pipeline@v0.9.0 · 5566 in / 1382 out tokens · 27224 ms · 2026-05-08T07:41:08.945600+00:00 · methodology


Reference graph

Works this paper leans on

11 extracted references

  1. F. J. Breidt and J. D. Opsomer. Model-assisted survey estimation with modern prediction techniques. Statistical Science. 2017
  2. Y. Chen, P. Li, and C. Wu. Doubly robust inference with nonprobability survey samples. Journal of the American Statistical Association. 2020
  3. A. Čiginas, D. Krapavickaitė, and V. Nekrašaitė-Liegė. Evaluating the impact of a non-probability sample-based estimator in a linear combination with an estimator from a probability sample. Journal of Official Statistics. 2025
  4. M. Davidian and R. J. Carroll. Variance function estimation. Journal of the American Statistical Association. 1987
  5. M. R. Elliott and R. Valliant. Inference for nonprobability samples. Statistical Science. 2017
  6. W. A. Fuller. Sampling Statistics. 2009
  7. N. Golini and P. Righi. Integrating probability and big non-probability samples data to produce Official Statistics. Statistical Methods & Applications. 2024
  8. C. T. Isaki and W. A. Fuller. Survey design under the regression superpopulation model. Journal of the American Statistical Association. 1982
  9. J. K. Kim. Statistics in Survey Sampling. 2025
  10. J. K. Kim and S. Tam. Data integration by combining big data and survey sample data for finite population inference. International Statistical Review. 2021
  11. J. K. Kim and Z. Wang. Sampling techniques for big data analysis. International Statistical Review. 2019