pith. machine review for the scientific record.

arxiv: 2604.25966 · v1 · submitted 2026-04-28 · 📊 stat.ME · math.ST · stat.TH

Principal Component Based Estimation of Finite Population Mean under Multicollinearity

Rajesh Singh, Shobh Nath Tiwari

Pith reviewed 2026-05-07 15:42 UTC · model grok-4.3

keywords principal component analysis · finite population mean · multicollinearity · auxiliary variables · survey sampling · mean square error · relative efficiency · simple random sampling

The pith

A principal component analysis estimator for the finite population mean removes multicollinearity from auxiliary variables and yields lower mean squared error than standard methods.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops an approach to estimate the mean of a finite population when two auxiliary variables are highly correlated with each other, a situation that destabilizes conventional estimators such as ratio or regression estimators. It applies principal component analysis to replace the original correlated auxiliaries with a smaller set of uncorrelated components that preserve most of the information relevant to the study variable. Bias and mean squared error of the resulting estimator are derived to first-order approximation under simple random sampling without replacement. Empirical data and Monte Carlo simulations then compare the new estimator against several classical alternatives across different correlation levels, showing lower mean squared error and higher percentage relative efficiency when multicollinearity is present.

Core claim

The central claim is that an estimator constructed from the principal components of two multicollinear auxiliary variables achieves lower mean squared error and higher relative efficiency than conventional estimators for the finite population mean. The bias and MSE expressions are obtained up to the first order of approximation, and numerical studies confirm the advantage whenever the auxiliary variables exhibit multicollinearity as measured by variance inflation factors, condition indices, and eigenvalues.
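For orientation, the quantities named in this claim have standard textbook forms under simple random sampling without replacement. The notation below (N, n, f, S_y^2, rho_yx, t_reg, t) is assumed here for illustration and is not taken from the paper; the second line is the familiar first-order MSE of the classical one-auxiliary regression estimator, not the paper's expression for its PCA-based estimator.

```latex
\[
\operatorname{Var}(\bar{y}) = \frac{1-f}{n}\,S_y^{2},
\qquad f = \frac{n}{N},
\qquad S_y^{2} = \frac{1}{N-1}\sum_{i=1}^{N}\bigl(y_i-\bar{Y}\bigr)^{2},
\]
\[
\operatorname{MSE}(t_{\mathrm{reg}}) \approx \frac{1-f}{n}\,S_y^{2}\,\bigl(1-\rho_{yx}^{2}\bigr),
\]
\[
\operatorname{PRE}(t,\bar{y}) = \frac{\operatorname{Var}(\bar{y})}{\operatorname{MSE}(t)}\times 100 .
\]
```

A PRE above 100 means the estimator t has lower MSE than the plain sample mean; PRE is commonly reported against the sample mean, though the baseline used in the paper's tables should be checked in the full text.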

What carries the argument

The PCA-based estimator formed by regressing the study variable on the orthogonal principal components extracted from the auxiliary variables, thereby eliminating multicollinearity while retaining auxiliary information.
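The mechanism can be sketched in a few lines. The following is a generic principal-component regression estimator for two auxiliaries, written under assumptions that are not confirmed by the summary: PCA is run on the sample covariance matrix, the population means of both auxiliaries are known, and the adjustment is linear. The function names and the toy data are hypothetical, and the paper's own estimator may take a different functional form.

```python
import math
import random


def mean(v):
    return sum(v) / len(v)


def cov(a, b):
    """Sample covariance with divisor n - 1."""
    ma, mb = mean(a), mean(b)
    return sum((p - ma) * (q - mb) for p, q in zip(a, b)) / (len(a) - 1)


def pca_loadings_2d(x1, x2):
    """Orthonormal eigenvectors of the 2x2 sample covariance matrix,
    leading component first (closed form for two variables)."""
    s11, s22, s12 = cov(x1, x1), cov(x2, x2), cov(x1, x2)
    theta = 0.5 * math.atan2(2.0 * s12, s11 - s22)  # angle of the major axis
    c, s = math.cos(theta), math.sin(theta)
    return [(c, s), (-s, c)]


def pc_regression_estimate(y, x1, x2, X1_bar, X2_bar, k=1):
    """Sample mean of y adjusted by the first k principal-component scores.

    X1_bar and X2_bar are the (assumed known) population means of the auxiliaries.
    """
    estimate = mean(y)
    for a, b in pca_loadings_2d(x1, x2)[:k]:
        z = [a * u + b * w for u, w in zip(x1, x2)]  # sample scores
        Z_bar = a * X1_bar + b * X2_bar              # population mean of the score
        slope = cov(y, z) / cov(z, z)                # simple regression slope on z
        estimate += slope * (Z_bar - mean(z))
    return estimate


# Toy illustration with made-up data: x2 is nearly a copy of x1.
rng = random.Random(0)
x1 = [rng.gauss(50, 10) for _ in range(25)]
x2 = [u + rng.gauss(0, 1) for u in x1]
y = [5 + 0.4 * u + 0.4 * w + rng.gauss(0, 2) for u, w in zip(x1, x2)]
print(pc_regression_estimate(y, x1, x2, X1_bar=50.0, X2_bar=50.0, k=1))
```

One design point worth keeping in mind: because least-squares fits are unchanged by rotating the regressors, this sketch with both components retained (k=2) coincides with the ordinary two-auxiliary regression estimator. In a linear form like this one, any difference from the regression estimator comes from dropping a component.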

If this is right

  • The estimator remains stable and efficient even when the auxiliary variables are highly intercorrelated.
  • First-order approximations supply usable expressions for bias and mean squared error that match simulation behavior for moderate sample sizes.
  • The method produces higher percentage relative efficiency than ratio, product, and regression estimators specifically under multicollinearity.
  • Variance inflation factors and condition indices can be used beforehand to detect the multicollinearity that the PCA transformation is designed to address.
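For the two-auxiliary case the diagnostics in the last bullet reduce to simple functions of the correlation r between the auxiliaries. The sketch below uses the simplified version based on the 2x2 correlation matrix, whose eigenvalues are 1 + |r| and 1 - |r|; this is an assumption made for illustration, and condition indices computed from a scaled design matrix with an intercept would generally differ. The function names are hypothetical.

```python
import math


def pearson_r(a, b):
    """Pearson correlation between two equal-length sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    sab = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    saa = sum((p - ma) ** 2 for p in a)
    sbb = sum((q - mb) ** 2 for q in b)
    return sab / math.sqrt(saa * sbb)


def collinearity_diagnostics(x1, x2):
    """VIF, eigenvalues, and condition indices for two auxiliaries.

    Raises ZeroDivisionError when the auxiliaries are perfectly collinear.
    """
    r = pearson_r(x1, x2)
    vif = 1.0 / (1.0 - r * r)              # identical for both variables
    eig = (1.0 + abs(r), 1.0 - abs(r))     # eigenvalues of the correlation matrix
    cond = tuple(math.sqrt(eig[0] / lam) for lam in eig)
    return {"r": r, "vif": vif, "eigenvalues": eig, "condition_indices": cond}


# Made-up data with r = 0.8 between the two auxiliaries.
d = collinearity_diagnostics([1, 2, 3, 4, 5], [2, 1, 4, 3, 5])
print(d)
```

Common rules of thumb treat a VIF above about 10 or a condition index above about 30 as a warning sign; these thresholds are conventions, not results from the paper.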

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same transformation could be applied when more than two auxiliary variables are available by retaining the leading principal components that explain most variance.
  • The approach suggests a general route for dimension reduction in survey sampling whenever high-dimensional auxiliary data exhibit collinearity.
  • Performance under complex sampling designs such as stratified or cluster sampling remains unexamined and could be tested next.

Load-bearing premise

The principal components must capture enough of the original auxiliary information that the resulting estimator actually improves efficiency rather than discarding useful signal.
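A small constructed example shows why this premise matters: a component can carry a negligible share of the auxiliary variance and still be the one most correlated with the study variable. The data below are made up so that y depends on the difference between the two auxiliaries, and the function name is hypothetical; nothing here reflects the paper's data.

```python
import math


def mean(v):
    return sum(v) / len(v)


def cov(a, b):
    """Sample covariance with divisor n - 1."""
    ma, mb = mean(a), mean(b)
    return sum((p - ma) * (q - mb) for p, q in zip(a, b)) / (len(a) - 1)


def component_report(y, x1, x2):
    """For each principal component of (x1, x2), leading one first, return
    (share of total auxiliary variance, correlation of its scores with y)."""
    s11, s22, s12 = cov(x1, x1), cov(x2, x2), cov(x1, x2)
    theta = 0.5 * math.atan2(2.0 * s12, s11 - s22)
    c, s = math.cos(theta), math.sin(theta)
    report = []
    for a, b in ((c, s), (-s, c)):
        z = [a * u + b * w for u, w in zip(x1, x2)]
        share = cov(z, z) / (s11 + s22)
        corr = cov(y, z) / math.sqrt(cov(y, y) * cov(z, z))
        report.append((share, corr))
    return report


# Made-up data: x1 and x2 share a common trend t and differ by a small term e;
# y is driven by e, that is, by the low-variance direction.
t = [1, 2, 3, 4, 5, 6]
e = [0.1, -0.1, 0.1, -0.1, 0.1, -0.1]
x1 = [a + b for a, b in zip(t, e)]
x2 = [a - b for a, b in zip(t, e)]
y = [5 + 10 * b for b in e]

for j, (share, corr) in enumerate(component_report(y, x1, x2), start=1):
    print(f"PC{j}: variance share = {share:.4f}, corr with y = {corr:+.3f}")
```

In this construction a rule that keeps only the leading component would discard most of the signal, which is the failure mode the premise rules out.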

What would settle it

A Monte Carlo experiment with two auxiliary variables having correlation above 0.9, in which the proposed estimator records higher mean squared error than the ordinary regression estimator across repeated samples, would falsify the performance advantage.
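A minimal version of such an experiment might look like the sketch below. Everything specific in it is an assumption chosen for illustration: the superpopulation model for y, the sizes N and n, the number of replicates, and the use of a generic one-component adjustment in place of the paper's estimator. Each replicate draws a fresh SRSWOR sample and re-estimates the loadings from that sample.

```python
import math
import random


def mean(v):
    return sum(v) / len(v)


def cov(a, b):
    """Sample covariance with divisor n - 1."""
    ma, mb = mean(a), mean(b)
    return sum((p - ma) * (q - mb) for p, q in zip(a, b)) / (len(a) - 1)


def regression_estimate(y, x1, x2, X1_bar, X2_bar):
    """Ordinary regression estimator with two auxiliaries (2x2 normal equations)."""
    s11, s22, s12 = cov(x1, x1), cov(x2, x2), cov(x1, x2)
    s1y, s2y = cov(x1, y), cov(x2, y)
    det = s11 * s22 - s12 * s12
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    return mean(y) + b1 * (X1_bar - mean(x1)) + b2 * (X2_bar - mean(x2))


def pc1_estimate(y, x1, x2, X1_bar, X2_bar):
    """Regression-type adjustment using only the first principal component."""
    s11, s22, s12 = cov(x1, x1), cov(x2, x2), cov(x1, x2)
    theta = 0.5 * math.atan2(2.0 * s12, s11 - s22)
    a, b = math.cos(theta), math.sin(theta)
    z = [a * u + b * w for u, w in zip(x1, x2)]
    slope = cov(y, z) / cov(z, z)
    return mean(y) + slope * (a * X1_bar + b * X2_bar - mean(z))


def simulate(rho=0.95, N=500, n=30, reps=1000, seed=1):
    rng = random.Random(seed)
    # Fixed finite population; the model for Y is an illustrative assumption.
    X1 = [rng.gauss(0, 1) for _ in range(N)]
    X2 = [rho * u + math.sqrt(1 - rho * rho) * rng.gauss(0, 1) for u in X1]
    Y = [10 + 2 * u + 2 * w + rng.gauss(0, 1) for u, w in zip(X1, X2)]
    Y_bar, X1_bar, X2_bar = mean(Y), mean(X1), mean(X2)
    sq_err = {"mean": 0.0, "regression": 0.0, "pc1": 0.0}
    for _ in range(reps):
        idx = rng.sample(range(N), n)  # SRSWOR draw of unit labels
        y = [Y[i] for i in idx]
        x1 = [X1[i] for i in idx]
        x2 = [X2[i] for i in idx]
        est = {
            "mean": mean(y),
            "regression": regression_estimate(y, x1, x2, X1_bar, X2_bar),
            "pc1": pc1_estimate(y, x1, x2, X1_bar, X2_bar),
        }
        for name in sq_err:
            sq_err[name] += (est[name] - Y_bar) ** 2
    mse = {name: total / reps for name, total in sq_err.items()}
    pre = {name: 100.0 * mse["mean"] / mse[name] for name in mse}
    return mse, pre


mse, pre = simulate()
for name in mse:
    print(f"{name:>10}: MSE = {mse[name]:.5f}, PRE = {pre[name]:.1f}")
```

The population here favors the first component, since y loads equally on both auxiliaries. A fair test of the claim would also include populations where y depends on the low-variance direction, and would substitute the paper's actual estimator for `pc1_estimate`.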

Figures

Figures reproduced from arXiv: 2604.25966 by Rajesh Singh, Shobh Nath Tiwari.

Figure 1: Flowchart of the proposed PCA-based estimation procedure. The flowchart presents the sequential steps involved in constructing the proposed estimator and addressing multicollinearity through the application of PCA.

Original abstract

Auxiliary information is frequently utilized in survey sampling to improve the efficiency of estimators of the finite population mean. However, the simultaneous use of multiple auxiliary variables often induces multicollinearity, which adversely affects the stability and performance of conventional estimators. To address this issue, the present study proposes a principal component analysis (PCA) based estimation approach for the finite population mean in the presence of multicollinearity between two auxiliary variables. The proposed methodology transforms the correlated auxiliary variables into a set of orthogonal principal components, thereby removing the effect of multicollinearity while preserving the essential information contained in the auxiliary variables. An efficient estimator is then constructed using these components under simple random sampling without replacement. The bias and mean square error (MSE) of the proposed estimator are derived up to the first order of approximation. The performance of the proposed estimator is evaluated through both empirical and simulation studies under varying correlation structures. Moreover, the presence of multicollinearity is evaluated using variance inflation factors, condition indices, and eigenvalues. The results from empirical and simulation studies demonstrate that the proposed PCA-based estimator outperforms several conventional estimators in terms of MSE and percentage relative efficiency (PRE) when multicollinearity exists, ensuring robust and efficient estimation of the population mean.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes a PCA-based estimator for the finite population mean under simple random sampling without replacement when two auxiliary variables are multicollinear. It transforms the auxiliaries into orthogonal principal components, constructs a regression-type estimator using the retained components, derives bias and MSE expressions to the first order of approximation, and reports that the new estimator attains lower MSE and higher percentage relative efficiency than conventional ratio, regression, and product estimators in both an empirical population and Monte Carlo simulations across varying correlation structures. Multicollinearity is diagnosed via VIF, condition indices, and eigenvalues.

Significance. If the first-order MSE derivation is shown to be valid after accounting for the sampling variability of the estimated principal components, the work would supply a practical, easily implemented tool for survey practitioners facing multicollinear auxiliaries. The combination of explicit analytic approximations with both empirical and simulation evidence is a strength; however, the central performance claims rest on the accuracy of those approximations and the generality of the simulation design.

major comments (3)
  1. [§4] §4 (Derivation of bias and MSE): The first-order Taylor expansion for the MSE of the proposed estimator conditions on the sample auxiliary matrix X and treats the eigenvectors (loadings) as fixed. This omits the additional variance and covariance terms that arise because the principal components themselves are estimated from the same sample; the resulting MSE expression therefore cannot be directly compared with the MSEs of conventional estimators that do not involve data-driven orthogonalization.
  2. [Table 2] Table 2 (simulation results, high-correlation regime): The reported MSE advantage of the PCA estimator is largest precisely when the two auxiliaries are most collinear, yet the simulation design fixes the population correlation matrix and draws only the y-values; it does not vary the sampling variability of the estimated PCs across replicates. Consequently the empirical superiority may be an artifact of the chosen correlation regimes rather than a general property.
  3. [§5.1] §5.1 (empirical study): The single real population is analyzed with only two auxiliaries; no sensitivity check is provided for the number of retained principal components or for the effect of replacing the sample eigenvectors with population eigenvectors. This leaves open whether the reported PRE gains are robust to the choice of truncation rule.
minor comments (2)
  1. [§3] Notation: The symbol for the estimated principal-component scores is introduced without an explicit definition linking it to the sample covariance matrix; a short display equation would remove ambiguity.
  2. [Figure 1] Figure 1 (eigenvalue plot): The vertical axis label is missing units and the legend does not distinguish sample from population eigenvalues.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments on our manuscript. We address each major comment point by point below, providing our responses and indicating planned revisions where appropriate.

Point-by-point responses
  1. Referee: §4 (Derivation of bias and MSE): The first-order Taylor expansion for the MSE of the proposed estimator conditions on the sample auxiliary matrix X and treats the eigenvectors (loadings) as fixed. This omits the additional variance and covariance terms that arise because the principal components themselves are estimated from the same sample; the resulting MSE expression therefore cannot be directly compared with the MSEs of conventional estimators that do not involve data-driven orthogonalization.

    Authors: We acknowledge that the first-order approximation conditions on the realized sample auxiliary matrix X and treats the estimated eigenvectors as fixed. This approach follows standard practice in survey sampling for deriving approximate MSE expressions when sample-based transformations are involved, as the additional variability from eigenvector estimation enters at higher order for large samples. We agree that a fully unconditional MSE would include extra terms, but the conditional form remains useful for comparison under the same sampling design. We will revise §4 to explicitly note the conditional nature of the approximation and discuss its implications relative to conventional estimators.
    Revision: partial

  2. Referee: Table 2 (simulation results, high-correlation regime): The reported MSE advantage of the PCA estimator is largest precisely when the two auxiliaries are most collinear, yet the simulation design fixes the population correlation matrix and draws only the y-values; it does not vary the sampling variability of the estimated PCs across replicates. Consequently the empirical superiority may be an artifact of the chosen correlation regimes rather than a general property.

    Authors: In the Monte Carlo design, a fixed finite population is generated with the specified correlation structure among auxiliaries; each replicate then draws an SRSWOR sample, so both the sample auxiliary matrix and the study variable values vary across replicates. The principal components are therefore re-estimated from the sample auxiliaries in every replicate, incorporating their sampling variability. The y-values are generated conditionally on the auxiliaries to control the correlation with the study variable. We will revise the simulation section to clarify this design explicitly and confirm that the reported gains reflect the full sampling process rather than fixed PCs.
    Revision: yes

  3. Referee: §5.1 (empirical study): The single real population is analyzed with only two auxiliaries; no sensitivity check is provided for the number of retained principal components or for the effect of replacing the sample eigenvectors with population eigenvectors. This leaves open whether the reported PRE gains are robust to the choice of truncation rule.

    Authors: With only two auxiliary variables, the number of retained components is limited (typically one or two, selected via eigenvalues or variance explained). We agree that a sensitivity analysis would strengthen the empirical results. We will revise §5.1 to include checks for alternative truncation rules and, where possible, compare sample-based eigenvectors with population eigenvectors to assess robustness of the PRE gains.
    Revision: yes
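The comparison of sample and population eigenvectors raised in the first and third exchanges can be probed numerically. The sketch below is an illustration under assumed settings (bivariate normal auxiliaries, N = 500, n = 30, 500 replicates), not a reproduction of anything in the paper. It measures how far the sample principal axis swings away from the population principal axis across SRSWOR samples; the function names are hypothetical.

```python
import math
import random


def mean(v):
    return sum(v) / len(v)


def cov(a, b):
    """Sample covariance with divisor n - 1."""
    ma, mb = mean(a), mean(b)
    return sum((p - ma) * (q - mb) for p, q in zip(a, b)) / (len(a) - 1)


def major_axis_angle(x1, x2):
    """Angle of the leading eigenvector of the 2x2 covariance matrix."""
    s11, s22, s12 = cov(x1, x1), cov(x2, x2), cov(x1, x2)
    return 0.5 * math.atan2(2.0 * s12, s11 - s22)


def mean_abs_axis_deviation(rho, N=500, n=30, reps=500, seed=2):
    """Average absolute angle (radians) between sample and population axes."""
    rng = random.Random(seed)
    X1 = [rng.gauss(0, 1) for _ in range(N)]
    X2 = [rho * u + math.sqrt(1 - rho * rho) * rng.gauss(0, 1) for u in X1]
    theta_pop = major_axis_angle(X1, X2)
    total = 0.0
    for _ in range(reps):
        idx = rng.sample(range(N), n)  # SRSWOR draw of unit labels
        theta = major_axis_angle([X1[i] for i in idx], [X2[i] for i in idx])
        # An axis has no direction, so compare angles modulo pi.
        d = (theta - theta_pop + math.pi / 2) % math.pi - math.pi / 2
        total += abs(d)
    return total / reps


high = mean_abs_axis_deviation(0.95)
low = mean_abs_axis_deviation(0.10)
print(f"rho = 0.95: mean |angle| = {high:.3f} rad")
print(f"rho = 0.10: mean |angle| = {low:.3f} rad")
```

In the two-variable case the leading axis is pinned down more tightly when the eigenvalues are well separated, which is exactly the high-collinearity regime. A check like this indicates how much of the referee's concern about estimated loadings applies at a given correlation level and sample size.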

Circularity Check

0 steps flagged

No circularity: estimator definition, first-order MSE, and external validation are independent

The paper defines the PCA-based estimator explicitly via orthogonal transformation of the sample auxiliary matrix, derives bias/MSE approximations under standard first-order Taylor expansion for survey estimators, and evaluates performance via separate empirical data and Monte Carlo simulations under controlled correlation structures. No equation reduces a claimed result to a fitted parameter or self-citation by construction; the outperformance claim rests on numerical comparisons rather than algebraic identity. The derivation chain is self-contained against the stated assumptions and does not invoke load-bearing self-citations or rename known results.

Axiom & Free-Parameter Ledger

0 free parameters · 3 axioms · 0 invented entities

Based on the abstract, the approach relies on standard assumptions in finite population sampling and PCA. No free parameters or new entities are explicitly mentioned. Full text may reveal additional details or fitted elements in the estimator.

axioms (3)
  • Domain assumption: The sampling is simple random sampling without replacement (SRSWOR).
    Explicitly stated as the sampling design for constructing the estimator.
  • Domain assumption: Bias and mean square error can be approximated to the first order of approximation.
    Used for deriving the properties of the proposed estimator.
  • Domain assumption: Principal components preserve the essential information from the auxiliary variables.
    Core to the methodology for removing multicollinearity while retaining information.
