Principal Component Based Estimation of Finite Population Mean under Multicollinearity
Pith reviewed 2026-05-07 15:42 UTC · model grok-4.3
The pith
A principal component analysis estimator for the finite population mean removes multicollinearity from auxiliary variables and yields lower mean squared error than standard methods.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that an estimator constructed from the principal components of two multicollinear auxiliary variables achieves lower mean squared error and higher relative efficiency than conventional estimators for the finite population mean. The bias and MSE expressions are obtained up to the first order of approximation, and numerical studies confirm the advantage whenever the auxiliary variables exhibit multicollinearity as measured by variance inflation factors, condition indices, and eigenvalues.
What carries the argument
The PCA-based estimator formed by regressing the study variable on the orthogonal principal components extracted from the auxiliary variables, thereby eliminating multicollinearity while retaining auxiliary information.
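The construction described above can be sketched in a few lines. This is a minimal illustration, assuming a regression-type form (sample mean of y plus a slope-weighted correction toward the known population means of the auxiliaries, expressed in principal-component coordinates). The helper name `pca_regression_mean`, the parameter `k`, and the synthetic population are hypothetical; the paper's exact estimator and constants may differ.

```python
import numpy as np

def pca_regression_mean(y_s, X_s, X_bar_pop, k=1):
    """Regression-type estimate of the population mean of y built on the
    first k sample principal components of the auxiliaries.
    Hypothetical helper, not the paper's exact estimator."""
    y_s = np.asarray(y_s, dtype=float)
    X_s = np.asarray(X_s, dtype=float)
    x_bar = X_s.mean(axis=0)
    Xc = X_s - x_bar                              # centre at the sample means
    # Eigen-decomposition of the sample covariance matrix of the auxiliaries
    eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    order = np.argsort(eigvals)[::-1]             # largest eigenvalue first
    V = eigvecs[:, order[:k]]                     # loadings of retained PCs
    Z = Xc @ V                                    # PC scores (sample mean zero)
    # Least-squares slopes of centred y on the orthogonal scores
    b = np.linalg.lstsq(Z, y_s - y_s.mean(), rcond=None)[0]
    # Known population means of the auxiliaries, in PC coordinates
    Z_bar_pop = (np.asarray(X_bar_pop, dtype=float) - x_bar) @ V
    return y_s.mean() + b @ Z_bar_pop

# Synthetic population for illustration only (assumed, not the paper's data)
rng = np.random.default_rng(0)
N, n = 2000, 100
x1 = rng.normal(50, 10, N)
x2 = 0.95 * x1 + rng.normal(0, 2, N)              # strongly collinear with x1
y = 5 + 0.6 * x1 + 0.4 * x2 + rng.normal(0, 5, N)
X = np.column_stack([x1, x2])

idx = rng.choice(N, size=n, replace=False)        # one SRSWOR sample
est_k1 = pca_regression_mean(y[idx], X[idx], X.mean(axis=0), k=1)
est_k2 = pca_regression_mean(y[idx], X[idx], X.mean(axis=0), k=2)
print("true mean:", y.mean(), "k=1:", est_k1, "k=2:", est_k2)
```

One design point worth noting: with two auxiliaries and both components retained (`k=2`), the scores span the same space as the original variables, so this sketch reproduces the ordinary two-variable regression estimator. Any difference from that estimator can only come from dropping a component or from a different functional form.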
If this is right
- The estimator remains stable and efficient even when the auxiliary variables are highly intercorrelated.
- First-order approximations supply usable expressions for bias and mean squared error that match simulation behavior for moderate sample sizes.
- The method produces higher percentage relative efficiency than ratio, product, and regression estimators specifically under multicollinearity.
- Variance inflation factors and condition indices can be used beforehand to detect the multicollinearity that the PCA transformation is designed to address.
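The diagnostics named in the last point can be computed directly from the correlation matrix of the auxiliaries. The sketch below assumes one common convention: variance inflation factors as the diagonal of the inverse correlation matrix, and condition indices as the square root of the largest eigenvalue divided by each eigenvalue. The function name `collinearity_diagnostics` and the synthetic data are hypothetical, and the paper may use a different scaling for the condition indices.

```python
import numpy as np

def collinearity_diagnostics(X):
    """Return (VIFs, eigenvalues, condition indices) computed from the
    correlation matrix of the columns of X. Hypothetical helper."""
    X = np.asarray(X, dtype=float)
    R = np.corrcoef(X, rowvar=False)
    vif = np.diag(np.linalg.inv(R))                 # VIF_j = [R^{-1}]_{jj}
    eigvals = np.sort(np.linalg.eigvalsh(R))[::-1]  # descending order
    cond_idx = np.sqrt(eigvals[0] / eigvals)        # first entry is 1
    return vif, eigvals, cond_idx

# Two strongly correlated auxiliaries (synthetic, for illustration only)
rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = 0.95 * x1 + 0.2 * rng.normal(size=500)
vif, eigvals, cond_idx = collinearity_diagnostics(np.column_stack([x1, x2]))
r = np.corrcoef(x1, x2)[0, 1]
print("r:", r)
print("VIF:", vif)
print("eigenvalues:", eigvals)
print("condition indices:", cond_idx)
```

For exactly two auxiliaries with correlation r, both VIFs equal 1/(1 - r^2) and the eigenvalues are 1 + |r| and 1 - |r|, so all three diagnostics are functions of r alone in this setting.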
Where Pith is reading between the lines
- The same transformation could be applied when more than two auxiliary variables are available by retaining the leading principal components that explain most variance.
- The approach suggests a general route for dimension reduction in survey sampling whenever high-dimensional auxiliary data exhibit collinearity.
- Performance under complex sampling designs such as stratified or cluster sampling remains unexamined and could be tested next.
Load-bearing premise
The principal components must capture enough of the original auxiliary information that the resulting estimator actually improves efficiency rather than discarding useful signal.
What would settle it
A Monte Carlo experiment with two auxiliary variables having correlation above 0.9, in which the proposed estimator records higher mean squared error than the ordinary regression estimator across repeated samples, would falsify the performance advantage.
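A falsification experiment of this kind could be set up roughly as follows. This is a sketch under stated assumptions: the population model, sample size, number of replicates, the one-component PCA estimator, and the use of the sample mean as the PRE baseline are all illustrative choices, and the function names are hypothetical. It shows the shape of the comparison and does not reproduce the paper's simulation.

```python
import numpy as np

def reg_estimator(y_s, X_s, X_bar):
    """Regression estimator: y_bar + b'(X_bar - x_bar), b from least squares."""
    x_bar = X_s.mean(axis=0)
    b = np.linalg.lstsq(X_s - x_bar, y_s - y_s.mean(), rcond=None)[0]
    return y_s.mean() + b @ (X_bar - x_bar)

def pca_estimator(y_s, X_s, X_bar, k=1):
    """Same regression form, applied to the first k sample PC scores."""
    x_bar = X_s.mean(axis=0)
    Xc = X_s - x_bar
    eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    V = eigvecs[:, np.argsort(eigvals)[::-1][:k]]
    return reg_estimator(y_s, Xc @ V, (X_bar - x_bar) @ V)

# Fixed finite population with highly correlated auxiliaries (assumed model)
rng = np.random.default_rng(1)
N, n, reps = 1000, 30, 2000
x1 = rng.normal(50, 10, N)
x2 = x1 + rng.normal(0, 3, N)
y = 5 + 0.6 * x1 + 0.4 * x2 + rng.normal(0, 5, N)
X = np.column_stack([x1, x2])
X_bar, Y_bar = X.mean(axis=0), y.mean()
rho = np.corrcoef(x1, x2)[0, 1]

sq_err = {"mean": [], "regression": [], "pca_k1": []}
for _ in range(reps):
    idx = rng.choice(N, size=n, replace=False)   # SRSWOR; PCs re-estimated
    ys, Xs = y[idx], X[idx]
    sq_err["mean"].append((ys.mean() - Y_bar) ** 2)
    sq_err["regression"].append((reg_estimator(ys, Xs, X_bar) - Y_bar) ** 2)
    sq_err["pca_k1"].append((pca_estimator(ys, Xs, X_bar, k=1) - Y_bar) ** 2)

mse = {name: float(np.mean(v)) for name, v in sq_err.items()}
# PRE relative to the sample mean (baseline choice is an assumption)
pre = {name: 100.0 * mse["mean"] / m for name, m in mse.items()}
print("population correlation of auxiliaries:", rho)
for name in mse:
    print(name, "MSE:", mse[name], "PRE:", pre[name])
```

The claim would be in trouble if, across a range of such populations with correlation above 0.9, the `pca_k1` row consistently showed a higher MSE than the `regression` row.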
Original abstract
Auxiliary information is frequently utilized in survey sampling to improve the efficiency of estimators of the finite population mean. However, the simultaneous use of multiple auxiliary variables often induces multicollinearity, which adversely affects the stability and performance of conventional estimators. To address this issue, the present study proposes a principal component analysis (PCA) based estimation approach for the finite population mean in the presence of multicollinearity between two auxiliary variables. The proposed methodology transforms the correlated auxiliary variables into a set of orthogonal principal components, thereby removing the effect of multicollinearity while preserving the essential information contained in the auxiliary variables. An efficient estimator is then constructed using these components under simple random sampling without replacement. The bias and mean square error (MSE) of the proposed estimator are derived up to the first order of approximation. The performance of the proposed estimator is evaluated through both empirical and simulation studies under varying correlation structures. Moreover, the presence of multicollinearity is evaluated using variance inflation factors, condition indices, and eigenvalues. The results from empirical and simulation studies demonstrate that the proposed PCA-based estimator outperforms several conventional estimators in terms of MSE and percentage relative efficiency (PRE) when multicollinearity exists, ensuring robust and efficient estimation of the population mean.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes a PCA-based estimator for the finite population mean under simple random sampling without replacement when two auxiliary variables are multicollinear. It transforms the auxiliaries into orthogonal principal components, constructs a regression-type estimator from the retained components, derives approximate bias and MSE expressions to the first order of approximation, and reports that the new estimator attains lower MSE and higher percentage relative efficiency than conventional ratio, regression, and product estimators in both an empirical population and Monte Carlo simulations across varying correlation structures. Multicollinearity is diagnosed via variance inflation factors (VIF), condition indices, and eigenvalues.
Significance. If the first-order MSE derivation is shown to be valid after accounting for the sampling variability of the estimated principal components, the work would supply a practical, easily implemented tool for survey practitioners facing multicollinear auxiliaries. The combination of explicit analytic approximations with both empirical and simulation evidence is a strength; however, the central performance claims rest on the accuracy of those approximations and the generality of the simulation design.
major comments (3)
- [§4] Derivation of bias and MSE: The first-order Taylor expansion for the MSE of the proposed estimator conditions on the sample auxiliary matrix X and treats the eigenvectors (loadings) as fixed. This omits the additional variance and covariance terms that arise because the principal components themselves are estimated from the same sample; the resulting MSE expression therefore cannot be directly compared with the MSEs of conventional estimators that do not involve data-driven orthogonalization.
- [Table 2] Simulation results, high-correlation regime: The reported MSE advantage of the PCA estimator is largest precisely when the two auxiliaries are most collinear, yet the simulation design fixes the population correlation matrix and draws only the y-values; it does not vary the sampling variability of the estimated PCs across replicates. Consequently the empirical superiority may be an artifact of the chosen correlation regimes rather than a general property.
- [§5.1] Empirical study: The single real population is analyzed with only two auxiliaries; no sensitivity check is provided for the number of retained principal components or for the effect of replacing the sample eigenvectors with population eigenvectors. This leaves open whether the reported PRE gains are robust to the choice of truncation rule.
minor comments (2)
- [§3] Notation: The symbol for the estimated principal-component scores is introduced without an explicit definition linking it to the sample covariance matrix; a short display equation would remove ambiguity.
- [Figure 1] Eigenvalue plot: The vertical axis label is missing units and the legend does not distinguish sample from population eigenvalues.
Simulated Authors' Rebuttal
We thank the referee for the constructive and detailed comments on our manuscript. We address each major comment point by point below, providing our responses and indicating planned revisions where appropriate.
- Referee: §4 (Derivation of bias and MSE): The first-order Taylor expansion for the MSE of the proposed estimator conditions on the sample auxiliary matrix X and treats the eigenvectors (loadings) as fixed. This omits the additional variance and covariance terms that arise because the principal components themselves are estimated from the same sample; the resulting MSE expression therefore cannot be directly compared with the MSEs of conventional estimators that do not involve data-driven orthogonalization.
  Authors: We acknowledge that the first-order approximation conditions on the realized sample auxiliary matrix X and treats the estimated eigenvectors as fixed. This approach follows standard practice in survey sampling for deriving approximate MSE expressions when sample-based transformations are involved, as the additional variability from eigenvector estimation enters at higher order for large samples. We agree that a fully unconditional MSE would include extra terms, but the conditional form remains useful for comparison under the same sampling design. We will revise §4 to explicitly note the conditional nature of the approximation and discuss its implications relative to conventional estimators.
  Revision: partial
- Referee: Table 2 (simulation results, high-correlation regime): The reported MSE advantage of the PCA estimator is largest precisely when the two auxiliaries are most collinear, yet the simulation design fixes the population correlation matrix and draws only the y-values; it does not vary the sampling variability of the estimated PCs across replicates. Consequently the empirical superiority may be an artifact of the chosen correlation regimes rather than a general property.
  Authors: In the Monte Carlo design, a fixed finite population is generated with the specified correlation structure among auxiliaries; each replicate then draws an SRSWOR sample, so both the sample auxiliary matrix and the study variable values vary across replicates. The principal components are therefore re-estimated from the sample auxiliaries in every replicate, incorporating their sampling variability. The y-values are generated conditionally on the auxiliaries to control the correlation with the study variable. We will revise the simulation section to clarify this design explicitly and confirm that the reported gains reflect the full sampling process rather than fixed PCs.
  Revision: yes
- Referee: §5.1 (empirical study): The single real population is analyzed with only two auxiliaries; no sensitivity check is provided for the number of retained principal components or for the effect of replacing the sample eigenvectors with population eigenvectors. This leaves open whether the reported PRE gains are robust to the choice of truncation rule.
  Authors: With only two auxiliary variables, at most two components can be retained (one or two, selected via eigenvalues or variance explained). We agree that a sensitivity analysis would strengthen the empirical results. We will revise §5.1 to include checks for alternative truncation rules and, where possible, compare sample-based eigenvectors with population eigenvectors to assess robustness of the PRE gains.
  Revision: yes
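The disagreement over fixed versus estimated loadings, and the requested comparison of sample and population eigenvectors, can both be probed numerically. The sketch below is one assumed way to do so: it re-estimates the first loading vector in every SRSWOR replicate, records its spread, and compares the MSE of a one-component estimator built on sample loadings with the same estimator built on fixed population loadings. The population model and the names `first_loading` and `pc1_estimator` are hypothetical and not taken from the manuscript.

```python
import numpy as np

def first_loading(X):
    """Unit-norm loading vector of the first principal component, with the
    sign fixed so that its first entry is non-negative."""
    eigvals, eigvecs = np.linalg.eigh(np.cov(X, rowvar=False))
    v = eigvecs[:, np.argmax(eigvals)]
    return v if v[0] >= 0 else -v

def pc1_estimator(y_s, X_s, X_bar, v):
    """Regression-type estimator on the single score z = (x - x_bar)'v."""
    x_bar = X_s.mean(axis=0)
    z = (X_s - x_bar) @ v
    b = (z @ (y_s - y_s.mean())) / (z @ z)       # least-squares slope on z
    return y_s.mean() + b * ((X_bar - x_bar) @ v)

# Synthetic finite population (assumed model, for illustration only)
rng = np.random.default_rng(2)
N, n, reps = 1000, 30, 2000
x1 = rng.normal(50, 10, N)
x2 = x1 + rng.normal(0, 3, N)
y = 5 + 0.6 * x1 + 0.4 * x2 + rng.normal(0, 5, N)
X = np.column_stack([x1, x2])
X_bar, Y_bar = X.mean(axis=0), y.mean()
v_pop = first_loading(X)                          # population loadings

loadings = np.empty((reps, 2))
err_sample = np.empty(reps)
err_pop = np.empty(reps)
for r in range(reps):
    idx = rng.choice(N, size=n, replace=False)    # SRSWOR sample
    v_s = first_loading(X[idx])                   # loadings from this sample
    loadings[r] = v_s
    err_sample[r] = pc1_estimator(y[idx], X[idx], X_bar, v_s) - Y_bar
    err_pop[r] = pc1_estimator(y[idx], X[idx], X_bar, v_pop) - Y_bar

print("population loading:", v_pop)
print("SD of sample loadings:", loadings.std(axis=0))
print("MSE with sample loadings:", np.mean(err_sample ** 2))
print("MSE with population loadings:", np.mean(err_pop ** 2))
```

A gap between the two MSE figures that shrinks as `n` grows would be consistent with the authors' statement that eigenvector variability enters at higher order; a persistent gap would support the referee's concern.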
Circularity Check
No circularity: estimator definition, first-order MSE, and external validation are independent
The paper defines the PCA-based estimator explicitly via orthogonal transformation of the sample auxiliary matrix, derives bias/MSE approximations under standard first-order Taylor expansion for survey estimators, and evaluates performance via separate empirical data and Monte Carlo simulations under controlled correlation structures. No equation reduces a claimed result to a fitted parameter or self-citation by construction; the outperformance claim rests on numerical comparisons rather than algebraic identity. The derivation chain is self-contained against the stated assumptions and does not invoke load-bearing self-citations or rename known results.
Axiom & Free-Parameter Ledger
axioms (3)
- Domain assumption: The sampling is simple random sampling without replacement (SRSWOR).
- Domain assumption: Bias and mean square error can be approximated to the first order of approximation.
- Domain assumption: Principal components preserve the essential information from the auxiliary variables.