arXiv: 2605.08002 · v1 · submitted 2026-05-08 · stat.ME · math.ST · stat.TH

Recognition: 2 Lean theorem links

Cellwise and Casewise Robust Multivariate Regression with Inference

Fabio Centofanti, Mia Hubert, Peter J. Rousseeuw

Pith reviewed 2026-05-11 02:44 UTC · model grok-4.3

classification stat.ME · math.ST · stat.TH

keywords robust multivariate regression · cellwise outliers · casewise outliers · bootstrap inference · influence functions · missing data · high-dimensional regression · asymptotic validity

The pith

A new estimator enables robust multivariate regression that handles both whole-observation and single-cell outliers along with missing values.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the cellMR estimator for multivariate linear regression that remains reliable when data contain both casewise outliers affecting entire rows and cellwise outliers in individual entries. It constructs this by first obtaining a cellwise robust covariance matrix and then applying ridge regularization to ensure stability in high dimensions or with missing entries. The authors pair it with cellBoot, a bootstrap method based on indirect inference that produces confidence intervals whose asymptotic validity they prove using influence functions. A sympathetic reader would care because ordinary least squares breaks down under even modest contamination, while real datasets in genomics and elsewhere routinely mix these outlier types with incompleteness. If the claims hold, analysts gain a single procedure that delivers both point estimates and inference without first having to decide which cells or rows to discard.

Core claim

The cellMR estimator simultaneously accommodates casewise and cellwise outliers, missing data, and high dimensionality in multivariate linear regression by building on a cellwise robust covariance estimator and using ridge regularization. The cellBoot procedure, based on indirect inference, provides asymptotically valid confidence intervals robust to both types of contamination, with derived influence functions supporting this.
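
For orientation, the classical procedure that a robust bootstrap is meant to improve on can be sketched as follows. This is a plain pairs bootstrap with percentile intervals around an ordinary least squares slope. It is not cellBoot and does not use indirect inference; the function name `percentile_bootstrap_ci` and all settings are assumptions made for illustration.

```python
import numpy as np

def percentile_bootstrap_ci(x, y, n_boot=500, level=0.95, seed=0):
    """Percentile bootstrap interval for the OLS slope of y on x (classical, non-robust)."""
    rng = np.random.default_rng(seed)
    n = x.shape[0]
    design = np.column_stack([np.ones(n), x])
    slopes = np.empty(n_boot)
    for b in range(n_boot):
        # Resample whole cases (rows) with replacement.
        idx = rng.integers(0, n, size=n)
        coef = np.linalg.lstsq(design[idx], y[idx], rcond=None)[0]
        slopes[b] = coef[1]
    alpha = 1.0 - level
    lower, upper = np.quantile(slopes, [alpha / 2, 1 - alpha / 2])
    return float(lower), float(upper)

# Clean synthetic data with a true slope of 2 (illustrative values only).
rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = 2.0 * x + rng.normal(size=200)
lower, upper = percentile_bootstrap_ci(x, y)
print(lower, upper)
```

Because every resample refits a non-robust estimator on data that may contain outliers, this baseline inherits the sensitivity to contamination that motivates a procedure like cellBoot.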

What carries the argument

The cellwise multivariate regression (cellMR) estimator, which combines a cellwise robust covariance estimator with ridge regularization to produce regression coefficients that resist mixed outlier patterns and missing entries.
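
The general idea of reading regression coefficients off a joint covariance matrix, with a ridge term for stability, can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' exact cellMR formula: the helper `ridge_coefficients_from_cov`, the penalty `lam`, and the use of the classical sample mean and covariance as a stand-in for a cellwise robust estimate are all hypothetical choices.

```python
import numpy as np

def ridge_coefficients_from_cov(mu, sigma, p, lam):
    """Slopes and intercept for regressing the last q variables on the first p.

    mu, sigma: location vector and covariance matrix of the stacked (x, y).
    lam: hypothetical ridge penalty added to the predictor block.
    """
    sigma_xx = sigma[:p, :p]
    sigma_xy = sigma[:p, p:]
    # Solve (Sigma_xx + lam * I) B = Sigma_xy rather than forming an inverse.
    slopes = np.linalg.solve(sigma_xx + lam * np.eye(p), sigma_xy)
    intercept = mu[p:] - slopes.T @ mu[:p]
    return slopes, intercept

rng = np.random.default_rng(0)
n, p, q = 500, 3, 2
true_slopes = np.array([[1.0, -1.0], [0.5, 0.0], [0.0, 2.0]])
x = rng.normal(size=(n, p))
y = x @ true_slopes + 0.1 * rng.normal(size=(n, q))
z = np.hstack([x, y])

# Stand-in: classical mean and covariance, NOT a cellwise robust estimate.
mu_hat = z.mean(axis=0)
sigma_hat = np.cov(z, rowvar=False)
slopes_hat, intercept_hat = ridge_coefficients_from_cov(mu_hat, sigma_hat, p, lam=1e-3)
max_abs_error = float(np.max(np.abs(slopes_hat - true_slopes)))
print(max_abs_error)
```

With `lam=0` and the classical covariance this reduces to ordinary least squares with an intercept, so any robustness has to come from the location and covariance estimates that are plugged in.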

If this is right

The estimator produces stable coefficients even when the number of variables approaches or exceeds the number of observations.
cellBoot confidence intervals remain valid under simultaneous casewise and cellwise contamination.
The procedure works directly on data matrices that contain missing values without requiring separate imputation.
Influence functions quantify the effect of individual contaminated cells or rows on the fitted coefficients.
Real-data examples such as genomics applications show competitive finite-sample accuracy compared with classical methods.
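
For readers unfamiliar with the term, the classical (casewise) influence function of a statistical functional $T$ at a distribution $F$, in the sense of Hampel et al. (1986), is recalled below. The paper's cellwise variant is not reproduced here; the notation $T$, $F$, $\Delta_z$ is the textbook one and may differ from the paper's.

```latex
\mathrm{IF}(z; T, F)
  = \lim_{\varepsilon \downarrow 0}
    \frac{T\bigl((1-\varepsilon)F + \varepsilon \Delta_z\bigr) - T(F)}{\varepsilon},
\qquad \Delta_z = \text{point mass at } z .
```

A bounded influence function means that an infinitesimal amount of contamination placed at any point $z$ can shift the estimate only by a bounded amount.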

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Similar cellwise-robust covariance building blocks could be inserted into other multivariate techniques such as principal-component analysis or canonical correlation.
The framework suggests a path toward robust versions of regularized regression that also deliver valid inference without cross-validation tuning.
In practice this would let analysts keep more observations instead of listwise deletion, potentially increasing power in studies with incomplete records.
Extensions to time-series or spatial data might follow by adapting the cellwise contamination model to respect dependence structure.

Load-bearing premise

The cellwise robust covariance estimator must perform reliably under the paper's contamination model and the indirect-inference bootstrap must be correctly calibrated for the asymptotic validity proofs to go through.

What would settle it

Repeated simulations in which the cellBoot intervals achieve coverage well below the nominal level when 5-10 percent of cells are contaminated and some entries are missing would falsify the asymptotic-validity claim.
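
The shape of such a coverage experiment can be sketched as follows. Since no cellBoot implementation is assumed to be available here, the sketch measures the coverage of a classical OLS interval for a single slope, with and without cellwise contamination of the predictor. The contamination mechanism (replacing a cell by the value 10), the sample sizes, and the function names are assumptions chosen for illustration, and missing entries are not simulated.

```python
import numpy as np

def ols_slope_interval(x, y, z_crit=1.96):
    """Classical normal-approximation interval for the OLS slope of y on x."""
    n = x.shape[0]
    design = np.column_stack([np.ones(n), x])
    coef = np.linalg.lstsq(design, y, rcond=None)[0]
    resid = y - design @ coef
    sigma2 = resid @ resid / (n - 2)
    cov = sigma2 * np.linalg.inv(design.T @ design)
    se = np.sqrt(cov[1, 1])
    return coef[1] - z_crit * se, coef[1] + z_crit * se

def coverage(cell_rate, n=200, n_rep=300, true_slope=2.0, seed=0):
    """Fraction of replications whose interval contains the true slope."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_rep):
        x = rng.normal(size=n)
        y = true_slope * x + rng.normal(size=n)
        # Cellwise contamination: overwrite a random subset of predictor cells.
        bad = rng.random(n) < cell_rate
        x_obs = np.where(bad, 10.0, x)
        lo, hi = ols_slope_interval(x_obs, y)
        hits += int(lo <= true_slope <= hi)
    return hits / n_rep

clean = coverage(0.0)
contaminated = coverage(0.10)
print(clean, contaminated)
```

Swapping `ols_slope_interval` for a robust interval procedure and adding a missingness mask would turn this into the kind of check described above.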

Figures

Figures reproduced from arXiv: 2605.08002 by Fabio Centofanti, Mia Hubert, Peter J. Rousseeuw.

**Figure 2.** A regression outlier map of cellMR. The size of each point is made proportional to $1 - \frac{1}{d}\sum_{j=1}^{d} m_{ij}\, w^{\mathrm{cell}}_{ij}$. A large point therefore indicates a case with many outlying cells in the predictor and/or the response. The casewise outlyingness is visualized by coloring the points according to their casewise total deviation $t_i$ of (6). The points are colored black when $t_i > 1.5\, c_{t,0.99}$, white when $t_i < c_{t,0.9…}$

**Figure 3.** cellMR predictor and residual cellmaps of the 4 labeled cases.

**Figure 4.** The casewise (left) and cellwise (right) influence function of …

**Figure 5.** Average MSE attained by RIDGE, SEST, PENSE, CRM, REGCELL, SHOOT, and cellMR

**Figure 6.** Coverage attained by OLS, FRB, and cellBoot for the 0…

**Figure 7.** trimRMSE

**Figure 8.** cellMR forest plot with level 0.95 bootstrap confidence intervals for the gene-protein data.

**Figure 9.** The function $\rho_{b,c}$ with $b = 1.5$ and $c = 4$ (top left), its derivative $\psi_{b,c}$ (top right), its weight function used in (8) and (10) (bottom left), and the function $z \mapsto \rho(\sqrt{z})$ (bottom right). … the sense that very extreme values receive zero weight in the estimation. This favorable property is not shared by the well-known Huber $\rho$-function (Huber, 1964), which is not suitable in our framework. Moreover, Hampel et…

**Figure 10.** The function $\chi_{b,c}$ with $b = 1.5$ and $c = 4$.

**Figure 11.** Average MSE attained by RIDGE and cellMR in the presence of cellwise outliers, casewise outliers, or both, with 10% of missing cells.

**Figure 12.** Average coverage attained by OLS, FRB, and cellBoot for the 0…

read the original abstract

Multivariate linear regression is a fundamental statistical task, but classical estimators such as ordinary least squares are highly sensitive to outliers. These may occur as casewise outliers that affect entire observations, or as outlying cells, that are individual contaminated entries in the predictor and/or response matrix. Moreover, modern datasets frequently contain missing values and are high-dimensional. To address these challenges we propose the cellwise multivariate regression (cellMR) estimator, a robust regression method that simultaneously accommodates casewise and cellwise outliers, missing data, and high dimensionality. The approach builds on a cellwise robust covariance estimator and uses ridge regularization for numerical stability. We further introduce cellBoot, a novel bootstrap-based inference procedure tailored to the cellMR framework. Relying on indirect inference, cellBoot provides asymptotically valid confidence intervals that are robust to casewise and cellwise contamination. We derive influence functions of the regression estimator and prove the asymptotic validity of the cellBoot confidence intervals. Simulations and a real genomics application illustrate the strong finite-sample performance of the proposed methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper introduces cellMR for robust multivariate regression handling casewise/cellwise outliers plus missing data and high dimensions, with cellBoot for inference, but the asymptotic claims rest on unverified regularity conditions.

The paper's key advance is a robust estimator for multivariate linear regression that copes with both casewise and cellwise outliers, plus missing data and high dimensionality. It pairs this with a bootstrap method called cellBoot for robust inference. They build cellMR on a cellwise robust covariance estimator, adding ridge regularization for stability in high dimensions. cellBoot relies on indirect inference to produce confidence intervals that stay valid under contamination. The authors derive influence functions for the regression estimator and prove the asymptotic validity of those intervals. They back it up with simulations and a genomics application.

This approach handles practical problems that come up in modern data analysis. The combination of robustness types and the inference tool is a step forward from separate methods. The application shows it can work on real problems.

One area that needs checking is the theory. The proofs for asymptotic validity depend on the covariance estimator satisfying certain regularity conditions under the contamination model, including in high-dimensional cases with missing values. The stress-test note points out that these conditions may not be fully verified when p is large relative to n, even with ridge. Without explicit details on how the ridge affects the bounds or the rates, it's possible the bootstrap calibration could fail in some regimes. The abstract states the proofs exist, but the strength depends on how general those conditions are.

Statisticians interested in robust multivariate methods would get the most from this. It targets users facing outliers and missingness in regression settings. The work shows clear thinking on the problem and engages with the literature on cellwise robustness, so it merits a serious referee to evaluate the claims. I recommend putting it through peer review. The practical contributions are there, and a referee can sort out any gaps in the asymptotic arguments.

Referee Report

1 major / 1 minor

Summary. The paper proposes the cellMR estimator for robust multivariate linear regression that simultaneously handles casewise and cellwise outliers, missing data, and high dimensionality by building on a cellwise robust covariance estimator with ridge regularization. It introduces cellBoot, an indirect-inference bootstrap procedure for asymptotically valid confidence intervals robust to contamination. The authors derive influence functions of the regression estimator and prove the asymptotic validity of the cellBoot CIs, with supporting simulations and a real genomics application.

Significance. If the theoretical claims hold, this would be a useful contribution to robust multivariate methods by extending cellwise robust covariance ideas to regression with mixed contamination, missingness, and high dimensions while providing inference. The simulations and genomics application provide concrete evidence of finite-sample behavior and practical utility.

major comments (1)

[theoretical results on asymptotic validity and influence functions] The central claim of deriving influence functions and proving asymptotic validity of cellBoot (as stated in the abstract) requires the underlying cellwise robust covariance estimator to satisfy uniform consistency, bounded eigenvalues, and appropriate rates for contamination fraction and p/n even after ridge stabilization and missing-data handling. The manuscript does not explicitly state or verify these regularity conditions in the theoretical development, which is load-bearing for the bootstrap calibration to remain valid in the p ≫ n regime.
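
The bounded-eigenvalue requirement mentioned here can be illustrated with an elementary fact: for a positive semidefinite matrix S, the eigenvalues of S + λI are those of S shifted up by λ, so the smallest one is at least λ even when p exceeds n. The sketch below uses the classical sample covariance and arbitrary values of `n`, `p`, and `lam`. It only demonstrates this shift and says nothing about whether the paper's regularity conditions hold.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, lam = 20, 50, 0.1  # more variables than observations (illustrative values)
x = rng.normal(size=(n, p))

# The sample covariance has rank at most n - 1 < p, so it is singular.
s = np.cov(x, rowvar=False)
eig_plain = np.linalg.eigvalsh(s)
eig_ridge = np.linalg.eigvalsh(s + lam * np.eye(p))

min_plain = float(eig_plain.min())
min_ridge = float(eig_ridge.min())
print(min_plain, min_ridge)
```

The ridge term guarantees invertibility, but how large `lam` must be relative to n, p, and the contamination level for the asymptotic arguments to go through is exactly the question the referee raises.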

minor comments (1)

The abstract refers to 'strong finite-sample performance' without specifying the exact performance metrics (e.g., bias, coverage rates) or contamination levels used in the simulations; adding this detail would improve clarity.

Simulated Author's Rebuttal

1 response · 0 unresolved

We appreciate the referee's thorough review and valuable feedback on our manuscript. We address the major comment regarding the theoretical results below. We believe the revisions will clarify the assumptions and strengthen the presentation of our theoretical contributions.

Referee: [theoretical results on asymptotic validity and influence functions] The central claim of deriving influence functions and proving asymptotic validity of cellBoot (as stated in the abstract) requires the underlying cellwise robust covariance estimator to satisfy uniform consistency, bounded eigenvalues, and appropriate rates for contamination fraction and p/n even after ridge stabilization and missing-data handling. The manuscript does not explicitly state or verify these regularity conditions in the theoretical development, which is load-bearing for the bootstrap calibration to remain valid in the p ≫ n regime.

Authors: We thank the referee for highlighting this important point. The regularity conditions on the cellwise robust covariance estimator are indeed crucial for the validity of the influence function derivations and the asymptotic results for cellBoot, particularly in high-dimensional settings with ridge regularization and missing data. While some of these conditions are implicitly assumed through the properties of the base estimator (as referenced in the literature on cellwise robust methods), we agree that they should be stated explicitly to ensure the theoretical development is self-contained and transparent. In the revised manuscript, we will introduce a new subsection that explicitly lists the required regularity conditions, including uniform consistency of the covariance estimator, bounded eigenvalues (accounting for ridge stabilization), and the allowable rates for the contamination fraction and p/n ratio. We will also discuss how these are maintained under the missing-data handling procedure. This addition will not change the main results but will provide the necessary foundation for the claims.

revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper introduces a new cellMR estimator built on an existing cellwise robust covariance approach plus ridge regularization, along with a new cellBoot procedure using indirect inference. It derives influence functions and proves asymptotic validity of the confidence intervals. No equations or steps in the abstract or described structure reduce the central claims to fitted parameters by construction, self-definitional loops, or load-bearing self-citations that collapse the result. The proofs rely on regularity conditions for the covariance estimator under contamination, but these are external assumptions rather than internal reductions. The derivation remains self-contained with independent content from the new methods and proofs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on standard regularity conditions for robust estimators and asymptotic theory plus the specific construction of the cellwise covariance estimator; no free parameters or invented physical entities are mentioned.

axioms (1)

domain assumption: Standard regularity conditions for consistency and asymptotic normality of robust M-estimators.
Invoked implicitly when deriving influence functions and proving asymptotic validity of cellBoot.

pith-pipeline@v0.9.0 · 5480 in / 1254 out tokens · 51148 ms · 2026-05-11T02:44:07.719359+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear
We derive influence functions of the regression estimator and prove the asymptotic validity of the cellBoot confidence intervals.
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear
cellMR builds on a cellwise robust covariance estimator and uses ridge regularization

Reference graph

Works this paper leans on

38 extracted references · 38 canonical work pages

[1] Agostinelli, C., A. Leung, V. J. Yohai, and R. H. Zamar (2015). Robust estimation of multivariate location and scatter in the presence of cellwise and casewise contamination. Test 24, 441–461.
[2] Alfons, A., C. Croux, and S. Gelper (2013). Sparse least trimmed squares regression for analyzing high-dimensional large data sets. The Annals of Applied Statistics 7(1), 226–248.
[3] Alqallaf, F., S. Van Aelst, V. J. Yohai, and R. H. Zamar (2009). Propagation of outliers in multivariate data. The Annals of Statistics 37, 311–331.
[4] Amado, C. and A. M. Pires (2004). Robust bootstrap with non random weights based on the influence function. Communications in Statistics-Simulation and Computation 33(2), 377–396.
[5] Bickel, P. J. and D. A. Freedman (1981). Some asymptotic theory for the bootstrap. The Annals of Statistics 9(6), 1196–1217.
[6] Bottmer, L., C. Croux, and I. Wilms (2022). Sparse regression for large data sets with outliers. European Journal of Operational Research 297(2), 782–794.
[7] Centofanti, F., M. Hubert, and P. J. Rousseeuw (2025). Cellwise and Casewise Robust Covariance in High Dimensions. arXiv preprint arXiv:2505.19925.
[8] Centofanti, F., M. Hubert, and P. J. Rousseeuw (2026). Robust Principal Components by Casewise and Cellwise Weighting. Technometrics, to appear. https://doi.org/10.1080/00401706.2026.2643216
[9] Cohen Freue, G. V., D. Kepplinger, M. Salibián-Barrera, and E. Smucler (2019). Robust elastic net estimators for variable selection and identification of proteomic biomarkers. The Annals of Applied Statistics 13(4), 2065–2090.
[10] Efron, B. and R. J. Tibshirani (1994). An Introduction to the Bootstrap. CRC Press.
[11] Filzmoser, P., S. Höppner, I. Ortner, S. Serneels, and T. Verdonck (2020). Cellwise robust M regression. Computational Statistics & Data Analysis 147, 106944.
[12] Filzmoser, P. and K. Nordhausen (2021). Robust linear regression for high-dimensional data: An overview. Wiley Interdisciplinary Reviews: Computational Statistics 13(4), e1524.
[13] Gourieroux, C., A. Monfort, and E. Renault (1993). Indirect inference. Journal of Applied Econometrics 8(S1), S85–S118.
[14] Guerrier, S., E. Dupuis-Lozeron, Y. Ma, and M.-P. Victoria-Feser (2019). Simulation-based bias correction methods for complex models. Journal of the American Statistical Association 114, 146–157.
[15] Hampel, F. R., E. M. Ronchetti, and P. J. Rousseeuw (1981). The Change-of-Variance Curve and Optimal Redescending M-Estimators. Journal of the American Statistical Association 76, 643–648.
[16] Hampel, F. R., E. M. Ronchetti, P. J. Rousseeuw, and W. A. Stahel (1986). Robust Statistics: the Approach based on Influence Functions. Wiley.
[17] Hastie, T., R. Tibshirani, and J. Friedman (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Series in Statistics. Springer.
[18] Huber, P. J. (1964). Robust estimation of a location parameter. The Annals of Mathematical Statistics 35(1), 73–101.
[19] Huber, P. J. (1981). Robust Statistics. John Wiley & Sons.
[20] Hubert, M., P. J. Rousseeuw, and T. Verdonck (2012). A deterministic algorithm for robust location and scatter. Journal of Computational and Graphical Statistics 21(3), 618–637.
[21] Kosorok, M. R. (2008). Introduction to Empirical Processes and Semiparametric Inference. Springer.
[22] Leung, A., H. Zhang, and R. Zamar (2016). Robust regression estimation and inference in the presence of cellwise and casewise contamination. Computational Statistics & Data Analysis 99, 1–11.
[23] Little, R. J. (1992). Regression with missing x's: a review. Journal of the American Statistical Association 87(420), 1227–1237.
[24] Maronna, R. A. (2011). Robust ridge regression for high-dimensional data. Technometrics 53(1), 44–53.
[25] Maronna, R. A., R. D. Martin, V. J. Yohai, and M. Salibián-Barrera (2019). Robust Statistics: Theory and Methods (with R). John Wiley & Sons.
[26] Newey, W. K. and D. McFadden (1994). Large sample estimation and hypothesis testing. Handbook of Econometrics 4, 2111–2245.
[27] Öllerer, V. and C. Croux (2015). Robust high-dimensional precision matrix estimation. In Modern Nonparametric, Robust and Multivariate Methods, pp. 325–350. Springer.
[28] Raymaekers, J. and P. J. Rousseeuw (2021). Fast robust correlation for high-dimensional data. Technometrics 63, 184–198.
[29] Raymaekers, J. and P. J. Rousseeuw (2026). Challenges of cellwise outliers. Econometrics and Statistics 38, 6–25. https://doi.org/10.1016/j.ecosta.2024.02.002
[30] Rousseeuw, P. J. (1984). Least median of squares regression. Journal of the American Statistical Association 79(388), 871–880.
[31] Rousseeuw, P. J. and A. Leroy (1987). Robust Regression and Outlier Detection. Wiley.
[32] Rousseeuw, P. J. and W. Van den Bossche (2018). Detecting deviating data cells. Technometrics 60(2), 135–145.
[33] Salibián-Barrera, M., S. Van Aelst, and G. Willems (2008). Fast and robust bootstrap. Statistical Methods and Applications 17(1), 41–71.
[34] Salibian-Barrera, M. and R. H. Zamar (2002). Bootstrapping robust estimates of regression. The Annals of Statistics 30, 556–582.
[35] Shankavaram, U. T., W. C. Reinhold, S. Nishizuka, S. Major, D. Morita, K. K. Chary, M. A. Reimers, U. Scherf, A. Kahn, D. Dolginow, et al. (2007). Transcript and protein expression profiles of the NCI-60 cancer cell panel: an integromic microarray study. Molecular Cancer Therapeutics 6(3), 820–832.
[36] Su, P., G. Tarr, S. Muller, and S. Wang (2024). CR-Lasso: Robust cellwise regularized sparse regression. Computational Statistics & Data Analysis 197(107971), 1–14.
[37] Van Aelst, S. and G. Willems (2005). Multivariate regression S-estimators for robust estimation and inference. Statistica Sinica 15, 981–1001.
[38] Van der Vaart, A. W. (2000). Asymptotic Statistics. Cambridge University Press.