Choosing the Right Regularizer for Applied ML: Simulation Benchmarks of Popular Scikit-learn Regularization Frameworks
Pith reviewed 2026-05-13 18:54 UTC · model grok-4.3
The pith
When samples greatly outnumber features, Ridge, Lasso, and ElasticNet deliver similar prediction accuracy, yet Lasso recall collapses under multicollinearity while ElasticNet remains stable.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Across 134,400 simulations grounded in eight production-grade models, Ridge, Lasso, and ElasticNet achieve comparable prediction accuracy once n/p >= 78. Lasso recall, by contrast, drops sharply to 0.18 at high condition numbers and low SNR, while ElasticNet maintains recall of 0.93. The authors therefore recommend against Lasso or Post-Lasso OLS when multicollinearity is strong and samples are limited, and supply an objective guide for selecting among the four frameworks based on measurable feature-space properties.
What carries the argument
A seven-dimensional simulation manifold of 134,400 runs that systematically varies condition number, SNR, sample-to-feature ratio, and related factors drawn from real ML models to benchmark Ridge, Lasso, ElasticNet, and Post-Lasso OLS.
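To make the benchmark concrete, here is a minimal sketch of one cell of such a grid: an equicorrelated (hence ill-conditioned) design, a sparse coefficient vector, low SNR, and support recall for cross-validated Lasso versus ElasticNet. The dimensions, correlation strength, and SNR below are illustrative placeholders, not values taken from the paper's manifold.

```python
import numpy as np
from sklearn.linear_model import LassoCV, ElasticNetCV

rng = np.random.default_rng(0)
n, p, k = 200, 50, 5            # samples, features, true nonzeros (illustrative)
rho = 0.95                      # strong equicorrelation -> high condition number

cov = np.full((p, p), rho) + (1.0 - rho) * np.eye(p)
X = rng.multivariate_normal(np.zeros(p), cov, size=n)

beta = np.zeros(p)
beta[:k] = 1.0                  # sparse ground-truth coefficients
signal = X @ beta
noise = rng.normal(scale=signal.std() / np.sqrt(0.5), size=n)   # SNR ~ 0.5 (low)
y = signal + noise

def support_recall(model):
    """Fraction of the k true nonzero coefficients that the fitted model keeps."""
    kept = np.abs(model.fit(X, y).coef_) > 1e-8
    return kept[:k].mean()

print("Lasso recall:     ", support_recall(LassoCV(cv=5, random_state=0)))
print("ElasticNet recall:", support_recall(ElasticNetCV(l1_ratio=0.5, cv=5, random_state=0)))
```

Sweeping the correlation strength, SNR, and n/p over many such cells, with repeated replications per cell, is the shape of what the 134,400-run manifold does at scale.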
If this is right
- With n/p at least 78, any of Ridge, Lasso or ElasticNet can be used for prediction without large accuracy differences.
- Lasso should be avoided for feature selection whenever features are strongly correlated, because its recall collapses under those conditions.
- ElasticNet offers more reliable variable recovery than pure Lasso when multicollinearity and noise are both present.
- Post-Lasso OLS inherits the same fragility as Lasso and is likewise unsuitable at high condition numbers with modest sample sizes.
- A decision procedure based on observable quantities such as kappa and n/p can replace ad-hoc choice among the four frameworks; a toy version is sketched after this list.
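A toy rendering of what such a decision procedure might look like, built only from the headline findings summarized above. The n/p >= 78 cutoff echoes the paper's figure; the kappa threshold is a hypothetical placeholder, not a value published in the paper.

```python
def suggest_regularizer(n_samples, n_features, condition_number,
                        need_feature_selection, kappa_high=30.0):
    """Toy decision rule. kappa_high is an illustrative placeholder threshold."""
    ratio = n_samples / n_features
    if not need_feature_selection and ratio >= 78:
        # Prediction-only with ample data: the three penalties are reported
        # to be nearly interchangeable.
        return "ridge, lasso, or elasticnet (interchangeable for prediction)"
    if need_feature_selection and condition_number >= kappa_high:
        # Feature recovery under multicollinearity: Lasso and Post-Lasso OLS
        # recall is reported to be fragile here.
        return "elasticnet (avoid lasso / post-lasso ols)"
    return "no strong recommendation from the headline results alone"
```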
Where Pith is reading between the lines
- The observed robustness gap suggests that mixed L1-L2 penalties provide a practical hedge against the correlation structures common in tabular production data.
- Extending the same simulation grid to streaming or non-stationary settings would test whether the reported interchangeability for prediction still holds over time.
- The fragility finding supplies a concrete reason to prefer ElasticNet or Ridge when downstream interpretability depends on stable feature recovery.
- Running the benchmarks with alternative solvers or cross-validation schemes inside scikit-learn could reveal whether implementation details modulate the reported recall gap, as in the sketch below.
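On that last point, a small sketch of the kind of robustness check meant, assuming a scikit-learn workflow: vary the cross-validation splitter and the coordinate-descent update order and see whether Lasso's recall of a known sparse signal moves. The design recipe and all values are illustrative.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.model_selection import KFold, RepeatedKFold

# A correlated design with a known sparse signal (same recipe as the earlier sketch).
rng = np.random.default_rng(1)
n, p, k = 150, 30, 5
cov = np.full((p, p), 0.95) + 0.05 * np.eye(p)
X = rng.multivariate_normal(np.zeros(p), cov, size=n)
beta = np.zeros(p)
beta[:k] = 1.0
y = X @ beta + rng.normal(scale=2.0, size=n)

# Vary the CV splitter and the coordinate-descent update order, and check
# whether recall of the k true features changes with these choices.
for cv in (KFold(5, shuffle=True, random_state=0),
           RepeatedKFold(n_splits=5, n_repeats=3, random_state=0)):
    for selection in ("cyclic", "random"):
        model = LassoCV(cv=cv, selection=selection, random_state=0).fit(X, y)
        recall = (np.abs(model.coef_[:k]) > 1e-8).mean()
        print(f"{type(cv).__name__:>14s}  {selection:>6s}  recall: {recall:.2f}")
```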
Load-bearing premise
The seven-dimensional manifold of simulation parameters drawn from eight production models is sufficient to represent the factors that govern regularization behavior in actual applied work.
What would settle it
Replicate the high-kappa low-SNR slice of the simulation grid on a real dataset whose measured condition number and signal-to-noise ratio fall in the same regime and check whether Lasso recall indeed falls near 0.18.
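A minimal sketch of the first step of such a replication: measure where a candidate dataset actually sits on the (kappa, SNR, n/p) grid before comparing recall. The diabetes dataset is used only as a stand-in, and the SNR estimate from an in-sample OLS fit is deliberately crude.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Locate a real dataset on the (kappa, SNR, n/p) grid.
X, y = load_diabetes(return_X_y=True)
Xs = StandardScaler().fit_transform(X)

kappa = np.linalg.cond(Xs)           # condition number of the standardized design

# Crude SNR estimate from an in-sample OLS fit: var(fitted) / var(residual).
fitted = LinearRegression().fit(Xs, y).predict(Xs)
resid = y - fitted
snr = fitted.var() / resid.var()

print(f"n/p = {Xs.shape[0] / Xs.shape[1]:.1f}, kappa = {kappa:.1f}, SNR ~ {snr:.2f}")
```

Only if the measured kappa is high and the estimated SNR low does the dataset fall into the slice where the 0.18 recall figure is predicted to apply.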
read the original abstract
This study surveys the historical development of regularization, tracing its evolution from stepwise regression in the 1960s to recent advancements in formal error control, structured penalties for non-independent features, Bayesian methods, and l0-based regularization (among other techniques). We empirically evaluate the performance of four canonical frameworks -- Ridge, Lasso, ElasticNet, and Post-Lasso OLS -- across 134,400 simulations spanning a 7-dimensional manifold grounded in eight production-grade machine learning models. Our findings demonstrate that for prediction accuracy when the sample-to-feature ratio is sufficient (n/p >= 78), Ridge, Lasso, and ElasticNet are nearly interchangeable. However, we find that Lasso recall is highly fragile under multicollinearity; at high condition numbers (kappa) and low SNR, Lasso recall collapses to 0.18 while ElasticNet maintains 0.93. Consequently, we advise practitioners against using Lasso or Post-Lasso OLS at high kappa with small sample sizes. The analysis concludes with an objective-driven decision guide to assist machine learning engineers in selecting the optimal scikit-learn-supported framework based on observable feature space attributes.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript surveys the historical development of regularization techniques and presents findings from an extensive simulation study involving 134,400 runs across a 7-dimensional manifold derived from eight production-grade machine learning models. It evaluates four regularization frameworks—Ridge, Lasso, ElasticNet, and Post-Lasso OLS—concluding that for prediction accuracy with sufficient sample-to-feature ratios (n/p >= 78), Ridge, Lasso, and ElasticNet are nearly interchangeable. However, Lasso's recall is shown to be highly fragile under multicollinearity, collapsing to 0.18 at high condition numbers and low SNR, while ElasticNet maintains 0.93. The paper advises against using Lasso or Post-Lasso OLS in high kappa, small sample scenarios and provides an objective-driven decision guide based on observable feature space attributes.
Significance. Should the simulation results generalize, this study delivers valuable empirical benchmarks for applied ML practitioners using scikit-learn, clarifying when popular regularizers can be used interchangeably and identifying specific conditions where Lasso fails in feature recovery. The large simulation scale supports the reported performance contrasts and the conditioned decision rule, potentially aiding better model selection in high-dimensional data settings.
major comments (1)
- [Simulation Setup] The claim that the 7-dimensional manifold sufficiently captures key factors determining regularization performance relies on its grounding in eight production-grade models. The manuscript should provide more explicit details on how the parameter ranges and feature-generation procedure were selected from these models, as this is load-bearing for the generalizability of the advice on avoiding Lasso at high kappa.
minor comments (1)
- [Results] The exact replication count and how the 134,400 simulations are distributed across the parameter space could be summarized in a table for clarity.
Simulated Author's Rebuttal
We thank the referee for their positive assessment of the work and recommendation for minor revision. We address the single major comment below.
read point-by-point responses
Referee: The claim that the 7-dimensional manifold sufficiently captures key factors determining regularization performance relies on its grounding in eight production-grade models. The manuscript should provide more explicit details on how the parameter ranges and feature-generation procedure were selected from these models, as this is load-bearing for the generalizability of the advice on avoiding Lasso at high kappa.
Authors: We agree that additional explicit details on the manifold construction will strengthen the generalizability claims. The parameter ranges for the seven dimensions (n, p, SNR, kappa, sparsity, noise structure, and correlation type) were derived by first fitting the eight production-grade models to representative datasets and extracting empirical quantiles for each factor; the feature-generation procedure then samples covariance matrices via eigenvalue scaling to achieve target condition numbers while preserving the observed marginal distributions. In the revised manuscript we will add a dedicated subsection (new Section 3.1) containing a table of the exact quantile ranges extracted from each model, the eigenvalue-adjustment algorithm, and a short justification for why these ranges cover the relevant operating regime for the Lasso-failure advice. This revision directly supports the conditioned recommendation against Lasso at high kappa.
Revision: yes
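As a reading aid, here is one plausible interpretation of the eigenvalue-scaling step described in the response: rescale the spectrum of a covariance matrix so that its condition number hits a target value. This is a sketch of the general technique, not the authors' procedure, and it omits the marginal-distribution-preserving part mentioned in the rebuttal.

```python
import numpy as np

def rescale_to_condition_number(cov, target_kappa):
    """Rescale the eigenvalues of a covariance matrix so that its condition
    number (largest / smallest eigenvalue) equals target_kappa.  One plausible
    reading of 'eigenvalue scaling', not the authors' code."""
    eigvals, eigvecs = np.linalg.eigh(cov)
    lo, hi = eigvals.min(), eigvals.max()
    # Affine map of the spectrum onto [1, target_kappa].
    new_vals = 1.0 + (eigvals - lo) * (target_kappa - 1.0) / (hi - lo)
    return eigvecs @ np.diag(new_vals) @ eigvecs.T

rng = np.random.default_rng(0)
A = rng.normal(size=(20, 20))
cov = A @ A.T                                  # random positive definite matrix
scaled = rescale_to_condition_number(cov, target_kappa=1000.0)
print(np.linalg.cond(scaled))                  # ~1000
```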
Circularity Check
No significant circularity
full rationale
This is a purely empirical simulation study with no derivations, self-referential equations, or fitted parameters underlying the claims. All reported metrics are direct outputs from 134,400 controlled experiments on a 7-dimensional manifold. The central contrasts (e.g., Lasso vs ElasticNet recall at high kappa/low SNR) are traceable to the explicit simulation procedure and replication count supplied in the text, with no load-bearing steps that reduce by construction to inputs or self-citations.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: The 7-dimensional manifold of simulation parameters, grounded in eight production-grade ML models, captures relevant real-world variability for regularization performance.