Bias-Aware External-Model-Assisted Inference in High-Dimensional Regression
Pith reviewed 2026-06-27 04:27 UTC · model grok-4.3
The pith
A bias-aware shrinkage step routes external predictors into the variance of debiased high-dimensional estimators, producing shorter valid intervals than PPI or debiased Lasso.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The Debiased External-model-Assisted Lasso (DEAL) routes the external estimator and the unlabeled covariates into the variance of a debiased estimator, with a bias-aware, cross-fitted shrinkage step that adapts across target-only, near-oracle, and biased-but-informative regimes. The paper proves coordinate-wise asymptotic normality with an adaptive variance, extends validity to the projection parameter under misspecification and nonlinear labelers, and shows that, at a common unlabeled budget, DEAL intervals are shorter than those of debiased Lasso, PPI, and PPI++. A shift-aware variant preserves coverage under covariate shift.
What carries the argument
The bias-aware, cross-fitted shrinkage step that decides how much external-model information to fold into the variance estimator based on estimated bias.
If this is right
- At fixed unlabeled budget, DEAL intervals are shorter than those from debiased Lasso, PPI, and PPI++.
- The procedure remains valid for the projection parameter when the external model is misspecified or the labeler is nonlinear.
- A shift-aware variant maintains coverage when the distribution of unlabeled covariates differs from the labeled data.
- In simulations the interval lengths are between 0.49 and 0.87 times the debiased-Lasso length; in real data the median ratios range from 0.23 to 0.53.
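One hedged way to read these length ratios is as an equivalent multiplier on the labeled sample size. The sketch below assumes that interval length scales as 1/sqrt(n) in the number of labeled observations, which is the usual root-n rate; the source does not make this conversion itself, and the function name `equivalent_sample_multiplier` is hypothetical.

```python
def equivalent_sample_multiplier(length_ratio: float) -> float:
    """Factor by which the labeled sample would need to grow for the
    baseline interval to shrink to `length_ratio` times its length,
    ASSUMING length is proportional to 1 / sqrt(n)."""
    if length_ratio <= 0:
        raise ValueError("length_ratio must be positive")
    return 1.0 / length_ratio ** 2


# Endpoints of the ranges reported in the source
# (simulations: 0.49-0.87, real-data medians: 0.23-0.53).
reported_ratios = [0.87, 0.49, 0.53, 0.23]
multipliers = {r: equivalent_sample_multiplier(r) for r in reported_ratios}

for ratio, mult in multipliers.items():
    print(f"length ratio {ratio:.2f} -> about {mult:.1f}x labeled samples")
```

Under that scaling assumption, a ratio of 0.5 corresponds to a fourfold increase in labeled data, so the smaller reported ratios would represent large savings in labeling effort.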
Where Pith is reading between the lines
- The same adaptive routing of external information into variance could be tried with other high-dimensional estimators beyond the Lasso.
- When the external model is a large language model, DEAL offers a way to obtain shorter scientific intervals without retraining the model on the target labels.
- The length gains may be largest when unlabeled data are cheap relative to labeled data, suggesting a practical allocation rule for future studies.
Load-bearing premise
The cross-fitted shrinkage step adapts correctly to different external-model qualities without introducing bias that the asymptotic normality argument does not capture.
What would settle it
A high-dimensional simulation in which the external predictor carries substantial bias yet the shrinkage step fails to downweight it enough, producing intervals whose empirical coverage falls below the nominal level.
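The failure mode described here can be illustrated with a deliberately simplified toy, which is not DEAL and is not high-dimensional. It estimates a scalar mean, shrinks the sample mean toward a biased external guess with a fixed weight that ignores the bias, and builds a normal interval that also ignores the bias. All names and parameter values (`empirical_coverage`, `weight`, `bias`, the sample sizes) are assumptions made for this illustration.

```python
import numpy as np


def empirical_coverage(bias, weight, n=100, reps=2000, theta=1.0, z=1.96, seed=0):
    """Monte Carlo coverage of a naive shrinkage interval for a scalar mean.

    The estimator is (1 - weight) * sample_mean + weight * external_guess,
    where external_guess = theta + bias. The interval half-width accounts
    only for sampling variance, not for the external bias.
    """
    rng = np.random.default_rng(seed)
    y = rng.normal(loc=theta, scale=1.0, size=(reps, n))
    sample_mean = y.mean(axis=1)
    sample_sd = y.std(axis=1, ddof=1)

    external_guess = theta + bias
    estimate = (1 - weight) * sample_mean + weight * external_guess
    half_width = z * (1 - weight) * sample_sd / np.sqrt(n)

    covered = np.abs(estimate - theta) <= half_width
    return float(covered.mean())


# Same fixed weight, with and without external bias.
cov_unbiased = empirical_coverage(bias=0.0, weight=0.5)
cov_biased = empirical_coverage(bias=0.5, weight=0.5)
print(f"coverage, unbiased external guess: {cov_unbiased:.3f}")
print(f"coverage, biased external guess:   {cov_biased:.3f}")
```

With an unbiased external guess the interval stays near its nominal level, while the same weight applied to a biased guess drives coverage far below it. A settling experiment for DEAL would run the analogous check with the paper's actual estimator and look for regimes where its data-driven shrinkage behaves like this fixed weight.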
Original abstract
In high-dimensional semi-supervised linear regression, prediction-powered inference (PPI) corrects an external predictor with a rectifier estimated from the labeled data. In a linear model, however, this rectifier cancels the predictor: PPI and PPI++ reduce to ordinary least squares and can inflate variance when the predictor is close to the oracle. We propose the Debiased External-model-Assisted Lasso (DEAL), which routes the external estimator and the unlabeled covariates into the variance of a debiased estimator, with a bias-aware, cross-fitted shrinkage step that adapts across target-only, near-oracle, and biased-but-informative regimes. We prove coordinate-wise asymptotic normality with an adaptive variance, extend validity to the projection parameter under misspecification and nonlinear labelers, and show that, at a common unlabeled budget, DEAL intervals are shorter than those of debiased Lasso, PPI, and PPI++; a shift-aware variant preserves coverage under covariate shift. In simulations, DEAL intervals are 0.49-0.87 of the debiased-Lasso length, and across six real-data applications spanning astronomy, chemistry, proteomics, and oncology, the last using a large-language-model oracle, they tighten in every case, with median length ratios of 0.23-0.53.
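The cancellation claimed in the abstract can be checked numerically in a simplified setting. The sketch below assumes a low-dimensional design (so that ordinary least squares exists) and an external predictor that is itself linear in the same covariates, f(x) = x^T beta_ext. It uses the standard PPI point estimate for regression coefficients, namely the least-squares fit of predictions on unlabeled data minus the rectifier fitted on labeled data. The sizes, the seed, and the names `ols` and `ppi_linear` are hypothetical, and PPI++ is not covered.

```python
import numpy as np


def ols(X, y):
    """Least-squares coefficients (assumes X has full column rank)."""
    return np.linalg.lstsq(X, y, rcond=None)[0]


def ppi_linear(X_lab, y_lab, X_unlab, predict):
    """PPI point estimate: fit predictions on the unlabeled covariates,
    then subtract the rectifier, i.e. the fit of (prediction - label)
    on the labeled covariates."""
    theta_unlab = ols(X_unlab, predict(X_unlab))
    rectifier = ols(X_lab, predict(X_lab) - y_lab)
    return theta_unlab - rectifier


rng = np.random.default_rng(0)
n, N, p = 200, 2000, 5  # hypothetical labeled size, unlabeled size, dimension
beta = rng.normal(size=p)
X_lab = rng.normal(size=(n, p))
X_unlab = rng.normal(size=(N, p))
y_lab = X_lab @ beta + rng.normal(size=n)

# Hypothetical external model: linear, with biased coefficients.
beta_ext = beta + 0.3 * rng.normal(size=p)


def predict(X):
    return X @ beta_ext


beta_ppi = ppi_linear(X_lab, y_lab, X_unlab, predict)
beta_ols = ols(X_lab, y_lab)
max_gap = float(np.max(np.abs(beta_ppi - beta_ols)))
print(f"max |PPI - OLS| coefficient gap: {max_gap:.2e}")
```

In this setting the fit on unlabeled data returns beta_ext exactly and the rectifier returns beta_ext minus the labeled-data least-squares fit, so the external coefficients drop out and the gap is at floating-point level regardless of how biased beta_ext is.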
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes the Debiased External-model-Assisted Lasso (DEAL) for high-dimensional semi-supervised linear regression. It routes an external predictor through a bias-aware cross-fitted shrinkage step into the variance of a debiased estimator, proving coordinate-wise asymptotic normality with an adaptive variance. The method extends validity to the projection parameter under misspecification and nonlinear labelers, claims shorter intervals than debiased Lasso, PPI, and PPI++ at fixed unlabeled budget, and includes a shift-aware variant. Simulations report interval lengths of 0.49-0.87 times the debiased-Lasso length, and six real-data examples (astronomy through oncology, the last with an LLM oracle) report median length ratios of 0.23-0.53.
Significance. If the asymptotic results hold, DEAL offers a principled way to adaptively incorporate external models in high-dimensional inference without the variance inflation seen in PPI when predictors are near-oracle, while preserving coverage under misspecification. Strengths include the coordinate-wise normality proof, extension to projection parameters, and consistent empirical tightening across regimes and datasets.
major comments (2)
- [§3.2, Theorem 1] §3.2 (bias-aware cross-fitted shrinkage): the proof of coordinate-wise asymptotic normality (Theorem 1) does not supply an explicit bound on the remainder term arising from the data-dependent shrinkage parameter. Without showing that this term is asymptotically negligible uniformly across target-only, near-oracle, and biased-but-informative regimes, residual dependence between cross-fit folds and the primary estimating equation may alter both centering and the claimed adaptive variance formula.
- [§4] §4 (extension to projection parameter under misspecification): the validity claim for nonlinear labelers relies on the same cross-fitted shrinkage step, yet the expansion does not address how the shrinkage adaptation interacts with the misspecification bias term; this is load-bearing for the statement that DEAL remains valid when the external model is biased but informative.
minor comments (2)
- [§3.1] Notation for the shrinkage parameter λ and its cross-fit estimator should be introduced with an explicit equation before its use in the variance formula.
- [Table 1] Table 1 (simulation length ratios): report the number of Monte Carlo replications and whether the reported intervals are averaged over coordinates or selected coordinates.
Simulated Author's Rebuttal
We thank the referee for the careful and constructive review. We address each major comment below and will revise the manuscript accordingly.
- Referee: [§3.2, Theorem 1] §3.2 (bias-aware cross-fitted shrinkage): the proof of coordinate-wise asymptotic normality (Theorem 1) does not supply an explicit bound on the remainder term arising from the data-dependent shrinkage parameter. Without showing that this term is asymptotically negligible uniformly across target-only, near-oracle, and biased-but-informative regimes, residual dependence between cross-fit folds and the primary estimating equation may alter both centering and the claimed adaptive variance formula.
  Authors: We agree that an explicit bound on the remainder term would strengthen the proof. In the revision we will add a supplementary lemma deriving such a bound and verifying its asymptotic negligibility uniformly across the three regimes. This will confirm that cross-fitting removes any residual dependence that could affect centering or the adaptive variance formula.
  Revision: yes
- Referee: [§4] §4 (extension to projection parameter under misspecification): the validity claim for nonlinear labelers relies on the same cross-fitted shrinkage step, yet the expansion does not address how the shrinkage adaptation interacts with the misspecification bias term; this is load-bearing for the statement that DEAL remains valid when the external model is biased but informative.
  Authors: We concur that the interaction between shrinkage adaptation and the misspecification bias term should be made explicit. The revised §4 will augment the expansion to detail this interaction, thereby confirming validity for the projection parameter under nonlinear labelers and biased-but-informative external models.
  Revision: yes
Circularity Check
No significant circularity in derivation chain
The paper extends existing debiased Lasso and PPI methods with a new bias-aware cross-fitted shrinkage step, then claims to prove coordinate-wise asymptotic normality with an adaptive variance that holds under misspecification. No equations or steps in the abstract reduce the adaptive variance, interval lengths, or normality result to a fitted quantity defined by the same data by construction. The derivation builds on prior external methods without load-bearing self-citations or uniqueness theorems imported from the authors' own prior work; the central claims rest on stated assumptions and new proofs rather than self-referential definitions or renaming of known results.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: The data follow a linear model in which the rectifier of PPI cancels the external predictor.
- domain assumption: Cross-fitting produces an adaptive shrinkage factor that remains valid across predictor-quality regimes.