arxiv: 2604.22391 · v1 · submitted 2026-04-24 · stat.ML · cs.LG · stat.CO · stat.ME


Conformalized Super Learner

Zhanli Wu, Fabrizio Leisen, Miguel-Angel Luque-Fernandez, F. Javier Rubio


Pith reviewed 2026-05-08 10:04 UTC · model grok-4.3

classification: stat.ML · cs.LG · stat.CO · stat.ME

keywords: super learner · conformal prediction · prediction intervals · ensemble methods · finite-sample coverage · regression · medical prediction


The pith

The Super Learner ensemble produces prediction intervals with finite-sample coverage by weighting learner conformity scores.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes integrating conformal prediction into the Super Learner by applying the ensemble's weights to combine conformity scores from each learner through a weighted majority vote. This construction mirrors the original Super Learner framework and yields intervals for continuous outcomes with coverage guarantees under exchangeability and related conditions. Simulations confirm valid coverage with performance close to the ideal mechanism, and the method is demonstrated on a medical dataset for predicting creatinine levels from socio-demographic, biometric, and lab data.

Core claim

By applying the Super Learner weights to learner-specific conformity scores and combining them via a weighted majority vote, the resulting intervals for continuous outcomes are claimed to carry finite-sample coverage guarantees under exchangeability. The paper also characterizes their behaviour under potential violations of exchangeability, heteroscedasticity, sparsity, and other forms of distributional heterogeneity.

What carries the argument

Weighted majority vote of learner-specific conformity scores that reuses the Super Learner's performance-based weights.
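As a rough illustration of a weighted majority vote, the sketch below merges per-learner prediction intervals by keeping every value whose weighted vote exceeds one half. This is a simplified interval-level version written for this page, not the paper's construction, which operates on conformity scores. The function names, intervals, and weights are all hypothetical.

```python
def weighted_vote(y, intervals, weights):
    """Total weight of the learners whose interval contains y."""
    return sum(w for (lo, hi), w in zip(intervals, weights) if lo <= y <= hi)


def majority_vote_set(intervals, weights, threshold=0.5):
    """Return {y : weighted vote > threshold} as a list of merged (lo, hi) pieces.

    intervals: one (lo, hi) prediction interval per learner (hypothetical inputs).
    weights:   non-negative ensemble weights, normalized here to sum to 1.
    """
    total = sum(weights)
    weights = [w / total for w in weights]
    # The vote can only change at interval endpoints, so checking the midpoint
    # of each gap between consecutive endpoints is enough.
    cuts = sorted({end for interval in intervals for end in interval})
    pieces = []
    for a, b in zip(cuts, cuts[1:]):
        if weighted_vote(0.5 * (a + b), intervals, weights) > threshold:
            if pieces and pieces[-1][1] == a:
                pieces[-1] = (pieces[-1][0], b)  # extend the previous piece
            else:
                pieces.append((a, b))
    return pieces


# Hypothetical per-learner intervals for one test point, with made-up weights.
intervals = [(0.6, 1.2), (0.7, 1.4), (0.9, 1.6)]
weights = [0.6, 0.3, 0.1]
print(majority_vote_set(intervals, weights))
```

In this toy input the first learner holds more than half of the total weight, so the merged set coincides with its interval. With more balanced weights the result can be a union of several pieces, which is why the function returns a list.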

If this is right

The intervals maintain valid finite-sample coverage under exchangeability and mild violations.
Simulations show competitive performance relative to the true data-generating process.
The approach applies directly to regression tasks with non-linear effects, interactions, sparsity, heteroscedasticity, and outliers.
It produces usable intervals in medical prediction settings such as creatinine level estimation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same weighting-plus-vote construction could be applied to other performance-weighted ensembles beyond the Super Learner.
Extensions to dependent data structures would require only adjustments to the conformity score step.
In high-stakes applications the method could reduce reliance on bootstrap or asymptotic interval methods.

Load-bearing premise

The data points satisfy exchangeability or the mild conditions that allow conformal prediction to guarantee coverage.

What would settle it

A dataset generated under exchangeability where the empirical coverage of the constructed intervals falls substantially below the nominal level.
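A coverage check of this kind can be sketched with a small simulation. The example below uses plain split conformal prediction around a single fixed predictor, not the conformalized Super Learner itself. The data-generating mechanism, the `predict` function, and all sample sizes are assumptions made for illustration.

```python
import math
import random


def conformal_quantile(scores, alpha):
    """ceil((n + 1) * (1 - alpha))-th smallest calibration score (inf if out of range)."""
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    return math.inf if k > n else sorted(scores)[k - 1]


def predict(x):
    """Hypothetical pre-fitted learner, held fixed so calibration and test scores stay exchangeable."""
    return 2.0 * x


def draw(rng):
    """One i.i.d. (x, y) pair with heteroscedastic noise (assumed toy mechanism)."""
    x = rng.uniform(-1.0, 1.0)
    return x, 2.0 * x + rng.gauss(0.0, 0.5 + abs(x))


def empirical_coverage(n_cal, n_test, alpha, rng):
    """Calibrate on n_cal points, then report the share of n_test points covered."""
    cal = [draw(rng) for _ in range(n_cal)]
    q = conformal_quantile([abs(y - predict(x)) for x, y in cal], alpha)
    test = [draw(rng) for _ in range(n_test)]
    return sum(abs(y - predict(x)) <= q for x, y in test) / n_test


rng = random.Random(0)
coverages = [empirical_coverage(200, 500, 0.1, rng) for _ in range(50)]
print(f"mean coverage over 50 replications: {sum(coverages) / len(coverages):.3f}")
```

Averaging over replications matters here: coverage conditional on one calibration set fluctuates around the nominal level, so a single run falling slightly below 90% would not by itself count as a failure.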

Figures

Figures reproduced from arXiv: 2604.22391 by Fabrizio Leisen, F. Javier Rubio, Miguel-Angel Luque-Fernandez, Zhanli Wu.

**Figure 1.** Full-CSL prediction intervals for testing observations ordered by the observed response. Point predictions follow the central location of the observed values across most of the sample. Interval widths remain fairly stable over much of the ordered sequence but become more variable toward the upper tail, reflecting greater predictive uncertainty for larger creatinine values. The figure also shows that severa…

**Figure 2.** Covariate-specific Full-CSL prediction intervals for three representative profiles.

**Figure 3.** Boxplot and histogram for the response (LBXSCR).

**Figure 4.** Full-CSL prediction intervals for testing observations ordered by the observed response after removing outliers.

**Figure 5.** Covariate-specific Full-CSL prediction intervals for three representative profiles after removing outliers.

**Figure 6.** Split-CSL prediction intervals for testing observations ordered by the observed response.

**Figure 7.** Covariate-specific Split-CSL prediction intervals for three representative profiles.

**Figure 8.** Split-CSL prediction intervals for testing observations ordered by the observed response after removing outliers.

**Figure 9.** Covariate-specific Split-CSL prediction intervals for three representative profiles after removing outliers.

**Figure 10.** Random Forest variable importance (IncNodePurity) for predicting serum creatinine. Left: full data set; Right: after removing observations with serum creatinine > 1.5 mg/dL.

**Figure 11.** Regression tree fitted as an interpretable surrogate for the dominant Random Forest learner.

**Figure 12.** Regression tree fitted as an interpretable surrogate for the dominant Random Forest learner after removing outliers.

**Figure 13.** Categorical covariate-specific Full-CSL prediction intervals for three representative profiles.

**Figure 14.** Categorical covariate-specific Split-CSL prediction intervals for three representative profiles.

**Figure 15.** Categorical covariate-specific Full-CSL prediction intervals for three representative profiles after removing outliers.

**Figure 16.** Categorical covariate-specific Split-CSL prediction intervals for three representative profiles after removing outliers.

Original abstract

The Super Learner (SL) is a widely used ensemble method that combines predictions from a library of learners based on their predictive performance. Interval predictions are of considerable practical interest because they allow uncertainty in predictions produced by an individual learner or an ensemble to be quantified. Several methods have been proposed for constructing interval predictions based on the SL; however, these approaches are typically justified using asymptotic arguments or rely on computationally intensive procedures such as the bootstrap. Conformal prediction (CP) is a machine learning framework for constructing prediction intervals with finite-sample and asymptotic coverage guarantees under mild conditions. We propose coupling CP with the SL through a natural construction that mirrors the original SL framework, using individual learner weights and combining learner-specific conformity scores via a weighted majority vote. We characterize the properties of the resulting SL-based prediction intervals for continuous outcomes. We cover settings under exchangeability, potential violations of exchangeability, and data-generating mechanisms exhibiting heteroscedasticity, sparsity, and other forms of distributional heterogeneity. A comprehensive simulation study shows that the conformalized SL achieves valid finite-sample coverage with competitive performance relative to the true data-generating mechanism. A central contribution of this work is an application to predicting creatinine levels using socio-demographic, biometric, and laboratory measurements. This example demonstrates the benefits of an ensemble with carefully selected learners designed to capture key aspects of complex regression functions, including non-linear effects, interactions, sparsity, heteroscedasticity, and robustness to outliers.
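To make the phrase "combines predictions from a library of learners based on their predictive performance" concrete, here is a minimal sketch of performance-based weighting with a two-learner library. It uses out-of-fold predictions and a grid search over a single convex weight. Super Learner implementations typically solve a constrained regression over the whole library instead, so the grid search, the two toy learners, and the simulated data are all simplifying assumptions.

```python
import random


def fit_mean(xs, ys):
    """Learner 0: predicts the training mean everywhere."""
    m = sum(ys) / len(ys)
    return lambda x: m


def fit_line(xs, ys):
    """Learner 1: simple least-squares line."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return lambda x: my + slope * (x - mx)


def out_of_fold_predictions(xs, ys, fitters, n_folds=5):
    """For every point, predictions from learners trained without that point's fold."""
    n = len(xs)
    oof = [[0.0] * len(fitters) for _ in range(n)]
    for fold in range(n_folds):
        train = [i for i in range(n) if i % n_folds != fold]
        held_out = [i for i in range(n) if i % n_folds == fold]
        for k, fit in enumerate(fitters):
            model = fit([xs[i] for i in train], [ys[i] for i in train])
            for i in held_out:
                oof[i][k] = model(xs[i])
    return oof


def convex_weights_two_learners(oof, ys, steps=100):
    """Grid search for w in [0, 1] minimizing the cross-validated squared error."""
    best_w, best_risk = 0.0, float("inf")
    for j in range(steps + 1):
        w = j / steps
        risk = sum((y - (w * z[0] + (1 - w) * z[1])) ** 2 for z, y in zip(oof, ys)) / len(ys)
        if risk < best_risk:
            best_w, best_risk = w, risk
    return [best_w, 1.0 - best_w]


# Toy data with a linear signal, so the line learner should dominate.
rng = random.Random(1)
xs = [rng.uniform(-1.0, 1.0) for _ in range(200)]
ys = [2.0 * x + rng.gauss(0.0, 0.3) for x in xs]
sl_weights = convex_weights_two_learners(out_of_fold_predictions(xs, ys, [fit_mean, fit_line]), ys)
print(sl_weights)
```

The weights are computed from the same data the learners are trained on, which is the dependence the editorial analysis later questions.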

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper wraps conformal prediction around the super learner with a weighted majority vote on per-learner conformity scores, but the finite-sample coverage claim rests on whether that aggregation preserves the necessary symmetry.


The main point is a construction that takes the super learner weights and applies them to combine conformity scores from each base learner via weighted majority vote, then forms intervals from the resulting score. This is presented as a natural extension that mirrors how the super learner itself works, and the authors characterize the intervals under exchangeability plus some cases with heteroscedasticity and heterogeneity. They back it with simulations and a creatinine prediction example using socio-demographic and lab data.

The simulations check coverage across varied data-generating processes and show the method stays competitive with the oracle. The real-data application illustrates how mixing learners that capture nonlinearity, interactions, and robustness can improve the ensemble in practice. That part is concrete and useful for applied settings.

The soft spot is the coverage guarantee. Standard conformal prediction relies on the nonconformity score having a uniform rank among exchangeable calibration points. Here the weights come from cross-validation on the training data, so they are data-dependent, and the weighted vote is not obviously a symmetric function of the full set of points. The abstract claims a characterization, but if the proof treats the weights as fixed or does not explicitly handle the dependence, the finite-sample result does not transfer directly. That is the place where extra conditions or a revised argument would be needed.

This work is aimed at statisticians and analysts who already use super learners for regression and want finite-sample intervals without heavy bootstrapping. The medical-style example makes it relevant for prediction tasks with mixed feature types. It deserves a serious referee because the idea is practical, the experiments are there, and the open theoretical question is well-defined enough for reviewers to address. I would send it to peer review with the expectation that the coverage argument gets the closest look.
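The rank argument the letter refers to can be stated compactly. The block below gives the standard split-conformal guarantee and marks where data-dependent weights enter. The notation ($S_i$, $\hat q$, $\hat w$, $s_w$) is introduced here for illustration and is not taken from the paper.

```latex
% Standard split-conformal guarantee; notation is assumed, not the paper's.
% S_1, ..., S_n are calibration scores and S_{n+1} is the test-point score.
\[
  \hat q = S_{(\lceil (n+1)(1-\alpha) \rceil)}
  \quad (\hat q = \infty \text{ if } \lceil (n+1)(1-\alpha) \rceil > n),
  \qquad
  \Pr\bigl( S_{n+1} \le \hat q \bigr) \ge 1 - \alpha
  \quad \text{whenever } (S_1, \dots, S_{n+1}) \text{ is exchangeable.}
\]
% With an aggregated score that depends on estimated weights,
\[
  S_i = s_{\hat w}(X_i, Y_i), \qquad i = 1, \dots, n+1,
\]
% exchangeability of the S_i is inherited from the data if \hat w is
% independent of the n+1 points (for example, estimated on a separate
% hold-out sample) or is a symmetric function of all n+1 points.
% If \hat w depends on the calibration points but not on the test point,
% this symmetry is not automatic.
```

The last comment is the gap the letter points to: weights fitted by cross-validation on data that also supply the conformity scores fall into the third case unless the proof handles the dependence explicitly.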

Referee Report

1 major / 2 minor

Summary. The paper proposes coupling conformal prediction with the Super Learner (SL) ensemble by constructing prediction intervals via a weighted majority vote on learner-specific conformity scores, where weights are the standard SL cross-validated weights. It characterizes the resulting intervals' properties for continuous outcomes under exchangeability, potential violations thereof, heteroscedasticity, sparsity, and heterogeneity. A simulation study reports valid finite-sample coverage and competitive performance, and the method is applied to creatinine level prediction using socio-demographic, biometric, and lab data.

Significance. If the finite-sample coverage claim holds after addressing the data-dependent weights, the work would provide a practical, distribution-free way to obtain valid intervals for a widely used ensemble method without relying on asymptotics or bootstrap. The simulation and real-data example illustrate utility for complex regression settings involving nonlinearity, interactions, and outliers.

major comments (1)

[Theoretical characterization (properties under exchangeability)] The central finite-sample coverage claim under exchangeability rests on the weighted majority vote inheriting the uniform rank property of standard conformal scores. However, the SL weights are obtained via cross-validation on the training data (which overlaps with the points used to form conformity scores), breaking the symmetry required for the standard exchangeability argument. The characterization section must explicitly state the conditions (if any) under which coverage is preserved, or provide a modified proof that accounts for the dependence induced by the weights; without this, the guarantee does not transfer directly from individual learners.

minor comments (2)

[Abstract] The abstract states that the method 'achieves valid finite-sample coverage' in simulations but does not clarify whether the theoretical characterization delivers unconditional finite-sample guarantees or only asymptotic or conditional results.
[Simulation study] Simulation details on data exclusion rules, exact cross-validation scheme for SL weights, and how conformity scores are normalized across learners should be expanded for reproducibility.

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for the careful and constructive review of our manuscript on the Conformalized Super Learner. The feedback highlights an important subtlety in the theoretical characterization, which we address point by point below. We will revise the manuscript to strengthen the presentation of the coverage guarantees.


Referee: The central finite-sample coverage claim under exchangeability rests on the weighted majority vote inheriting the uniform rank property of standard conformal scores. However, the SL weights are obtained via cross-validation on the training data (which overlaps with the points used to form conformity scores), breaking the symmetry required for the standard exchangeability argument. The characterization section must explicitly state the conditions (if any) under which coverage is preserved, or provide a modified proof that accounts for the dependence induced by the weights; without this, the guarantee does not transfer directly from individual learners.

Authors: We agree that the data-dependent nature of the Super Learner weights, derived from cross-validation on the training set, requires explicit treatment to ensure the exchangeability argument is rigorous. The original characterization implicitly relies on the full dataset being exchangeable, which preserves symmetry for the weighted scores in the marginal sense, but we acknowledge that a direct transfer from unweighted conformal scores needs clarification. In the revised manuscript, we will update the theoretical section to explicitly state the conditions: the finite-sample coverage guarantee holds exactly when weights are computed on a hold-out set disjoint from the conformity score points, or approximately when using the full training data under exchangeability (with the dependence not affecting the uniform rank property due to permutation invariance of the CV procedure). We will include a brief proof sketch or reference to supporting arguments from the conformal prediction literature on data-dependent scores, along with a discussion of finite-sample robustness as evidenced by the simulations.

revision: yes
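The hold-out condition mentioned in the simulated response can be pictured as a three-way partition of the data. The sketch below is one possible arrangement and not a procedure described in the paper. The block names and the 50/25/25 fractions are assumptions.

```python
import random


def three_way_split(n, frac_fit=0.5, frac_weights=0.25, seed=0):
    """Shuffle indices 0..n-1 and cut them into three disjoint blocks.

    fit:         train every learner in the library
    weights:     estimate the ensemble weights from out-of-sample predictions
    calibration: compute conformity scores with learners and weights held fixed
    """
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    a = int(n * frac_fit)
    b = a + int(n * frac_weights)
    return {"fit": idx[:a], "weights": idx[a:b], "calibration": idx[b:]}


split = three_way_split(1000)
print({name: len(block) for name, block in split.items()})
```

Because the calibration block plays no role in fitting the learners or the weights, the aggregated scores on that block and on a new test point are computed with the same fixed rule. The cost is that fewer observations remain for each step.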

Circularity Check

0 steps flagged

No significant circularity; coverage characterization is independent of fitted weights

full rationale

The paper defines a new aggregation of learner-specific conformity scores using SL weights (obtained via cross-validation) and claims to characterize finite-sample coverage under exchangeability for the resulting intervals. This characterization relies on standard conformal prediction rank-uniformity arguments applied to the aggregated nonconformity measure rather than redefining coverage in terms of the weights themselves. No equation reduces the coverage claim to a fitted quantity or prior self-citation; the construction is presented as a direct but non-tautological extension of existing CP and SL frameworks. The derivation chain therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Review performed on abstract only; full derivations and any additional assumptions are not visible.

axioms (1)

domain assumption: Data points are exchangeable or satisfy mild conditions allowing conformal prediction to achieve finite-sample coverage.
The abstract explicitly lists settings under exchangeability and mild conditions for the guarantees.

pith-pipeline@v0.9.0 · 5567 in / 1183 out tokens · 27627 ms · 2026-05-08T10:04:15.426407+00:00 · methodology

