Solve for the Hyperparameter, Skip the Search: Kolmogorov-Optimal Scaling Laws for Spline Regression

Kathleen A. Yearick; Yong Yi Bay

arXiv: 2606.23575 · v1 · pith:SOWARJJDnew · submitted 2026-06-22 · 💻 cs.LG · cs.AI · cs.NA · math.NA · stat.ML

Pith reviewed 2026-06-26 08:49 UTC · model grok-4.3

keywords: spline regression · hyperparameter tuning · Kolmogorov n-width · scaling laws · ANOVA decomposition · PRESS identity · resolution estimation · leave-one-out cross-validation

The pith

The optimal resolution for spline regression follows from a closed-form expression built on Kolmogorov n-widths and a single-fit leave-one-out error estimate, with no grid search.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that the hyperparameter controlling spline resolution need not be found by trying many values and scoring each with cross-validation. Classical approximation theory fixes the squared bias as a known power of the resolution G, the number of basis functions is an explicit polynomial in G, and the leave-one-out error can be recovered from a single fit via the PRESS identity. Balancing the resulting bias and variance curves yields the minimizing resolution directly. The same calculus extends to high-dimensional inputs by replacing ambient dimension with the order of active interactions in an ANOVA decomposition, producing scaling laws in effective sample density (sample size per active component) whose exponents do not contain the input dimension. The resulting KORE procedure fits two pilot resolutions to calibrate the unknown bias and noise scales, then evaluates the closed-form solution and checks it with a small leave-one-out certificate.

Core claim

The optimal resolution G* is the analytic minimizer obtained by setting the derivative of the estimated risk (power-law squared bias at the Kolmogorov n-width rate plus PRESS-derived variance) to zero. When ambient input dimension is replaced by the interaction order r of the ANOVA decomposition, the same minimizer yields power-law scaling of both the optimal G and the risk in the effective density, n divided by the number of active r-way terms, with exponents independent of the ambient dimension d.
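
As a worked sketch of this balance, suppose squared bias decays as A G^{-2β} and variance grows as τ m G^r / n, where m is the number of active r-way terms and each term contributes roughly G^r basis functions. The symbols A, τ, β, m and the exact form of the variance term are illustrative assumptions, not the paper's stated notation.

```latex
\begin{aligned}
R(G) &= A\,G^{-2\beta} + \tau\,\frac{m\,G^{r}}{n}, \\
R'(G) &= -2\beta A\,G^{-2\beta-1} + \frac{r\,\tau\,m}{n}\,G^{r-1} = 0
\;\Longrightarrow\;
G^{\star} = \left(\frac{2\beta A}{r\,\tau}\,\rho\right)^{\frac{1}{2\beta+r}},
\qquad \rho = \frac{n}{m}, \\
R(G^{\star}) &= A\left(1+\frac{2\beta}{r}\right)\left(G^{\star}\right)^{-2\beta}
\;\propto\; \rho^{-\frac{2\beta}{2\beta+r}}.
\end{aligned}
```

Under these assumptions the ambient dimension d appears in neither exponent, and RMSE scales as ρ^{-β/(2β+r)}. With β = 4 this gives −4/9 at r = 1 and −4/10 at r = 2, which agrees with the reference slopes quoted in the caption of Figure 2.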

What carries the argument

The Kolmogorov n-width of the smoothness class, which supplies the exact power-law form of squared bias as a function of resolution G.

If this is right

Optimal resolution and risk become explicit power functions of sample size per active interaction component.
Ambient input dimension drops out of the exponent once interaction order is fixed.
KORE performs roughly eight times fewer fits than a full grid sweep while matching exhaustive 3-fold cross-validation.
On 36 real tabular datasets the method ranks first among 21 methods in accuracy per unit of compute.
KORE also matches the full classical ladder of selection criteria (GCV, Mallows' Cp, AIC, BIC) on the same targets.
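
The first two consequences can be checked numerically with a minimal sketch. It assumes a risk curve of the form A G^(-2β) + τ m G^r / n; the function names and all constants are hypothetical and chosen only for illustration, not taken from the paper.

```python
import numpy as np

def model_risk(G, n, m, A, tau, beta, r):
    """Assumed risk curve: power-law squared bias plus variance from m terms of ~G**r bases."""
    G = np.asarray(G, dtype=float)
    return A * G ** (-2.0 * beta) + tau * m * G ** r / n

def closed_form_resolution(n, m, A, tau, beta, r):
    """Stationary point of model_risk in G (continuous, before rounding to an integer)."""
    rho = n / m  # effective density: samples per active component
    return (2.0 * beta * A * rho / (r * tau)) ** (1.0 / (2.0 * beta + r))

# Illustrative constants only; none of these values come from the paper.
A, tau, beta, r = 50.0, 0.04, 4, 1

g_star = closed_form_resolution(n=4000, m=10, A=A, tau=tau, beta=beta, r=r)

# Brute-force reference: scan a fine grid and keep the minimizer.
grid = np.linspace(1.0, 40.0, 400_001)
g_grid = grid[np.argmin(model_risk(grid, 4000, 10, A, tau, beta, r))]

# Doubling both n and m leaves rho, and therefore G*, unchanged.
g_same_rho = closed_form_resolution(n=8000, m=20, A=A, tau=tau, beta=beta, r=r)

print(f"closed form G* = {g_star:.4f}, grid argmin = {g_grid:.4f}")
```

The closed form and the grid scan agree up to the grid spacing, and the resolution depends on n and m only through their ratio, which is the sense in which the ambient dimension drops out.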

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

If approximation rates are known for other bases, the same two-pilot calibration could replace search for their resolution or bandwidth parameters.
The interaction-order reduction suggests that problems whose complexity lives in low-order terms remain tractable even when ambient dimension grows.
The leave-one-out certificate after the closed-form evaluation supplies a cheap consistency check that could be used to decide whether a third pilot fit is warranted.

Load-bearing premise

Squared bias equals exactly a known power of resolution G given by the Kolmogorov n-width of the smoothness class.

What would settle it

On data drawn from a function whose smoothness class is known in advance, compare the closed-form resolution against the resolution that actually minimizes out-of-sample error; a large systematic difference falsifies the claim.

Figures

Figures reproduced from arXiv: 2606.23575 by Kathleen A. Yearick, Yong Yi Bay.

**Figure 1.** Law-driven versus search-driven resolution selection. Cross-validation evaluates every grid candidate (clay dots). KORE fits two pilot resolutions, identifies the bias and variance scale constants, and reports the closed-form optimum (navy star). The dashed curves show how squared bias (falling) and variance (rising) compose the U-shaped error curve.

**Figure 2.** Effective-density collapse. The horizontal axis is ρ; the vertical axis is test RMSE at the KORE-selected G⋆. Panel (a) shows additive targets with ρ = n/d and reference slope ρ^{−4/9}. Panel (b) shows sparse pairwise targets with ρ = n/s and reference slope ρ^{−4/10}. The four dimensions d ∈ {10, 20, 40, 80} collapse onto a single curve.

**Figure 3.** Bias-scale recovery as predicted by Theorem 2: median and interquartile band of the ratio Â_f / A_f versus sample size n along a geometric ladder, with the population value 1 marked.

**Figure 4.** Noise-scale recovery as predicted by Theorem 2: median and interquartile band of τ̂_f / τ_f versus sample size n along the same geometric ladder, with the population value 1 marked.

**Figure 5.** Plug-in continuous optimizer Ĝ†_f versus the anchored population target G•_f along the sample-size ladder. Shaded band: ±1σ across 20 seeds; the diamond marker is the anchored target at the largest n.

**Figure 6.** Selection cost across six controlled tasks (three additive, three sparse pairwise). Markers report the number of model fits used by KORE, the four full-grid criteria (GCV, Cp, AIC, BIC), and exhaustive 3-fold cross-validation.

**Figure 7.** Per-task accuracy parity on the same six tasks: ratio of KORE test RMSE to exhaustive 3-fold cross-validation, sorted by family. Bars at or below 1.0 favor the closed-form selector.

**Figure 8.** Geometric-mean RMSE ratio versus exhaustive cross-validation, aggregated across all six controlled tasks, for the five selectors compared in this section. Bars to the left of 1.0 match or beat exhaustive search.

**Figure 9.** Nine law-aligned benchmark equations. Panel (a) gives the per-task RMSE ratio of KORE against 3-fold CV. Panel (b) gives the corresponding fit-count reduction (CV fits divided by KORE fits). Panel (c) gives the geometric-mean RMSE ratio against CV across the nine tasks for the five selectors.

**Figure 10.** Forest plot of mean test RMSE per benchmark across the full 12-equation suite, ranked by KORE / CV ratio. Markers compare KORE (closed-form), exhaustive 3-fold cross-validation, and GCV. Raw RMSE ratios are tabulated in …

**Figure 11.** Nemenyi critical-difference diagram on Compute-Normalized Lift over OLS, CNL_α = max{0, max(0, R²) − max(0, R²_OLS)} / (1 + t)^α at α = 1, across all 21 methods with complete five-seed coverage on every one of the 36 datasets. A lower mean rank means greater OLS-relative lift per unit compute; methods connected by a horizontal bar are statistically indistinguishable at α_Nemenyi = 0.05. OLS itself has CNL identi…

**Figure 12.** Per-method paired Wilcoxon signed-rank test of CNL_{KORE,d,c} − CNL_{m,d,c} against zero at α = 1, paired across (dataset, seed) and Holm-Bonferroni corrected over the 20-method family. Bars right of the dashed reference are statistically distinguishable from KORE at p_Holm = 0.05; blue marks methods where KORE has the higher Compute-Normalized Lift over OLS, red marks the (none in this panel) where the competit…

**Figure 13.** Per-dataset Compute-Normalized Lift, KORE versus k-NN, on the full 36-dataset suite. The dashed diagonal is y = x. Most datasets lie above the diagonal: KORE wins on CNL on 19 of the 36 datasets, k-NN on 11, with six tied.

**Figure 14.** KORE mean Friedman rank stratified by training-set-size quartile (lower is better). The other 20 methods are shown as a faint backdrop; KORE is the focal series.

**Figure 15.** KORE per-dataset Friedman rank versus post-one-hot dimension d. The dashed vertical reference marks the pre-registered d = 30 cutoff. Median rank in each region is reported as a horizontal segment.

**Figure 16.** Per-dataset Compute-Normalized Lift log-ratio against each classical spline resolution selector: each point is log10(CNL_KORE / CNL_competitor) on a single dataset. Points right of zero favor KORE; black diamonds mark medians; y-tick parentheticals count (KORE wins/datasets). Each row is one classical resolution selector and each point one dataset; the concentration of points to the right of zero, with median…

**Figure 17.** Efficiency-accuracy Pareto frontier on the smooth-low-d subset of 25 datasets (post-one-hot d ≤ 30). Both axes are ratios against KORE on log scales, so KORE sits at (1, 1). Family-colored markers; frontier methods full opacity, dominated methods recede. KORE sits at the elbow.

**Figure 18.** Sensitivity of the Compute-Normalized Lift verdict to the compute weight α in δ_{m,d,c}(α) = CNL_α(KORE) − CNL_α(m). Panel (a): paired Wilcoxon test at α = 1, Holm-Bonferroni corrected. Panel (b): count of competitors with significantly higher and significantly lower CNL than KORE as α sweeps {0, 0.25, 0.5, 1, 2}; at α = 0 (pure lift over OLS) the panel splits 9-9 between boosters that win on raw OLS-relative …

**Figure 19.** Empirical confirmation of Proposition 2. Empirical standard deviation of Â_f (left) and τ̂_f (right) across 100 noise replicates per training size n, against the 1/√n envelope predicted by the proposition. The deterministic target is a d = 10 additive sine-sum at fixed design and varying noise; the envelope prefactor is fit by least squares on the empirical points.

**Figure 20.** Residual graph discovery. Once the sample budget reaches roughly n/s ≥ 240, the discovered graph matches the oracle graph almost exactly and the resulting RMSE is indistinguishable from the oracle-graph model.

**Figure 21.** Robustness across four conditions: Control (correct smooth low-order structure), 3-way (genuine 3-way interactions), Non-smooth (non-smooth target), and Wrong graph (approximate interaction graph). Numbers 12 and 24 denote input dimension d. Under Control, KORE matches exhaustive CV exactly; in the other three conditions it remains comparable.

**Figure 22.** Scaling with input dimension from d = 10 to d = 80. Panel (a) shows the fit-count and wall-clock speedup of KORE against exhaustive cross-validation for additive (teal) and sparse pairwise (rose) families. Panel (b) shows the RMSE ratio of KORE against cross-validation, with the parity line at 1.0 for reference.

**Figure 23.** Degree ablation in the interior-optimum regime (d = 20, 10% training noise, five seeds per cell). Each marker is the mean closed-form plug-in resolution Ĝ† as a function of effective density ρ = n/d. Dotted lines show the predicted power law ρ^{1/(2β+1)} at each spline degree, with β = k + 1.

**Figure 24.** Per-dataset Compute-Normalized Lift ratio against KORE on the full 36-dataset OpenML-CTR23 plus UCI suite, for the four strongest CNL competitors (k-NN, kernel ridge, BIC-tuned splines, GCV-tuned splines). Markers are medians across five seeds; markers right of 1× favor KORE, markers left favor the competitor.

**Figure 25.** Per-dataset Compute-Normalized Lift ratio on the smooth-low-d subset (post-one-hot dimension at most 30), restricted to the regime in which the closed-form law is calibrated. Same markers and conventions as …

**Figure 26.** Per-cell peak resident-set size on the real-world benchmark, one row per method: the line runs from the method’s median (open marker) to its maximum (filled marker) across all datasets, sorted by the maximum and colored by family, with the soft 8 GiB per-cell cap as the dashed reference. Every method’s median cell is small; only the classical full-grid spline selectors carry a tail past the cap on the hig…
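
The Compute-Normalized Lift used in several captions above is defined in the caption of Figure 11. A minimal sketch of that formula follows; the function name is hypothetical, and t is assumed here to be a non-negative wall-time measurement, since the caption does not state its units or normalization.

```python
def compute_normalized_lift(r2, r2_ols, t, alpha=1.0):
    """CNL_alpha = max{0, max(0, R2) - max(0, R2_OLS)} / (1 + t)**alpha.

    r2     : test R^2 of the method
    r2_ols : test R^2 of the OLS baseline on the same split
    t      : compute cost of the method (assumed to be wall time, t >= 0)
    alpha  : compute weight; alpha = 0 ignores compute entirely
    """
    lift = max(0.0, max(0.0, r2) - max(0.0, r2_ols))
    return lift / (1.0 + t) ** alpha

# Hypothetical scores, purely to exercise the function.
cnl_method = compute_normalized_lift(r2=0.9, r2_ols=0.5, t=1.0, alpha=1.0)
cnl_no_compute_penalty = compute_normalized_lift(r2=0.9, r2_ols=0.5, t=1.0, alpha=0.0)
cnl_ols = compute_normalized_lift(r2=0.5, r2_ols=0.5, t=0.2)
```

Because the lift is clipped at zero, a method that does no better than OLS scores zero regardless of how fast it runs, which is why OLS itself always has a CNL of zero.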

Original abstract

Hyperparameter tuning almost always means search: fit the model at every value on a grid, score each by cross-validation, and keep the winner. For spline regression that search is unnecessary. The optimal resolution can be solved for in closed form, to the accuracy an exhaustive search reaches, at a fraction of the compute. Three ingredients make this possible: classical approximation theory pins the squared bias to a known power of the resolution G, exactly the Kolmogorov n-width of the smoothness class; the basis dimension is an explicit polynomial in G; and leave-one-out error follows from a single fit via the PRESS identity. Balancing the two known curves gives the minimizer analytically. We extend this calculus to many coordinates by replacing ambient input dimension with interaction order, the number of active low-order components in an ANOVA decomposition, yielding a scaling law in which the optimal resolution and error are power functions of the effective density (sample size per active component), with input dimension absent from the exponent. The law becomes an algorithm. KORE (Kolmogorov-optimal Order-aware Resolution Estimation) fits two pilot resolutions, solves a leverage-calibrated 2x2 system for the bias and noise scales, and evaluates the closed-form plug-in resolution with a tiny leave-one-out certificate: about a dozen fits instead of a full grid sweep, with a consistency guarantee as the sample grows. Across additive and sparse pairwise targets up to 80 input dimensions, KORE matches exhaustive 3-fold cross-validation and the full classical ladder (GCV, Mallows' Cp, AIC, BIC) while fitting roughly 8x fewer models; on 36 real tabular datasets it ranks first among 21 methods in accuracy per unit of compute, ahead of tuned boosters and kernel machines. When complexity lives in low interaction order, solving for the resolution beats searching for it.
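
The PRESS identity mentioned in the abstract can be illustrated with a short sketch: for an ordinary least-squares fit, each leave-one-out residual equals the in-sample residual divided by one minus the leverage. The truncated-power basis, data, and function names below are stand-ins chosen for illustration and are not the paper's estimator.

```python
import numpy as np

def press_loo_residuals(B, y):
    """Leave-one-out residuals of a least-squares fit from one fit, via e_i / (1 - h_ii)."""
    Q, _ = np.linalg.qr(B)            # thin QR; B is assumed to have full column rank
    fitted = Q @ (Q.T @ y)            # hat matrix H = Q Q^T applied to y
    leverage = np.sum(Q * Q, axis=1)  # diagonal of H
    return (y - fitted) / (1.0 - leverage)

def refit_loo_residuals(B, y):
    """Reference: the same residuals obtained by refitting n times."""
    n = len(y)
    out = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        coef, *_ = np.linalg.lstsq(B[keep], y[keep], rcond=None)
        out[i] = y[i] - B[i] @ coef
    return out

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=60)
y = np.sin(2.0 * np.pi * x) + 0.1 * rng.standard_normal(60)

# Stand-in design matrix: a cubic truncated-power spline basis with 4 interior knots.
knots = np.linspace(0.2, 0.8, 4)
B = np.column_stack(
    [x ** p for p in range(4)] + [np.clip(x - k, 0.0, None) ** 3 for k in knots]
)

fast = press_loo_residuals(B, y)   # one fit
slow = refit_loo_residuals(B, y)   # n refits
loo_mse = np.mean(fast ** 2)       # leave-one-out mean squared error
```

The single-fit route costs one factorization, while the reference loop costs n least-squares solves, which is the saving the abstract relies on when it scores a resolution without refitting.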

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

KORE solves for spline resolution in closed form from n-width bias plus PRESS, matching CV with far less compute on the tested cases, but the exact power-law bias assumption is the part that needs checking.

The main claim is that you can derive the optimal resolution G for splines directly instead of searching a grid with cross-validation. They fix squared bias to a power of G from the Kolmogorov n-width of the smoothness class, use the explicit polynomial dimension of the basis, and apply the PRESS identity for the variance term, then balance the two to get an analytic minimizer. KORE implements this by fitting two pilot resolutions, solving a 2x2 system for the bias and noise scales, and plugging in, with a small leave-one-out check afterward. For high dimensions they replace ambient dimension with interaction order from an ANOVA decomposition, so the scaling law depends on effective density per active component.

This combination and the resulting algorithm look new relative to the classical approximation theory cited. On the reported tests it matches 3-fold CV and the usual criteria (GCV, Cp, AIC, BIC) for additive and sparse pairwise targets up to 80 dimensions, and on 36 real tabular datasets it comes out ahead of tuned boosters and kernel machines when measured by accuracy per unit of compute. The roughly 8x reduction in models fitted is a clear practical point if the numbers hold.

The soft spot is the assumption that squared bias equals exactly θ G^{-α} with α from the n-width and no lower-order terms or G-dependent constants. Kolmogorov n-width supplies minimax rates over a function class, typically asymptotic, so for concrete splines on a fixed target the realized bias can deviate. If the two-pilot calibration recovers the wrong scales, the plug-in G* will miss the true risk minimum. That step carries every later claim about matching exhaustive search and dimension-independent scaling. The circularity from estimating scales on the evaluation data is secondary but real.

This is for researchers in nonparametric regression or practitioners tuning splines on tabular data who care about reducing search cost when interaction order is low. The work is grounded enough in classical results and has direct empirical comparisons, so it deserves a serious referee even if the bias assumption needs tighter justification or more validation.

Referee Report

2 major / 2 minor

Summary. The paper claims that for spline regression the optimal resolution G can be obtained in closed form by balancing an exact power-law squared-bias term taken from the Kolmogorov n-width of the smoothness class against an explicit polynomial basis dimension and a PRESS-derived variance term; the resulting KORE algorithm estimates the two unknown scales from two pilot fits, plugs the closed-form G* into the model, and achieves accuracy comparable to exhaustive 3-fold CV or classical criteria (GCV, Cp, AIC, BIC) while using roughly 8× fewer fits. The method is extended to high-dimensional additive and sparse pairwise targets by replacing ambient dimension with interaction order, producing scaling laws in which optimal resolution and risk depend only on effective density.

Significance. If the analytic minimizer is reliable, the work supplies a low-compute, theoretically grounded alternative to grid search for spline hyper-parameters and a dimension-free scaling law for models whose complexity is governed by low-order interactions. The empirical ranking on 36 real tabular data sets and the consistency guarantee as n grows are concrete strengths.

major comments (2)

[§3] §3 (derivation of closed-form G*): the central step equates squared bias exactly to θ G^{-α} with α taken from the Kolmogorov n-width of the smoothness class. Kolmogorov n-width supplies only the asymptotic minimax rate; for a concrete spline subspace and fixed target the realized bias can contain G-dependent constants, log factors, or slower decay. Because the subsequent plug-in G* and all scaling-law claims rest on this exact power-law identity, the manuscript must either derive a bound showing that lower-order terms do not change the location of the minimizer or demonstrate empirically that the recovered G* remains within a small relative error of the true risk minimizer across the tested regimes.
[§4.2] §4.2 (two-pilot calibration): the 2×2 leverage-calibrated system solves for bias and noise scales from two pilot resolutions fitted to the same data later used for evaluation. While the paper states a consistency guarantee as n→∞, for finite samples the procedure is partly data-dependent; the manuscript should report the sensitivity of the final G* to the choice of the two pilot resolutions and quantify how often the plug-in G* deviates from the CV optimum by more than a stated tolerance on the synthetic suite.
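
The first major comment can be probed numerically. The sketch below multiplies an assumed power-law bias by a hypothetical lower-order correction (1 + c/G) and measures how far the true minimizer moves from the pure power-law closed form. The risk model, the form of the correction, and all constants are assumptions made for illustration.

```python
import numpy as np

def argmin_on_grid(risk, lo=1.0, hi=60.0, num=600_001):
    """Numerical minimizer of a risk curve over a fine grid of resolutions."""
    grid = np.linspace(lo, hi, num)
    return grid[np.argmin(risk(grid))]

def relative_shift(c, n=4000, m=10, A=50.0, tau=0.04, beta=4, r=1):
    """Relative gap between the true argmin and the pure power-law closed form.

    The bias is A * G**(-2*beta) * (1 + c / G); c = 0 recovers the exact power law.
    The factor (1 + c / G) is a hypothetical stand-in for a lower-order term.
    """
    closed = (2.0 * beta * A * n / (r * tau * m)) ** (1.0 / (2.0 * beta + r))
    risk = lambda G: A * G ** (-2.0 * beta) * (1.0 + c / G) + tau * m * G ** r / n
    return (argmin_on_grid(risk) - closed) / closed

# Shift of the minimizer for three sizes of the hypothetical correction.
shifts = {c: relative_shift(c) for c in (0.0, 0.5, 2.0)}
```

In this toy setting a positive correction pushes the true minimizer above the closed-form value, and the gap grows with c. Whether the gap stays within a tolerable relative error is the kind of evidence the comment asks the authors to report.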

minor comments (2)

[Abstract / §5] The abstract states that KORE “matches exhaustive 3-fold cross-validation,” yet the main text does not tabulate the distribution of |G_KORE – G_CV| or the relative excess risk; adding a supplementary table with these quantities (mean, median, 90th percentile) would strengthen the empirical claim.
[§4] Notation for the interaction order and effective density is introduced without an explicit forward reference; a short definitional paragraph early in §4 would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the scope of our theoretical claims and the need for additional empirical safeguards. We address each major point below and will incorporate revisions to strengthen the presentation of the asymptotic justification and the finite-sample behavior of the two-pilot procedure.

Referee: [§3] §3 (derivation of closed-form G*): the central step equates squared bias exactly to θ G^{-α} with α taken from the Kolmogorov n-width of the smoothness class. Kolmogorov n-width supplies only the asymptotic minimax rate; for a concrete spline subspace and fixed target the realized bias can contain G-dependent constants, log factors, or slower decay. Because the subsequent plug-in G* and all scaling-law claims rest on this exact power-law identity, the manuscript must either derive a bound showing that lower-order terms do not change the location of the minimizer or demonstrate empirically that the recovered G* remains within a small relative error of the true risk minimizer across the tested regimes.

Authors: We agree that the Kolmogorov n-width result is asymptotic. The manuscript invokes the leading-term rate because, for the Sobolev-type classes and spline spaces considered, standard approximation-theory bounds show that the bias is θ G^{-α} (1 + o(1)) uniformly over the function class once G exceeds a modest threshold; the location of the risk minimizer is therefore insensitive to lower-order terms for the sample sizes and resolutions used in the experiments. Nevertheless, to make this explicit we will revise §3 to state the asymptotic character of the identity, add a short appendix deriving a sufficient condition under which the o(1) term cannot shift the argmin by more than a relative factor of 1+ε, and include an empirical panel (new Figure) that plots, for each synthetic regime, the relative error |G*_KORE - G*_CV| / G*_CV together with the fraction of trials in which this error exceeds 10 %. These additions will be marked as addressing the referee’s concern. revision: yes
Referee: [§4.2] §4.2 (two-pilot calibration): the 2×2 leverage-calibrated system solves for bias and noise scales from two pilot resolutions fitted to the same data later used for evaluation. While the paper states a consistency guarantee as n→∞, for finite samples the procedure is partly data-dependent; the manuscript should report the sensitivity of the final G* to the choice of the two pilot resolutions and quantify how often the plug-in G* deviates from the CV optimum by more than a stated tolerance on the synthetic suite.

Authors: We will expand §4.2 with a sensitivity table that varies the two pilot resolutions over a grid of plausible values (e.g., G1 ∈ {5,10,15}, G2 ∈ {20,30,40}) and reports, for each synthetic configuration, both the standard deviation of the resulting G* and the empirical frequency with which |G*_KORE - G*_CV| / G*_CV > 0.15. The same panel will also show the distribution of the leave-one-out certificate error. These diagnostics will be added as a new subsection and will be referenced in the main text when the two-pilot procedure is introduced. revision: yes
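
The agreement statistics discussed in the report and the responses (mean, median, and 90th percentile of the relative error, plus the fraction of trials above a tolerance) could be computed along the following lines. The function name and the input values are hypothetical; the numbers are made up solely to exercise the code and are not results from the paper.

```python
import numpy as np

def resolution_agreement(g_kore, g_cv, tol=0.15):
    """Summarize |G_KORE - G_CV| / G_CV across trials."""
    g_kore = np.asarray(g_kore, dtype=float)
    g_cv = np.asarray(g_cv, dtype=float)
    rel_err = np.abs(g_kore - g_cv) / g_cv
    return {
        "mean": float(rel_err.mean()),
        "median": float(np.median(rel_err)),
        "p90": float(np.percentile(rel_err, 90)),
        "frac_exceeding": float(np.mean(rel_err > tol)),
    }

# Made-up selections for five trials, purely to exercise the function.
summary = resolution_agreement(g_kore=[10, 12, 9, 20, 8], g_cv=[10, 10, 10, 16, 8])
```

Reporting the exceedance fraction alongside the percentiles separates a selector that is slightly off everywhere from one that is usually exact but occasionally far off.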

Circularity Check

0 steps flagged

No significant circularity; closed-form follows from external Kolmogorov n-width, explicit dimension, and PRESS identity

The derivation balances an assumed exact power-law bias term (Kolmogorov n-width of the smoothness class, an external result from approximation theory), an explicit polynomial expression for basis dimension in G, and the standard PRESS identity for leave-one-out variance. KORE estimates the two scale constants θ and σ² from two pilot fits via a 2×2 system and plugs them into the already-derived analytic minimizer; this is calibration to apply the formula rather than a fitted input renamed as prediction. No self-citation chains, uniqueness theorems, or ansatzes imported from prior author work appear in the provided text. The central claim therefore remains independent of its own outputs and is self-contained against the stated external ingredients and identities.
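
The calibrate-then-plug-in step described above can be sketched as a 2×2 linear solve. The sketch assumes the leave-one-out error at resolution G is approximately A G^(-2β) + τ (1 + m G^r / n), with A and τ playing the roles of θ and σ². The leverage calibration the paper mentions is not specified in this summary and is omitted, and all names and constants are hypothetical.

```python
import numpy as np

def calibrate_scales(G1, G2, loo1, loo2, n, m, beta, r):
    """Solve a 2x2 linear system for the bias scale A and noise scale tau.

    Assumed model for the leave-one-out mean squared error at resolution G:
        loo(G) ~= A * G**(-2*beta) + tau * (1 + m * G**r / n)
    """
    M = np.array([
        [G1 ** (-2.0 * beta), 1.0 + m * G1 ** r / n],
        [G2 ** (-2.0 * beta), 1.0 + m * G2 ** r / n],
    ])
    A, tau = np.linalg.solve(M, np.array([loo1, loo2]))
    return A, tau

def plug_in_resolution(A, tau, n, m, beta, r):
    """Closed-form minimizer of the assumed curve, evaluated at the calibrated scales."""
    return (2.0 * beta * A * n / (r * tau * m)) ** (1.0 / (2.0 * beta + r))

# Synthetic check: generate the two pilot scores from known scales, then recover them.
n, m, beta, r = 4000, 10, 4, 1
A_true, tau_true = 50.0, 0.04
curve = lambda G: A_true * G ** (-2.0 * beta) + tau_true * (1.0 + m * G ** r / n)

A_hat, tau_hat = calibrate_scales(3.0, 8.0, curve(3.0), curve(8.0), n, m, beta, r)
g_plug_in = plug_in_resolution(A_hat, tau_hat, n, m, beta, r)
```

With noiseless pilot scores the two scales are recovered exactly. With real leave-one-out scores they carry sampling error, which is why the sensitivity of the plug-in resolution to the choice of pilots matters.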

Axiom & Free-Parameter Ledger

2 free parameters · 3 axioms · 0 invented entities

The central claim rests on three classical results (Kolmogorov n-width bias power, polynomial basis dimension in G, PRESS identity for LOO) plus the modeling choice that interaction order replaces ambient dimension; the two pilot fits introduce fitted bias and noise scales that are not supplied by prior literature.

free parameters (2)

bias scale: estimated by solving the 2x2 system from the two pilot resolutions in KORE
noise scale: estimated by solving the 2x2 system from the two pilot resolutions in KORE

axioms (3)

domain assumption: squared bias equals a known power of resolution G, exactly the Kolmogorov n-width of the smoothness class. Invoked to pin bias to a known function of G.
standard math: basis dimension is an explicit polynomial in G. Required for the explicit variance term.
standard math: leave-one-out error is obtained from a single fit via the PRESS identity. Enables an analytic variance curve without refitting.

pith-pipeline@v0.9.1-grok · 5882 in / 1688 out tokens · 38655 ms · 2026-06-26T08:49:39.284801+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

No 3D Matrices: A Unified Tensor-Product View of Matrix-Free Cartesian PDE Solvers
math.NA 2026-06 unverdicted novelty 2.0

Cartesian 3D PDE operators factor exactly into 1D line kernels via Kronecker algebra, yielding O(N) cost and O(Nx+Ny+Nz) storage for any fixed stencil or polynomial degree.

Reference graph

Works this paper leans on

83 extracted references · 27 canonical work pages · cited by 1 Pith paper

[1]

, title =

Allen, David M. , title =. Technometrics , volume =. 1974 , doi =

1974
[3]

Numerische Mathematik , volume =

Craven, Peter and Wahba, Grace , title =. Numerische Mathematik , volume =. 1979 , doi =

1979
[4]

de Boor, Carl , title =
[5]

, title =

Friedman, Jerome H. , title =. The Annals of Statistics , volume =. 1991 , doi =

1991
[6]

and Heath, Michael and Wahba, Grace , title =

Golub, Gene H. and Heath, Michael and Wahba, Grace , title =. Technometrics , volume =. 1979 , doi =

1979
[7]

2013 , doi =

Gu, Chong , title =. 2013 , doi =

2013
[8]

and Tibshirani, Robert J

Hastie, Trevor J. and Tibshirani, Robert J. , title =
[9]

, title =

Huang, Jianhua Z. , title =. The Annals of Statistics , volume =. 1998 , doi =

1998
[10]

Journal of the Royal Statistical Society: Series B , volume =

Ravikumar, Pradeep and Lafferty, John and Liu, Han and Wasserman, Larry , title =. Journal of the Royal Statistical Society: Series B , volume =. 2009 , doi =

2009
[11]

, title =

Mallows, Colin L. , title =. Technometrics , volume =. 1973 , doi =

1973
[12]

IEEE Transactions on Automatic Control , volume =

Akaike, Hirotugu , title =. IEEE Transactions on Automatic Control , volume =. 1974 , doi =

1974
[13]

The Annals of Statistics , volume =

Schwarz, Gideon , title =. The Annals of Statistics , volume =. 1978 , doi =

1978
[14]

Journal of the Royal Statistical Society: Series B , volume =

Stone, Mervyn , title =. Journal of the Royal Statistical Society: Series B , volume =. 1974 , doi =

1974
[15]

, title =

Schumaker, Larry L. , title =. 2007 , doi =

2007
[16]

, title =

Kolmogorov, Andrey N. , title =. Annals of Mathematics , volume =
[17]

1985 , doi =

Pinkus, Allan , title =. 1985 , doi =

1985
[18]

and Micchelli, Charles A

Melkman, Avraham A. and Micchelli, Charles A. , title =. Illinois Journal of Mathematics , volume =. 1978 , doi =

1978
[19]

, title =

Stone, Charles J. , title =. The Annals of Statistics , volume =. 1982 , doi =

1982
[20]

, title =

Stone, Charles J. , title =. The Annals of Statistics , volume =. 1985 , doi =

1985
[21]

, title =

Wood, Simon N. , title =. 2017 , doi =

2017
[23]

Advances in Neural Information Processing Systems , volume =

Hoffmann, Jordan and Borgeaud, Sebastian and Mensch, Arthur and Buchatskaya, Elena and Cai, Trevor and Rutherford, Eliza and de Las Casas, Diego and Hendricks, Lisa Anne and Welbl, Johannes and Clark, Aidan and others , title =. Advances in Neural Information Processing Systems , volume =. 2022 , url =

2022
[24]

AutoML Conference 2023 (Workshop Track) , year =

Fischer, Sebastian Felix and Feurer, Matthias and Bischl, Bernd , title =. AutoML Conference 2023 (Workshop Track) , year =

2023
[25]

Prediction of full load electrical power output of a base load operated combined cycle power plant using machine learning methods , journal =

T. Prediction of full load electrical power output of a base load operated combined cycle power plant using machine learning methods , journal =. 2014 , doi =

2014
[26]

pyGAM: Generalized Additive Models in

Serv. pyGAM: Generalized Additive Models in. Zenodo , year =
[27]

L. Grinsztajn. Why do tree-based models still outperform deep learning on typical tabular data? 2022.
[44]

Chong Gu. Smoothing Spline ANOVA Models. Springer, 2nd edition, 2013. doi:10.1007/978-1-4614-5369-7
[45]

Carl de Boor. A Practical Guide to Splines, volume 27 of Applied Mathematical Sciences. Springer, revised edition, 2001.
[46]

Yi Lin and Hao Helen Zhang. Component selection and smoothing in multivariate nonparametric regression. The Annals of Statistics, 34(5):2272–2297, 2006. doi:10.1214/009053606000000722
[47]

Yong Yi Bay and Kathleen A. Yearick. Machine learning vs deep learning: The generalization problem. arXiv preprint arXiv:2403.01621, 2024
[48]

Jacob Bien, Jonathan Taylor, and Robert Tibshirani. A lasso for hierarchical interactions. The Annals of Statistics, 41(3):1111–1141, 2013. doi:10.1214/13-AOS1096
[49]

Simon N. Wood. Generalized Additive Models: An Introduction with R. Chapman and Hall/CRC, 2nd edition, 2017. doi:10.1201/9781315370279
[50]

Simon N. Wood. Thin plate regression splines. Journal of the Royal Statistical Society: Series B, 65(1):95–114, 2003. doi:10.1111/1467-9868.00374
[51]

Gene H. Golub, Michael Heath, and Grace Wahba. Generalized cross-validation as a method for choosing a good ridge parameter. Technometrics, 21(2):215–223, 1979. doi:10.1080/00401706.1979.10489751
[52]

Chun-Hao Chang, Rich Caruana, and Anna Goldenberg. NODE-GAM: Neural generalized additive model for interpretable deep learning. In International Conference on Learning Representations, 2022. URL https://arxiv.org/abs/2106.01613
[53]

Grace Wahba. Spline Models for Observational Data, volume 59 of CBMS-NSF Regional Conference Series in Applied Mathematics. SIAM, 1990. doi:10.1137/1.9781611970128
[54]

David M. Allen. The relationship between variable selection and data augmentation and a method for prediction. Technometrics, 16(1):125–127, 1974. doi:10.1080/00401706.1974.10489157
[55]

Mervyn Stone. Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society: Series B, 36(2):111–133, 1974. doi:10.1111/j.2517-6161.1974.tb00994.x
[56]

Charles J. Stone. Optimal global rates of convergence for nonparametric regression. The Annals of Statistics, 10(4):1040–1053, 1982. doi:10.1214/aos/1176345969
[57]

Charles J. Stone. Additive regression and other nonparametric models. The Annals of Statistics, 13(2):689–705, 1985. doi:10.1214/aos/1176349548
[58]

Jianhua Z. Huang. Projection estimation in multiple regression with application to functional ANOVA models. The Annals of Statistics, 26(1):242–272, 1998. doi:10.1214/aos/1030563984
[59]

Giampiero Marra and Simon N. Wood. Practical variable selection for generalized additive models. Computational Statistics and Data Analysis, 55(7):2372–2387, 2011. doi:10.1016/j.csda.2011.02.004
[60]

Paul H. C. Eilers and Brian D. Marx. Flexible smoothing with B-splines and penalties. Statistical Science, 11(2):89–121, 1996. doi:10.1214/ss/1038425655
[61]

Allan Pinkus. n-Widths in Approximation Theory. Ergebnisse der Mathematik und ihrer Grenzgebiete. Springer, 1985. doi:10.1007/978-3-642-69894-1
[62]

Peter Craven and Grace Wahba. Smoothing noisy data with spline functions: Estimating the correct degree of smoothing by the method of generalized cross-validation. Numerische Mathematik, 31(4):377–403, 1979. doi:10.1007/BF01404567
[63]

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020
[64]

Allan H. Murphy. Skill scores based on the mean square error and their relationships to the correlation coefficient. Monthly Weather Review, 116(12):2417–2424, 1988. URL https://journals.ametsoc.org/view/journals/mwre/116/12/1520-0493_1988_116_2417_ssbotm_2_0_co_2.xml
[65]

Daniel Servén and Charlie Brummitt. pyGAM: Generalized additive models in Python. Zenodo, 2018. doi:10.5281/zenodo.1208723
[66]

Trevor J. Hastie and Robert J. Tibshirani. Generalized Additive Models. Chapman and Hall, 1990.
[67]

Hirotugu Akaike. A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19(6):716–723, 1974. doi:10.1109/TAC.1974.1100705
[68]

Wolfgang Härdle, Peter Hall, and James S. Marron. How far are automatically chosen regression smoothing parameters from their optimum? Journal of the American Statistical Association, 83(401):86–95, 1988. doi:10.1080/01621459.1988.10478568
[69]

Janez Demšar. Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 7:1–30, 2006. URL https://jmlr.org/papers/v7/demsar06a.html
[70]

Colin L. Mallows. Some comments on C_P. Technometrics, 15(4):661–675, 1973. doi:10.1080/00401706.1973.10489103
[71]

Gideon Schwarz. Estimating the dimension of a model. The Annals of Statistics, 6(2):461–464, 1978. doi:10.1214/aos/1176344136
[72]

Stefan Falkner, Aaron Klein, and Frank Hutter. BOHB: Robust and efficient hyperparameter optimization at scale. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 1437–1446. PMLR, 2018. URL https://proceedings.mlr.press/v80/falkner18a.html
[73]

Rishabh Agarwal, Levi Melnick, Nicholas Frosst, Xuezhou Zhang, Ben Lengerich, Rich Caruana, and Geoffrey E. Hinton. Neural additive models: Interpretable machine learning with neural nets. In Advances in Neural Information Processing Systems, volume 34, 2021. URL https://arxiv.org/abs/2004.13912
[74]

Avraham A. Melkman and Charles A. Micchelli. Spline spaces are optimal for L^2 n-width. Illinois Journal of Mathematics, 22(4):541–564, 1978. doi:10.1215/ijm/1256048466
[75]

Sebastian Felix Fischer, Matthias Feurer, and Bernd Bischl. OpenML-CTR23: A curated tabular regression benchmarking suite. In AutoML Conference 2023 (Workshop Track), 2023. URL https://openreview.net/forum?id=HebAOoMm94
[76]

Nick Erickson, Jonas Mueller, Alexander Shirkov, Hang Zhang, Pedro Larroy, Mu Li, and Alexander Smola. AutoGluon-Tabular: Robust and accurate AutoML for structured data. arXiv preprint arXiv:2003.06505, 2020
[77]

Pınar Tüfekci. Prediction of full load electrical power output of a base load operated combined cycle power plant using machine learning methods. International Journal of Electrical Power & Energy Systems, 60:126–140, 2014. doi:10.1016/j.ijepes.2014.02.027
[78]

James Bergstra and Yoshua Bengio. Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13:281–305, 2012. URL https://jmlr.org/papers/v13/bergstra12a.html
[79]

Pieter Gijsbers, Marcos L. P. Bueno, Stefan Coors, Erin LeDell, Sébastien Poirier, Janek Thomas, Bernd Bischl, and Joaquin Vanschoren. AMLB: an AutoML benchmark. Journal of Machine Learning Research, 25(101):1–65, 2024. URL https://jmlr.org/papers/v25/22-0493.html
[80]

Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models. Advances in Neural Information Processing Systems, 35, 2022. URL https://arxiv.org/abs/2203.15556
[81]

Jerome H. Friedman. Multivariate adaptive regression splines. The Annals of Statistics, 19(1):1–67, 1991. doi:10.1214/aos/1176347963
[82]

Peter Radchenko and Gareth M. James. Variable selection using adaptive nonlinear interaction structures in high dimensions. Journal of the American Statistical Association, 105(492):1541–1553, 2010. doi:10.1198/jasa.2010.tm10130
[83]

Larry L. Schumaker. Spline Functions: Basic Theory. Cambridge University Press, 3rd edition, 2007. doi:10.1017/CBO9780511618994
