
arXiv: 2606.23575 · v1 · pith:SOWARJJDnew · submitted 2026-06-22 · cs.LG · cs.AI · cs.NA · math.NA · stat.ML

Solve for the Hyperparameter, Skip the Search: Kolmogorov-Optimal Scaling Laws for Spline Regression

Pith reviewed 2026-06-26 08:49 UTC · model grok-4.3

classification cs.LG · cs.AI · cs.NA · math.NA · stat.ML
keywords spline regression · hyperparameter tuning · Kolmogorov n-width · scaling laws · ANOVA decomposition · PRESS identity · resolution estimation · leave-one-out cross-validation

The pith

The optimal resolution for spline regression follows from a closed-form expression built from Kolmogorov widths and a single-fit error estimate, with no grid search required.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that the hyperparameter controlling spline resolution need not be found by trying many values and scoring them with cross-validation. Classical approximation theory fixes the squared bias as a precise power of resolution, the number of basis functions is a known polynomial in that resolution, and the leave-one-out error can be recovered from one fit via the PRESS identity. Balancing these two explicit curves produces the minimizing resolution directly. The same calculus extends to high-dimensional inputs by replacing ambient dimension with the order of active interactions in an ANOVA decomposition, producing scaling laws in effective sample density that contain no input dimension in the exponent. The resulting KORE procedure uses two pilot resolutions to calibrate the unknown scales and then evaluates the closed-form solution with a leave-one-out certificate.
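The balance described above can be written out as a short derivation. This is a sketch under an assumed risk model, not the paper's exact statement: the symbols A (bias scale), τ (noise scale), β (smoothness), r (interaction order), and ρ (effective density) follow the notation visible in the figure captions, and the assumption that the variance term grows like G^r / ρ is made here for illustration.

```latex
\begin{aligned}
R(G) &= A\,G^{-2\beta} + \frac{\tau\,G^{r}}{\rho}, \\
\frac{dR}{dG} &= -2\beta A\,G^{-2\beta-1} + \frac{r\,\tau\,G^{r-1}}{\rho} = 0, \\
G^{\star} &= \left(\frac{2\beta A\,\rho}{r\,\tau}\right)^{1/(2\beta+r)}, \qquad
R(G^{\star}) \propto \rho^{-2\beta/(2\beta+r)}.
\end{aligned}
```

Under this model the RMSE scales as ρ^{-β/(2β+r)}. Taking β = 4 (an assumption here, consistent with β = k + 1 for cubic splines as in the Figure 23 caption) gives exponents −4/9 for r = 1 and −4/10 for r = 2, which agree with the reference slopes quoted in the Figure 2 caption.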

Core claim

The optimal resolution G* is the analytic minimizer obtained by setting the derivative of the estimated risk (Kolmogorov-powered bias plus PRESS-derived variance) to zero. When ambient input dimension is replaced by the interaction order r of an ANOVA decomposition, the same minimizer yields power-law scaling of both the optimal G and the risk in the effective density, n divided by the number of active r-way terms, with the ambient dimension d absent from the exponent.
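To make the dimension-free claim concrete, the following minimal sketch tabulates the exponents and shows that the closed-form resolution depends on sample size and active-term count only through their ratio. It assumes a risk model of the form A·G^(−2β) + τ·G^r/ρ, which is an illustrative simplification; the function names and the numeric values of A, τ, and β are hypothetical, not taken from the paper.

```python
from fractions import Fraction


def scaling_exponents(beta: int, r: int) -> dict:
    """Exponents of effective density rho under the assumed risk model
    R(G) = A * G**(-2*beta) + tau * G**r / rho."""
    denom = 2 * beta + r
    return {
        "G_star": Fraction(1, denom),        # G* grows like rho**(1/(2*beta + r))
        "risk": Fraction(-2 * beta, denom),  # risk decays like rho**(-2*beta/(2*beta + r))
        "rmse": Fraction(-beta, denom),      # RMSE is the square root of risk
    }


def closed_form_resolution(A: float, tau: float, beta: float, r: int, rho: float) -> float:
    """Analytic minimizer of the assumed risk model (hypothetical helper)."""
    return (2.0 * beta * A * rho / (r * tau)) ** (1.0 / (2.0 * beta + r))


additive = scaling_exponents(beta=4, r=1)  # additive targets, order r = 1
pairwise = scaling_exponents(beta=4, r=2)  # pairwise targets, order r = 2
print("RMSE exponents:", additive["rmse"], pairwise["rmse"])

# Same effective density rho = n / (number of active terms), different problem sizes.
g_small = closed_form_resolution(A=3.0, tau=0.25, beta=4.0, r=1, rho=6000 / 10)
g_large = closed_form_resolution(A=3.0, tau=0.25, beta=4.0, r=1, rho=48000 / 80)
print("G* at rho = 600:", g_small, g_large)
```

The two resolutions printed on the last line coincide because only the ratio enters the formula; the number of inputs never appears on its own.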

What carries the argument

The Kolmogorov n-width of the smoothness class, which supplies the exact power-law form of squared bias as a function of resolution G.

If this is right

  • Optimal resolution and risk become explicit power functions of sample size per active interaction component.
  • Ambient input dimension drops out of the exponent once interaction order is fixed.
  • KORE performs roughly eight times fewer fits than a full grid sweep while matching exhaustive 3-fold cross-validation.
  • On 36 real tabular datasets the method ranks first among 21 methods in accuracy per unit of compute.
  • The closed-form selector matches the accuracy of the classical full-grid criteria (GCV, Mallows' Cp, AIC, BIC).

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If approximation rates are known for other bases, the same two-pilot calibration could replace search for their resolution or bandwidth parameters.
  • The interaction-order reduction suggests that problems whose complexity lives in low-order terms remain tractable even when ambient dimension grows.
  • The leave-one-out certificate after the closed-form evaluation supplies a cheap consistency check that could be used to decide whether a third pilot fit is warranted.

Load-bearing premise

Squared bias equals exactly a known power of resolution G given by the Kolmogorov n-width of the smoothness class.

What would settle it

On data drawn from a function whose smoothness class is known in advance, compare the closed-form resolution against the resolution that actually minimizes out-of-sample error; a large systematic difference falsifies the claim.

Figures

Figures reproduced from arXiv: 2606.23575 by Kathleen A. Yearick, Yong Yi Bay.

Figure 1. Law-driven versus search-driven resolution selection. Cross-validation evaluates every grid candidate (clay dots). KORE fits two pilot resolutions, identifies the bias and variance scale constants, and reports the closed-form optimum (navy star). The dashed curves show how squared bias (falling) and variance (rising) compose the U-shaped error curve.
Figure 2. Effective-density collapse. The horizontal axis is ρ; the vertical axis is test RMSE at the KORE-selected G*. Panel (a) shows additive targets with ρ = n/d and reference slope ρ^{-4/9}. Panel (b) shows sparse pairwise targets with ρ = n/s and reference slope ρ^{-4/10}. The four dimensions d ∈ {10, 20, 40, 80} collapse onto a single curve.
Figure 3. Bias-scale recovery as predicted by Theorem 2: median and interquartile band of the ratio Â_f / A_f versus sample size n along a geometric ladder, with the population value 1 marked.
Figure 4. Noise-scale recovery as predicted by Theorem 2: median and interquartile band of τ̂_f / τ_f versus sample size n along the same geometric ladder, with the population value 1 marked.
Figure 5. Plug-in continuous optimizer Ĝ_f^† versus the anchored population target G_f^• along the sample-size ladder. Shaded band: ±1σ across 20 seeds; the diamond marker is the anchored target at the largest n.
Figure 6. Selection cost across six controlled tasks (three additive, three sparse pairwise). Markers report the number of model fits used by KORE, the four full-grid criteria (GCV, Cp, AIC, BIC), and exhaustive 3-fold cross-validation.
Figure 7. Per-task accuracy parity on the same six tasks: ratio of KORE test RMSE to exhaustive 3-fold cross-validation, sorted by family. Bars at or below 1.0 favor the closed-form selector.
Figure 8. Geometric-mean RMSE ratio versus exhaustive cross-validation, aggregated across all six controlled tasks, for the five selectors compared in this section. Bars to the left of 1.0 match or beat exhaustive search.
Figure 9. Nine law-aligned benchmark equations. Panel (a) gives the per-task RMSE ratio of KORE against 3-fold CV. Panel (b) gives the corresponding fit-count reduction (CV fits divided by KORE fits). Panel (c) gives the geometric-mean RMSE ratio against CV across the nine tasks for the five selectors.
Figure 10. Forest plot of mean test RMSE per benchmark across the full 12-equation suite, ranked by KORE / CV ratio. Markers compare KORE (closed-form), exhaustive 3-fold cross-validation, and GCV. Raw RMSE ratios are tabulated in …
Figure 11. Nemenyi critical-difference diagram on Compute-Normalized Lift over OLS, CNL_α = max{0, max(0, R²) − max(0, R²_OLS)} / (1 + t)^α at α = 1, across all 21 methods with complete five-seed coverage on every one of the 36 datasets. A lower mean rank means greater OLS-relative lift per unit compute; methods connected by a horizontal bar are statistically indistinguishable at α_Nemenyi = 0.05. OLS itself has CNL identi…
Figure 12. Per-method paired Wilcoxon signed-rank test of CNL_{KORE,d,c} − CNL_{m,d,c} against zero at α = 1, paired across (dataset, seed) and Holm-Bonferroni corrected over the 20-method family. Bars right of the dashed reference are statistically distinguishable from KORE at p_Holm = 0.05; blue marks methods where KORE has the higher Compute-Normalized Lift over OLS, red marks the (none in this panel) where the competit…
Figure 13. Per-dataset Compute-Normalized Lift, KORE versus k-NN, on the full 36-dataset suite. The dashed diagonal is y = x. Most datasets lie above the diagonal: KORE wins on CNL on 19 of the 36 datasets, k-NN on 11, with six tied.
Figure 14. KORE mean Friedman rank stratified by training-set-size quartile (lower is better). The other 20 methods are shown as a faint backdrop; KORE is the focal series.
Figure 15. KORE per-dataset Friedman rank versus post-one-hot dimension d. The dashed vertical reference marks the pre-registered d = 30 cutoff. Median rank in each region is reported as a horizontal segment.
Figure 16. Per-dataset Compute-Normalized Lift log-ratio against each classical spline resolution selector: each point is log10(CNL_KORE / CNL_competitor) on a single dataset. Points right of zero favor KORE; black diamonds mark medians; y-tick parentheticals count (KORE wins/datasets).
Figure 17. Efficiency-accuracy Pareto frontier on the smooth-low-d subset of 25 datasets (post-one-hot d ≤ 30). Both axes are ratios against KORE on log scales, so KORE sits at (1, 1). Markers are colored by family; frontier methods are drawn at full opacity and dominated methods recede. KORE sits at the elbow.
Figure 18. Sensitivity of the Compute-Normalized Lift verdict to the compute weight α in δ_{m,d,c}(α) = CNL_α(KORE) − CNL_α(m). Panel (a): paired Wilcoxon test at α = 1, Holm-Bonferroni corrected. Panel (b): count of competitors with significantly higher and significantly lower CNL than KORE as α sweeps {0, 0.25, 0.5, 1, 2}; at α = 0 (pure lift over OLS) the panel splits 9-9 between boosters that win on raw OLS-relative …
Figure 19. Empirical confirmation of Proposition 2. Empirical standard deviation of Â_f (left) and τ̂_f (right) across 100 noise replicates per training size n, against the 1/√n envelope predicted by the proposition. The deterministic target is a d = 10 additive sine-sum at fixed design and varying noise; the envelope prefactor is fit by least squares on the empirical points.
Figure 20. Residual graph discovery. Once the sample budget reaches roughly n/s ≥ 240, the discovered graph matches the oracle graph almost exactly and the resulting RMSE is indistinguishable from the oracle-graph model.
Figure 21. Robustness across four conditions: Control (correct smooth low-order structure), 3-way (genuine 3-way interactions), Non-smooth (non-smooth target), and Wrong graph (approximate interaction graph). Numbers 12 and 24 denote input dimension d. Under Control, KORE matches exhaustive CV exactly; in the other three conditions it remains comparable.
Figure 22. Scaling with input dimension from d = 10 to d = 80. Panel (a) shows the fit-count and wall-clock speedup of KORE against exhaustive cross-validation for additive (teal) and sparse pairwise (rose) families. Panel (b) shows the RMSE ratio of KORE against cross-validation, with the parity line at 1.0 for reference.
Figure 23. Degree ablation in the interior-optimum regime (d = 20, 10% training noise, five seeds per cell). Each marker is the mean closed-form plug-in resolution Ĝ^† as a function of effective density ρ = n/d. Dotted lines show the predicted power law ρ^{1/(2β+1)} at each spline degree, with β = k + 1.
Figure 24. Per-dataset Compute-Normalized Lift ratio against KORE on the full 36-dataset OpenML-CTR23 plus UCI suite, for the four strongest CNL competitors (k-NN, kernel ridge, BIC-tuned splines, GCV-tuned splines). Markers are medians across five seeds; markers right of 1× favor KORE, markers left favor the competitor.
Figure 25. Per-dataset Compute-Normalized Lift ratio on the smooth-low-d subset (post-one-hot dimension at most 30), restricted to the regime in which the closed-form law is calibrated. Same markers and conventions as …
Figure 26. Per-cell peak resident-set size on the real-world benchmark, one row per method: the line runs from the method's median (open marker) to its maximum (filled marker) across all datasets, sorted by the maximum and colored by family, with the soft 8 GiB per-cell cap as the dashed reference. Every method's median cell is small; only the classical full-grid spline selectors carry a tail past the cap on the hig…
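The Compute-Normalized Lift defined in the Figure 11 caption can be computed directly. A minimal sketch, assuming t is a non-negative compute cost such as wall-clock time (the caption as extracted here does not state its units) and using a hypothetical function name:

```python
def compute_normalized_lift(r2: float, r2_ols: float, t: float, alpha: float = 1.0) -> float:
    """CNL_alpha = max{0, max(0, R2) - max(0, R2_OLS)} / (1 + t)**alpha.

    r2     : test R^2 of the method being scored
    r2_ols : test R^2 of the OLS baseline on the same split
    t      : compute cost (assumed non-negative; units are an assumption here)
    alpha  : compute weight; alpha = 0 ignores cost entirely
    """
    # Clip both scores at zero, then clip the lift itself at zero.
    lift = max(0.0, max(0.0, r2) - max(0.0, r2_ols))
    return lift / (1.0 + t) ** alpha


# A method that merely copies OLS earns zero lift regardless of its speed.
print(compute_normalized_lift(r2=0.5, r2_ols=0.5, t=0.0))
# A method with higher R^2 is discounted by its compute cost.
print(compute_normalized_lift(r2=0.9, r2_ols=0.5, t=1.0))
```

Setting alpha to 0 recovers the raw lift over OLS, which is the α = 0 end of the sweep described in the Figure 18 caption.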
Original abstract

Hyperparameter tuning almost always means search: fit the model at every value on a grid, score each by cross-validation, and keep the winner. For spline regression that search is unnecessary. The optimal resolution can be solved for in closed form, to the accuracy an exhaustive search reaches, at a fraction of the compute. Three ingredients make this possible: classical approximation theory pins the squared bias to a known power of the resolution G, exactly the Kolmogorov n-width of the smoothness class; the basis dimension is an explicit polynomial in G; and leave-one-out error follows from a single fit via the PRESS identity. Balancing the two known curves gives the minimizer analytically. We extend this calculus to many coordinates by replacing ambient input dimension with interaction order, the number of active low-order components in an ANOVA decomposition, yielding a scaling law in which the optimal resolution and error are power functions of the effective density (sample size per active component), with input dimension absent from the exponent. The law becomes an algorithm. KORE (Kolmogorov-optimal Order-aware Resolution Estimation) fits two pilot resolutions, solves a leverage-calibrated 2x2 system for the bias and noise scales, and evaluates the closed-form plug-in resolution with a tiny leave-one-out certificate: about a dozen fits instead of a full grid sweep, with a consistency guarantee as the sample grows. Across additive and sparse pairwise targets up to 80 input dimensions, KORE matches exhaustive 3-fold cross-validation and the full classical ladder (GCV, Mallows' Cp, AIC, BIC) while fitting roughly 8x fewer models; on 36 real tabular datasets it ranks first among 21 methods in accuracy per unit of compute, ahead of tuned boosters and kernel machines. When complexity lives in low interaction order, solving for the resolution beats searching for it.
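The two-pilot step in the abstract can be illustrated with a small linear solve. The sketch below is a simplified stand-in, not the paper's procedure: it assumes the risk at resolution G has the form A·G^(−2β) + τ·G^r/ρ, omits the leverage calibration the abstract mentions, and feeds in noise-free synthetic risk values. All function names and numeric values are hypothetical.

```python
import numpy as np


def calibrate_scales(G_pilots, risks, beta, r, rho):
    """Solve the 2x2 linear system for the bias scale A and noise scale tau.

    Each pilot contributes one equation: risk_i = A * G_i**(-2*beta) + tau * G_i**r / rho.
    """
    G = np.asarray(G_pilots, dtype=float)
    M = np.column_stack([G ** (-2.0 * beta), G ** r / rho])
    A, tau = np.linalg.solve(M, np.asarray(risks, dtype=float))
    return A, tau


def plug_in_resolution(A, tau, beta, r, rho):
    """Closed-form minimizer of the assumed risk model."""
    return (2.0 * beta * A * rho / (r * tau)) ** (1.0 / (2.0 * beta + r))


# Synthetic check: generate exact risks at two pilots, then recover the scales.
beta, r, rho = 2.0, 1, 500.0
A_true, tau_true = 3.0, 0.25
pilots = [4.0, 12.0]
risks = [A_true * g ** (-2 * beta) + tau_true * g ** r / rho for g in pilots]

A_hat, tau_hat = calibrate_scales(pilots, risks, beta, r, rho)
G_hat = plug_in_resolution(A_hat, tau_hat, beta, r, rho)
print(f"A_hat = {A_hat:.4f}, tau_hat = {tau_hat:.4f}, plug-in G = {G_hat:.3f}")
```

With real data the two risk values would be noisy estimates, so the recovered scales and the plug-in resolution would carry sampling error; the leave-one-out certificate described in the abstract is what guards against that.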

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that for spline regression the optimal resolution G can be obtained in closed form by balancing an exact power-law squared-bias term taken from the Kolmogorov n-width of the smoothness class against an explicit polynomial basis dimension and a PRESS-derived variance term; the resulting KORE algorithm estimates the two unknown scales from two pilot fits, plugs the closed-form G* into the model, and achieves accuracy comparable to exhaustive 3-fold CV or classical criteria (GCV, Cp, AIC, BIC) while using roughly 8× fewer fits. The method is extended to high-dimensional additive and sparse pairwise targets by replacing ambient dimension with interaction order, producing scaling laws in which optimal resolution and risk depend only on effective density.

Significance. If the analytic minimizer is reliable, the work supplies a low-compute, theoretically grounded alternative to grid search for spline hyper-parameters and a dimension-free scaling law for models whose complexity is governed by low-order interactions. The empirical ranking on 36 real tabular data sets and the consistency guarantee as n grows are concrete strengths.

major comments (2)
  1. [§3] §3 (derivation of closed-form G*): the central step equates squared bias exactly to θ G^{-α} with α taken from the Kolmogorov n-width of the smoothness class. Kolmogorov n-width supplies only the asymptotic minimax rate; for a concrete spline subspace and fixed target the realized bias can contain G-dependent constants, log factors, or slower decay. Because the subsequent plug-in G* and all scaling-law claims rest on this exact power-law identity, the manuscript must either derive a bound showing that lower-order terms do not change the location of the minimizer or demonstrate empirically that the recovered G* remains within a small relative error of the true risk minimizer across the tested regimes.
  2. [§4.2] §4.2 (two-pilot calibration): the 2×2 leverage-calibrated system solves for bias and noise scales from two pilot resolutions fitted to the same data later used for evaluation. While the paper states a consistency guarantee as n→∞, for finite samples the procedure is partly data-dependent; the manuscript should report the sensitivity of the final G* to the choice of the two pilot resolutions and quantify how often the plug-in G* deviates from the CV optimum by more than a stated tolerance on the synthetic suite.
minor comments (2)
  1. [Abstract / §5] The abstract states that KORE “matches exhaustive 3-fold cross-validation,” yet the main text does not tabulate the distribution of |G_KORE – G_CV| or the relative excess risk; adding a supplementary table with these quantities (mean, median, 90th percentile) would strengthen the empirical claim.
  2. [§4] Notation for the interaction order and effective density is introduced without an explicit forward reference; a short definitional paragraph early in §4 would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the scope of our theoretical claims and the need for additional empirical safeguards. We address each major point below and will incorporate revisions to strengthen the presentation of the asymptotic justification and the finite-sample behavior of the two-pilot procedure.

Point-by-point responses
  1. Referee: [§3] §3 (derivation of closed-form G*): the central step equates squared bias exactly to θ G^{-α} with α taken from the Kolmogorov n-width of the smoothness class. Kolmogorov n-width supplies only the asymptotic minimax rate; for a concrete spline subspace and fixed target the realized bias can contain G-dependent constants, log factors, or slower decay. Because the subsequent plug-in G* and all scaling-law claims rest on this exact power-law identity, the manuscript must either derive a bound showing that lower-order terms do not change the location of the minimizer or demonstrate empirically that the recovered G* remains within a small relative error of the true risk minimizer across the tested regimes.

    Authors: We agree that the Kolmogorov n-width result is asymptotic. The manuscript invokes the leading-term rate because, for the Sobolev-type classes and spline spaces considered, standard approximation-theory bounds show that the bias is θ G^{-α} (1 + o(1)) uniformly over the function class once G exceeds a modest threshold; the location of the risk minimizer is therefore insensitive to lower-order terms for the sample sizes and resolutions used in the experiments. Nevertheless, to make this explicit we will revise §3 to state the asymptotic character of the identity, add a short appendix deriving a sufficient condition under which the o(1) term cannot shift the argmin by more than a relative factor of 1+ε, and include an empirical panel (new Figure) that plots, for each synthetic regime, the relative error |G*_KORE - G*_CV| / G*_CV together with the fraction of trials in which this error exceeds 10 %. These additions will be marked as addressing the referee’s concern. revision: yes

  2. Referee: [§4.2] §4.2 (two-pilot calibration): the 2×2 leverage-calibrated system solves for bias and noise scales from two pilot resolutions fitted to the same data later used for evaluation. While the paper states a consistency guarantee as n→∞, for finite samples the procedure is partly data-dependent; the manuscript should report the sensitivity of the final G* to the choice of the two pilot resolutions and quantify how often the plug-in G* deviates from the CV optimum by more than a stated tolerance on the synthetic suite.

    Authors: We will expand §4.2 with a sensitivity table that varies the two pilot resolutions over a grid of plausible values (e.g., G1 ∈ {5,10,15}, G2 ∈ {20,30,40}) and reports, for each synthetic configuration, both the standard deviation of the resulting G* and the empirical frequency with which |G*_KORE - G*_CV| / G*_CV > 0.15. The same panel will also show the distribution of the leave-one-out certificate error. These diagnostics will be added as a new subsection and will be referenced in the main text when the two-pilot procedure is introduced. revision: yes
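The first exchange above turns on how far a lower-order bias term can move the minimizer. The following sketch makes that question numerical under stated assumptions: the bias is taken to be A·G^(−2β)·(1 + c/G), where the c/G correction is a hypothetical stand-in for the lower-order terms the referee describes, and all constants are illustrative rather than drawn from the paper.

```python
import numpy as np


def risk(G, A, tau, beta, r, rho, c=0.0):
    """Assumed risk: leading power-law bias times (1 + c/G), plus a variance term."""
    return A * G ** (-2.0 * beta) * (1.0 + c / G) + tau * G ** r / rho


A, tau, beta, r, rho = 3.0, 0.25, 2.0, 1, 500.0  # illustrative values only
# Closed-form minimizer that ignores the lower-order term (c = 0).
G_closed = (2.0 * beta * A * rho / (r * tau)) ** (1.0 / (2.0 * beta + r))
grid = np.linspace(2.0, 40.0, 38001)  # continuous-G grid with step 0.001

shifts = {}
for c in (0.0, 0.5, 2.0):
    G_num = grid[np.argmin(risk(grid, A, tau, beta, r, rho, c))]
    shifts[c] = (G_num - G_closed) / G_closed  # signed relative shift of the argmin
    print(f"c = {c:3.1f}  numeric argmin = {G_num:6.3f}  relative shift = {shifts[c]:+.4f}")
```

A positive c steepens the bias curve, so the true minimizer sits at a larger G than the leading-term formula reports. Whether that shift stays inside the 10% or 15% tolerances mentioned in the rebuttal depends on the actual constants, which this sketch does not know.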

Circularity Check

0 steps flagged

No significant circularity; closed-form follows from external Kolmogorov n-width, explicit dimension, and PRESS identity

full rationale

The derivation balances an assumed exact power-law bias term (Kolmogorov n-width of the smoothness class, an external result from approximation theory), an explicit polynomial expression for basis dimension in G, and the standard PRESS identity for leave-one-out variance. KORE estimates the two scale constants θ and σ² from two pilot fits via a 2×2 system and plugs them into the already-derived analytic minimizer; this is calibration to apply the formula rather than a fitted input renamed as prediction. No self-citation chains, uniqueness theorems, or ansatzes imported from prior author work appear in the provided text. The central claim therefore remains independent of its own outputs and is self-contained against the stated external ingredients and identities.

Axiom & Free-Parameter Ledger

2 free parameters · 3 axioms · 0 invented entities

The central claim rests on three classical results (Kolmogorov n-width bias power, polynomial basis dimension in G, PRESS identity for LOO) plus the modeling choice that interaction order replaces ambient dimension; the two pilot fits introduce fitted bias and noise scales that are not supplied by prior literature.

free parameters (2)
  • bias scale
    Estimated by solving the 2x2 system from the two pilot resolutions in KORE
  • noise scale
    Estimated by solving the 2x2 system from the two pilot resolutions in KORE
axioms (3)
  • domain assumption: squared bias equals a known power of resolution G, exactly the Kolmogorov n-width of the smoothness class
    Invoked to pin bias to a known function of G
  • standard math: basis dimension is an explicit polynomial in G
    Required for the explicit variance term
  • standard math: leave-one-out error is obtained from a single fit via the PRESS identity
    Enables analytic variance curve without refitting
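The PRESS identity listed among the axioms is a standard result for linear least-squares fits: the leave-one-out residual equals the ordinary residual divided by one minus the leverage. A minimal numerical check, assuming an unpenalized least-squares fit and using a polynomial design matrix as a stand-in for a spline basis (the function names are illustrative, not from the paper):

```python
import numpy as np


def press_residuals(X, y):
    """Leave-one-out residuals from a single least-squares fit (PRESS identity)."""
    Q, _ = np.linalg.qr(X)            # thin QR; columns of Q span the column space of X
    leverage = np.sum(Q * Q, axis=1)  # diagonal of the hat matrix H = Q Q^T
    resid = y - Q @ (Q.T @ y)         # ordinary residuals y - H y
    return resid / (1.0 - leverage)


def brute_force_loo(X, y):
    """Leave-one-out residuals by refitting n times, for comparison only."""
    n = len(y)
    out = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        coef, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
        out[i] = y[i] - X[i] @ coef
    return out


rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, 60)
X = np.vander(x, 5, increasing=True)  # degree-4 polynomial basis, a stand-in for splines
y = np.sin(2.0 * np.pi * x) + 0.1 * rng.standard_normal(60)

fast = press_residuals(X, y)
slow = brute_force_loo(X, y)
print("max abs difference:", np.max(np.abs(fast - slow)))
print("PRESS (mean squared LOO residual):", np.mean(fast ** 2))
```

The single-fit route costs one factorization instead of n refits, which is why a leave-one-out score can be attached to each candidate resolution cheaply.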

pith-pipeline@v0.9.1-grok · 5882 in / 1688 out tokens · 38655 ms · 2026-06-26T08:49:39.284801+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. No 3D Matrices: A Unified Tensor-Product View of Matrix-Free Cartesian PDE Solvers

    math.NA · 2026-06 · unverdicted · novelty 2.0

    Cartesian 3D PDE operators factor exactly into 1D line kernels via Kronecker algebra, yielding O(N) cost and O(Nx+Ny+Nz) storage for any fixed stencil or polynomial degree.

Reference graph

Works this paper leans on

83 extracted references · 27 canonical work pages · cited by 1 Pith paper

  1. [1]

    , title =

    Allen, David M. , title =. Technometrics , volume =. 1974 , doi =

  2. [3]

    Numerische Mathematik , volume =

    Craven, Peter and Wahba, Grace , title =. Numerische Mathematik , volume =. 1979 , doi =

  3. [4]

    de Boor, Carl , title =

  4. [5]

    , title =

    Friedman, Jerome H. , title =. The Annals of Statistics , volume =. 1991 , doi =

  5. [6]

    and Heath, Michael and Wahba, Grace , title =

    Golub, Gene H. and Heath, Michael and Wahba, Grace , title =. Technometrics , volume =. 1979 , doi =

  6. [7]

    2013 , doi =

    Gu, Chong , title =. 2013 , doi =

  7. [8]

    and Tibshirani, Robert J

    Hastie, Trevor J. and Tibshirani, Robert J. , title =

  8. [9]

    , title =

    Huang, Jianhua Z. , title =. The Annals of Statistics , volume =. 1998 , doi =

  9. [10]

    Journal of the Royal Statistical Society: Series B , volume =

    Ravikumar, Pradeep and Lafferty, John and Liu, Han and Wasserman, Larry , title =. Journal of the Royal Statistical Society: Series B , volume =. 2009 , doi =

  10. [11]

    , title =

    Mallows, Colin L. , title =. Technometrics , volume =. 1973 , doi =

  11. [12]

    IEEE Transactions on Automatic Control , volume =

    Akaike, Hirotugu , title =. IEEE Transactions on Automatic Control , volume =. 1974 , doi =

  12. [13]

    The Annals of Statistics , volume =

    Schwarz, Gideon , title =. The Annals of Statistics , volume =. 1978 , doi =

  13. [14]

    Journal of the Royal Statistical Society: Series B , volume =

    Stone, Mervyn , title =. Journal of the Royal Statistical Society: Series B , volume =. 1974 , doi =

  14. [15]

    , title =

    Schumaker, Larry L. , title =. 2007 , doi =

  15. [16]

    , title =

    Kolmogorov, Andrey N. , title =. Annals of Mathematics , volume =

  16. [17]

    1985 , doi =

    Pinkus, Allan , title =. 1985 , doi =

  17. [18]

    and Micchelli, Charles A

    Melkman, Avraham A. and Micchelli, Charles A. , title =. Illinois Journal of Mathematics , volume =. 1978 , doi =

  18. [19]

    , title =

    Stone, Charles J. , title =. The Annals of Statistics , volume =. 1982 , doi =

  19. [20]

    , title =

    Stone, Charles J. , title =. The Annals of Statistics , volume =. 1985 , doi =

  20. [21]

    , title =

    Wood, Simon N. , title =. 2017 , doi =

  21. [23]

    Advances in Neural Information Processing Systems , volume =

    Hoffmann, Jordan and Borgeaud, Sebastian and Mensch, Arthur and Buchatskaya, Elena and Cai, Trevor and Rutherford, Eliza and de Las Casas, Diego and Hendricks, Lisa Anne and Welbl, Johannes and Clark, Aidan and others , title =. Advances in Neural Information Processing Systems , volume =. 2022 , url =

  22. [24]

    AutoML Conference 2023 (Workshop Track) , year =

    Fischer, Sebastian Felix and Feurer, Matthias and Bischl, Bernd , title =. AutoML Conference 2023 (Workshop Track) , year =

  23. [25]

    Prediction of full load electrical power output of a base load operated combined cycle power plant using machine learning methods , journal =

    T. Prediction of full load electrical power output of a base load operated combined cycle power plant using machine learning methods , journal =. 2014 , doi =

  24. [26]

    pyGAM: Generalized Additive Models in

    Serv. pyGAM: Generalized Additive Models in. Zenodo , year =

  25. [27]

    Why do tree-based models still outperform deep learning on typical tabular data? , journal =

    Grinsztajn, L. Why do tree-based models still outperform deep learning on typical tabular data? , journal =. 2022 , url =

  26. [28]

    , title =

    Murphy, Allan H. , title =. Monthly Weather Review , volume =. 1988 , url =

  27. [29]

    1990 , doi =

    Wahba, Grace , title =. 1990 , doi =

  28. [30]

    Eilers, Paul H. C. and Marx, Brian D. , title =. Statistical Science , volume =. 1996 , doi =

  29. [31]

    , title =

    Wood, Simon N. , title =. Journal of the Royal Statistical Society: Series B , volume =. 2003 , doi =

  30. [32]

    , title =

    Marra, Giampiero and Wood, Simon N. , title =. Computational Statistics and Data Analysis , volume =. 2011 , doi =

  31. [33]

    The Annals of Statistics , volume =

    Lin, Yi and Zhang, Hao Helen , title =. The Annals of Statistics , volume =. 2006 , doi =

  32. [34]

    , title =

    Radchenko, Peter and James, Gareth M. , title =. Journal of the American Statistical Association , volume =. 2010 , doi =

  33. [35]

    The Annals of Statistics , volume =

    Bien, Jacob and Taylor, Jonathan and Tibshirani, Robert , title =. The Annals of Statistics , volume =. 2013 , doi =

  34. [36]

    , title =

    Agarwal, Rishabh and Melnick, Levi and Frosst, Nicholas and Zhang, Xuezhou and Lengerich, Ben and Caruana, Rich and Hinton, Geoffrey E. , title =. Advances in Neural Information Processing Systems , volume =. 2021 , url =

  35. [37]

    International Conference on Learning Representations , year =

    Chang, Chun-Hao and Caruana, Rich and Goldenberg, Anna , title =. International Conference on Learning Representations , year =

  36. [38]

    Journal of Machine Learning Research , volume =

    Bergstra, James and Bengio, Yoshua , title =. Journal of Machine Learning Research , volume =. 2012 , url =

  37. [39]

    Proceedings of the 35th International Conference on Machine Learning , series =

    Falkner, Stefan and Klein, Aaron and Hutter, Frank , title =. Proceedings of the 35th International Conference on Machine Learning , series =. 2018 , url =

  38. [40]

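    Pieter Gijsbers, Marcos L. P. Bueno, Stefan Coors, Erin LeDell, Sébastien Poirier, Janek Thomas, Bernd Bischl, and Joaquin Vanschoren. AMLB: an AutoML benchmark. Journal of Machine Learning Research, 25(101): 1--65, 2024. URL https://jmlr.org/papers/v25/22-0493.html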

  39. [42]

    Statistical comparisons of classifiers over multiple data sets

    Janez Demšar. Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 7: 1--30, 2006. URL https://jmlr.org/papers/v7/demsar06a.html

  40. [43]

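    How far are automatically chosen regression smoothing parameters from their optimum?

    Wolfgang Härdle, Peter Hall, and James S. Marron. How far are automatically chosen regression smoothing parameters from their optimum? Journal of the American Statistical Association, 83(401): 86--95, 1988. doi:10.1080/01621459.1988.10478568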

  41. [44]

    Smoothing Spline ANOVA Models

    Chong Gu. Smoothing Spline ANOVA Models. Springer, 2nd edition, 2013. doi:10.1007/978-1-4614-5369-7

  42. [45]

    A Practical Guide to Splines, volume 27 of Applied Mathematical Sciences

    Carl de Boor. A Practical Guide to Splines, volume 27 of Applied Mathematical Sciences. Springer, revised edition, 2001

  43. [46]

    Component selection and smoothing in multivariate nonparametric regression

    Yi Lin and Hao Helen Zhang. Component selection and smoothing in multivariate nonparametric regression. The Annals of Statistics, 34(5): 2272--2297, 2006. doi:10.1214/009053606000000722

  44. [47]

    Yong Yi Bay and Kathleen A. Yearick. Machine learning vs deep learning: The generalization problem. arXiv preprint arXiv:2403.01621, 2024

  45. [48]

    A lasso for hierarchical interactions

    Jacob Bien, Jonathan Taylor, and Robert Tibshirani. A lasso for hierarchical interactions. The Annals of Statistics, 41(3): 1111--1141, 2013. doi:10.1214/13-AOS1096

  46. [49]

    Simon N. Wood. Generalized Additive Models: An Introduction with R. Chapman and Hall/CRC, 2nd edition, 2017. doi:10.1201/9781315370279

  47. [50]

    Simon N. Wood. Thin plate regression splines. Journal of the Royal Statistical Society: Series B, 65(1): 95--114, 2003. doi:10.1111/1467-9868.00374

  48. [51]

    Gene H. Golub, Michael Heath, and Grace Wahba. Generalized cross-validation as a method for choosing a good ridge parameter. Technometrics, 21(2): 215--223, 1979. doi:10.1080/00401706.1979.10489751

  49. [52]

    NODE-GAM: Neural generalized additive model for interpretable deep learning

    Chun-Hao Chang, Rich Caruana, and Anna Goldenberg. NODE-GAM: Neural generalized additive model for interpretable deep learning. In International Conference on Learning Representations, 2022. URL https://arxiv.org/abs/2106.01613

  50. [53]

    Spline Models for Observational Data, volume 59 of CBMS-NSF Regional Conference Series in Applied Mathematics

    Grace Wahba. Spline Models for Observational Data, volume 59 of CBMS-NSF Regional Conference Series in Applied Mathematics. SIAM, 1990. doi:10.1137/1.9781611970128

  51. [54]

    David M. Allen. The relationship between variable selection and data augmentation and a method for prediction. Technometrics, 16(1): 125--127, 1974. doi:10.1080/00401706.1974.10489157

  52. [55]

    Cross-validatory choice and assessment of statistical predictions

    Mervyn Stone. Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society: Series B, 36(2): 111--133, 1974. doi:10.1111/j.2517-6161.1974.tb00994.x

  53. [56]

    Charles J. Stone. Optimal global rates of convergence for nonparametric regression. The Annals of Statistics, 10(4): 1040--1053, 1982. doi:10.1214/aos/1176345969

  54. [57]

    Charles J. Stone. Additive regression and other nonparametric models. The Annals of Statistics, 13(2): 689--705, 1985. doi:10.1214/aos/1176349548

  55. [58]

    Jianhua Z. Huang. Projection estimation in multiple regression with application to functional ANOVA models. The Annals of Statistics, 26(1): 242--272, 1998. doi:10.1214/aos/1030563984

  56. [59]

    Giampiero Marra and Simon N. Wood. Practical variable selection for generalized additive models. Computational Statistics and Data Analysis, 55(7): 2372--2387, 2011. doi:10.1016/j.csda.2011.02.004

  57. [60]

    Paul H. C. Eilers and Brian D. Marx. Flexible smoothing with B-splines and penalties. Statistical Science, 11(2): 89--121, 1996. doi:10.1214/ss/1038425655

  58. [61]

    n-Widths in Approximation Theory

    Allan Pinkus. n-Widths in Approximation Theory. Ergebnisse der Mathematik und ihrer Grenzgebiete. Springer, 1985. doi:10.1007/978-3-642-69894-1

  59. [62]

    Smoothing noisy data with spline functions: Estimating the correct degree of smoothing by the method of generalized cross-validation

    Peter Craven and Grace Wahba. Smoothing noisy data with spline functions: Estimating the correct degree of smoothing by the method of generalized cross-validation. Numerische Mathematik, 31(4): 377--403, 1979. doi:10.1007/BF01404567

  60. [63]

    Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020

  61. [64]

    Allan H. Murphy. Skill scores based on the mean square error and their relationships to the correlation coefficient. Monthly Weather Review, 116(12): 2417--2424, 1988. URL https://journals.ametsoc.org/view/journals/mwre/116/12/1520-0493_1988_116_2417_ssbotm_2_0_co_2.xml

  62. [65]

    Daniel Servén and Charlie Brummitt. pygam: Generalized additive models in Python. Zenodo, 2018. doi:10.5281/zenodo.1208723

  63. [66]

    Trevor J. Hastie and Robert J. Tibshirani. Generalized Additive Models. Chapman and Hall, 1990

  64. [67]

    Hirotugu Akaike. A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19(6): 716--723, 1974. doi:10.1109/TAC.1974.1100705

  65. [68]

    Wolfgang Härdle, Peter Hall, and James S. Marron. How far are automatically chosen regression smoothing parameters from their optimum? Journal of the American Statistical Association, 83(401): 86--95, 1988. doi:10.1080/01621459.1988.10478568

  66. [69]

    Statistical comparisons of classifiers over multiple data sets

    Janez Demšar. Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 7: 1--30, 2006. URL https://jmlr.org/papers/v7/demsar06a.html

  67. [70]

    Colin L. Mallows. Some comments on C_P. Technometrics, 15(4): 661--675, 1973. doi:10.1080/00401706.1973.10489103

  68. [71]

    Gideon Schwarz. Estimating the dimension of a model. The Annals of Statistics, 6(2): 461--464, 1978. doi:10.1214/aos/1176344136

  69. [72]

    BOHB: Robust and efficient hyperparameter optimization at scale

    Stefan Falkner, Aaron Klein, and Frank Hutter. BOHB: Robust and efficient hyperparameter optimization at scale. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 1437--1446. PMLR, 2018. URL https://proceedings.mlr.press/v80/falkner18a.html

  70. [73]

    Rishabh Agarwal, Levi Melnick, Nicholas Frosst, Xuezhou Zhang, Ben Lengerich, Rich Caruana, and Geoffrey E. Hinton. Neural additive models: Interpretable machine learning with neural nets. In Advances in Neural Information Processing Systems, volume 34, 2021. URL https://arxiv.org/abs/2004.13912

  71. [74]

    Avraham A. Melkman and Charles A. Micchelli. Spline spaces are optimal for L^2 n-width. Illinois Journal of Mathematics, 22(4): 541--564, 1978. doi:10.1215/ijm/1256048466

  72. [75]

    OpenML-CTR23: A curated tabular regression benchmarking suite

    Sebastian Felix Fischer, Matthias Feurer, and Bernd Bischl. OpenML-CTR23: A curated tabular regression benchmarking suite. In AutoML Conference 2023 (Workshop Track), 2023. URL https://openreview.net/forum?id=HebAOoMm94

  73. [76]

    AutoGluon-Tabular: Robust and accurate AutoML for structured data

    Nick Erickson, Jonas Mueller, Alexander Shirkov, Hang Zhang, Pedro Larroy, Mu Li, and Alexander Smola. AutoGluon-Tabular: Robust and accurate AutoML for structured data. arXiv preprint arXiv:2003.06505, 2020

  74. [77]

    Prediction of full load electrical power output of a base load operated combined cycle power plant using machine learning methods

    Pınar Tüfekci. Prediction of full load electrical power output of a base load operated combined cycle power plant using machine learning methods. International Journal of Electrical Power & Energy Systems, 60: 126--140, 2014. doi:10.1016/j.ijepes.2014.02.027

  75. [78]

    Random search for hyper-parameter optimization

    James Bergstra and Yoshua Bengio. Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13: 281--305, 2012. URL https://jmlr.org/papers/v13/bergstra12a.html

  76. [79]

    Pieter Gijsbers, Marcos L. P. Bueno, Stefan Coors, Erin LeDell, Sébastien Poirier, Janek Thomas, Bernd Bischl, and Joaquin Vanschoren. AMLB: an AutoML benchmark. Journal of Machine Learning Research, 25(101): 1--65, 2024. URL https://jmlr.org/papers/v25/22-0493.html

  77. [80]

    Training compute-optimal large language models

    Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models. Advances in Neural Information Processing Systems, 35, 2022. URL https://arxiv.org/abs/2203.15556

  78. [81]

    Jerome H. Friedman. Multivariate adaptive regression splines. The Annals of Statistics, 19(1): 1--67, 1991. doi:10.1214/aos/1176347963

  79. [82]

    Peter Radchenko and Gareth M. James. Variable selection using adaptive nonlinear interaction structures in high dimensions. Journal of the American Statistical Association, 105(492): 1541--1553, 2010. doi:10.1198/jasa.2010.tm10130

  80. [83]

    Larry L. Schumaker. Spline Functions: Basic Theory. Cambridge University Press, 3rd edition, 2007. doi:10.1017/CBO9780511618994
