pith. machine review for the scientific record.

arxiv: 2605.13587 · v1 · submitted 2026-05-13 · 📊 stat.ML · cs.LG · eess.SP

Recognition: unknown

Reframing preprocessing selection as model-internal calibration in near-infrared spectroscopy: A large-scale benchmark of operator-adaptive PLS and Ridge models

Camille Noûs, Denis Cornet, Gregory Beurier, Lauriane Rouan, Robin Reiter


Pith reviewed 2026-05-14 18:23 UTC · model grok-4.3

classification 📊 stat.ML · cs.LG · eess.SP
keywords near-infrared spectroscopy · NIRS · preprocessing selection · operator-adaptive models · PLS regression · Ridge regression · spectral calibration · model-internal selection

The pith

Operator-adaptive models that fold preprocessing selection inside calibration outperform standard PLS and Ridge on most NIRS datasets.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes moving the selection of spectral preprocessing from an external search into the calibration model itself. Linear treatments become operators that the model can choose among, while nonlinear corrections such as SNV, MSC, and ASLS are handled as branches evaluated only within each cross-validation fold. This setup is instantiated for both PLS and Ridge regression and tested on more than fifty heterogeneous near-infrared datasets. The resulting models aim to deliver comparable or better predictive accuracy while eliminating the need for costly, unstable preprocessing hyperparameter searches. A sympathetic reader would care because routine NIRS work often involves small calibration sets where external searches are impractical and hard to audit.
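To make the operator view concrete, here is a minimal sketch, not the authors' code, of how a linear treatment such as a Savitzky-Golay derivative can be materialized as a spectral operator matrix: applying the treatment then becomes a plain matrix product that a calibration model can select among. The sizes, filter settings, and helper name are illustrative assumptions.

```python
import numpy as np
from scipy.signal import savgol_filter

def savgol_operator(n_wavelengths, window=11, polyorder=2, deriv=1):
    """Materialize a Savitzky-Golay derivative as a spectral operator matrix P.

    The filter is linear, so filtering the columns of the identity matrix
    recovers P exactly; the treated spectra are then simply X @ P.T.
    """
    identity = np.eye(n_wavelengths)
    return savgol_filter(identity, window_length=window,
                         polyorder=polyorder, deriv=deriv, axis=0)

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 256))   # toy spectra: 20 samples x 256 wavelengths
P = savgol_operator(256)

direct = savgol_filter(X, window_length=11, polyorder=2, deriv=1, axis=1)
via_operator = X @ P.T           # the same treatment expressed as an operator
assert np.allclose(direct, via_operator)
```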

Core claim

Compact operator-adaptive PLS models that include ASLS branch preprocessing reach a median RMSEP ratio of 0.960 relative to ordinary PLS and win on 42 of 57 datasets; a deployable AOM-Ridge selector improves over tuned Ridge by a median 2.22 percent with 35 wins on 52 datasets. The framework encodes candidate linear preprocessing steps as spectral operators, uses covariance identities to keep PLS variants fast while retaining original-wavelength coefficients, and employs operator-adaptive kernels for a dual Ridge formulation whose coefficients remain recoverable in the original space.
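As a rough illustration of the Ridge half of this claim, the numpy sketch below shows how a dual Ridge fit on operator-treated spectra Z = X P^T runs through the kernel K = Z Z^T = X P^T P X^T and still yields coefficients on the original wavelength axis via beta = P^T Z^T alpha. It assumes mean-centered data, no intercept, and a single fixed operator P; the paper's AOM-Ridge additionally selects among candidate operators, which this sketch does not attempt.

```python
import numpy as np

def operator_ridge(X, y, P, lam=1.0):
    """Dual Ridge on operator-treated spectra Z = X @ P.T (X, y assumed centered).

    The kernel K = Z Z^T lets the solve happen in sample space, which helps
    when n_samples << n_wavelengths, and the fitted coefficients pull back
    to the original wavelength axis as beta = P.T @ (Z.T @ alpha).
    """
    Z = X @ P.T
    K = Z @ Z.T                                   # operator-adaptive kernel
    alpha = np.linalg.solve(K + lam * np.eye(len(y)), y)
    beta_treated = Z.T @ alpha                    # coefficients in treated space
    return P.T @ beta_treated                     # coefficients at original wavelengths

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 256))                    # 40 samples, 256 wavelengths
P = np.eye(256)                                   # placeholder operator (identity)
y = X @ rng.normal(size=256) + 0.1 * rng.normal(size=40)

beta = operator_ridge(X, y, P, lam=1.0)
y_hat = X @ beta                                  # predictions use raw spectra directly
```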

What carries the argument

Operator-adaptive calibration, which treats candidate preprocessing steps as selectable linear spectral operators and confines nonlinear or sample-adaptive corrections to fold-local branches.

Load-bearing premise

That fold-local branches for nonlinear corrections fully prevent information leakage and avoid introducing bias or overfitting to the particular cross-validation splits.
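To show what the premise requires in practice, here is a minimal sketch of fold-local branch handling: the branch (here only SNV versus no correction, picked on an inner validation split) is chosen from the training portion of each outer fold, and the held-out portion only ever sees the already-fixed choice. The helper names, the fixed five-component PLS, and the single-split inner selection are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
from sklearn.model_selection import KFold, train_test_split
from sklearn.cross_decomposition import PLSRegression
from sklearn.metrics import mean_squared_error

def snv(X):
    """Standard normal variate: per-spectrum centering and scaling."""
    return (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)

branches = {"none": lambda X: X, "snv": snv}

def fold_local_branch_cv(X, y, n_components=5, seed=0):
    """Outer CV in which the branch is chosen inside each training fold only."""
    errors = []
    for train_idx, test_idx in KFold(5, shuffle=True, random_state=seed).split(X):
        X_tr, y_tr = X[train_idx], y[train_idx]
        # Inner split: branch selection never touches the outer test fold.
        X_fit, X_val, y_fit, y_val = train_test_split(
            X_tr, y_tr, test_size=0.25, random_state=seed)
        scores = {}
        for name, branch in branches.items():
            model = PLSRegression(n_components=n_components).fit(branch(X_fit), y_fit)
            scores[name] = mean_squared_error(y_val, model.predict(branch(X_val)))
        best = min(scores, key=scores.get)
        # Refit on the whole training fold with the now-fixed branch, then score.
        model = PLSRegression(n_components=n_components).fit(branches[best](X_tr), y_tr)
        pred = model.predict(branches[best](X[test_idx]))
        errors.append(np.sqrt(mean_squared_error(y[test_idx], pred)))
    return np.array(errors)

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 150))
y_demo = X_demo[:, :10].sum(axis=1) + 0.1 * rng.normal(size=100)
print(fold_local_branch_cv(X_demo, y_demo).round(3))   # per-fold RMSEP values
```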

What would settle it

Running the operator-adaptive models on a fresh collection of NIRS datasets and finding that their median performance falls below that of standard external preprocessing searches, or detecting leakage when the branches are examined for dependence on held-out samples.

Figures

Figures reproduced from arXiv: 2605.13587 by Camille Noûs, Denis Cornet, Gregory Beurier, Lauriane Rouan, Robin Reiter.

Figure 1: From external preprocessing search to operator-adaptive analytical calibration.
Figure 2: AOM mathematical structure. AOM-PLS exploits an operator identity on cross …
Figure 3: Median error reductions and win counts from the available headline result tables.
Figure 4: Search-budget contrast. Counts for PLS, Ridge, CatBoost and CNN-1D come from the …
original abstract

Near-infrared spectroscopy (NIRS) is rapid and non-destructive, but reliable calibration still depends heavily on spectral preprocessing. In routine practice, preprocessing is often selected by large external pipeline searches that are costly, unstable on small calibration sets, and difficult to audit. We introduce operator-adaptive calibration, a framework that moves linear preprocessing selection inside the calibration model. Candidate treatments are encoded as linear spectral operators, while nonlinear or sample-adaptive corrections such as SNV, MSC, and ASLS are handled as fold-local branches to prevent leakage. We instantiate the framework for PLS and Ridge regression. For PLS, covariance identities enable fast NIPALS and SIMPLS variants while preserving original-wavelength coefficients. For Ridge, operator-adaptive kernels yield a dual formulation with recoverable original-space coefficients. The approach was evaluated on more than 50 heterogeneous NIRS datasets against conventional PLS, Ridge, CatBoost, and CNN baselines under documented search budgets. Compact operator-adaptive PLS with ASLS branch preprocessing achieved a median RMSEP/PLS ratio of 0.960 with 42 wins on 57 datasets, while a deployable AOM-Ridge selector improved over tuned Ridge by a median 2.22% with 35 wins on 52 datasets. The proposed models reduce dependence on large preprocessing-HPO campaigns, produce traceable operator choices, retain interpretable coefficients, and fit in seconds for compact AOM-PLS. Operator-adaptive calibration therefore offers a practical route to faster, more robust, and more auditable NIRS method development.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces an operator-adaptive calibration framework for NIRS that integrates the selection of spectral preprocessing operators directly into PLS and Ridge models. Linear operators use covariance identities for efficient computation in PLS and dual kernels for Ridge, while nonlinear corrections (SNV, MSC, ASLS) are treated as fold-local branches. On a benchmark of 57 datasets, the compact AOM-PLS with ASLS achieves a median RMSEP/PLS ratio of 0.960 with 42 wins, and AOM-Ridge shows a 2.22% median improvement with 35 wins on 52 datasets, compared to standard PLS, Ridge, CatBoost, and CNN.

Significance. If the no-leakage claim holds, the framework offers a practical, auditable alternative to external preprocessing searches in NIRS, reducing computational cost while retaining interpretable coefficients and traceable operator choices. The large-scale empirical benchmark across heterogeneous datasets is a strength, providing evidence of robustness over baselines. The use of standard covariance identities and dual formulations avoids circularity, but overall significance depends on confirming that fold-local branches do not introduce selection bias.

major comments (2)
  1. [Methods (branch handling)] Methods section on branch handling: The assertion that encoding nonlinear corrections as fold-local branches fully prevents information leakage is load-bearing for the headline performance claims (median RMSEP/PLS ratio 0.960 with 42/57 wins; AOM-Ridge +2.22% with 35/52 wins). Without a nested-CV ablation that isolates the adaptive selector's contribution from the base regressor, it remains possible that per-fold choices overfit to partition-specific noise on heterogeneous NIRS sets.
  2. [Results (win counts)] Results section (win counts and ratios): The reported median improvements and win counts are presented without statistical significance tests, confidence intervals, or sensitivity analysis to dataset splits; this weakens the cross-dataset robustness claim and requires clarification on how the 57 vs. 52 dataset counts were determined.
minor comments (2)
  1. [Abstract] Abstract: Clarify the exact exclusion criteria that produce 57 datasets for PLS but only 52 for Ridge to ensure the benchmark is fully reproducible.
  2. [Notation] Notation: Ensure the abbreviation 'AOM' (operator-adaptive model) is defined at first use and used consistently when referring to the PLS and Ridge variants.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thorough and constructive comments on our manuscript. We provide point-by-point responses to the major comments below and outline the revisions we will make to address the concerns raised.

point-by-point responses
  1. Referee: [Methods (branch handling)] Methods section on branch handling: The assertion that encoding nonlinear corrections as fold-local branches fully prevents information leakage is load-bearing for the headline performance claims (median RMSEP/PLS ratio 0.960 with 42/57 wins; AOM-Ridge +2.22% with 35/52 wins). Without a nested-CV ablation that isolates the adaptive selector's contribution from the base regressor, it remains possible that per-fold choices overfit to partition-specific noise on heterogeneous NIRS sets.

    Authors: We appreciate the referee's emphasis on this critical aspect. The fold-local branches are designed to avoid leakage by performing operator selection (such as choosing ASLS parameters) strictly within the training portion of each CV fold, using only the training data for any internal optimization or selection. The chosen operator is then fixed and applied to the test portion of the fold. This ensures that test data does not influence the preprocessing choice, consistent with best practices for preventing data leakage in cross-validation. To further address the potential for partition-specific overfitting, we will incorporate an additional ablation study using nested cross-validation in the revised manuscript. This will compare the full adaptive model against a version where the branch is fixed globally (selected on the entire training set) to isolate the benefit of the per-fold adaptation (a sketch of such a comparison follows these responses). revision: yes

  2. Referee: [Results (win counts)] Results section (win counts and ratios): The reported median improvements and win counts are presented without statistical significance tests, confidence intervals, or sensitivity analysis to dataset splits; this weakens the cross-dataset robustness claim and requires clarification on how the 57 vs. 52 dataset counts were determined.

    Authors: We agree that the presentation of results can be strengthened with statistical analysis. In the revision, we will add Wilcoxon signed-rank tests to assess the significance of the median improvements and report p-values for the win counts. We will also include bootstrap-derived confidence intervals for the median RMSEP ratios (a sketch of both computations follows these responses). For the dataset counts, the full set of 57 datasets was used for PLS models as they are applicable across all, while the 52 datasets for Ridge models exclude 5 cases where the kernel matrix was ill-conditioned due to high multicollinearity or insufficient samples for a stable dual formulation; this will be explicitly documented in the Methods section. Furthermore, we will perform a sensitivity analysis by repeating the experiments with multiple random seeds for the data partitioning and reporting the variability in win counts. revision: yes
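On the nested-CV ablation promised in the first response, one way such a comparison could be set up is sketched below, using an sklearn pipeline in which the branch is a swappable step. Everything here is illustrative: the `snv` helper, the synthetic data, and the choice of SNV versus no branch are assumptions, and the "fixed" arm simply hard-codes one branch rather than reproducing the authors' global selection procedure.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

def snv(X):
    """Standard normal variate applied per spectrum."""
    return (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 200))                       # synthetic stand-in spectra
y = X[:, :20].sum(axis=1) + 0.1 * rng.normal(size=120)

pipe = Pipeline([("branch", FunctionTransformer()),   # identity branch by default
                 ("pls", PLSRegression(n_components=5))])
grid = {"branch": [FunctionTransformer(), FunctionTransformer(snv)]}

inner = KFold(5, shuffle=True, random_state=1)
outer = KFold(5, shuffle=True, random_state=2)

# (a) Adaptive: the branch is re-selected inside every outer training fold.
adaptive = cross_val_score(
    GridSearchCV(pipe, grid, cv=inner, scoring="neg_root_mean_squared_error"),
    X, y, cv=outer, scoring="neg_root_mean_squared_error")

# (b) Fixed: a single branch (here SNV) applied identically in every fold.
fixed = cross_val_score(pipe.set_params(branch=FunctionTransformer(snv)),
                        X, y, cv=outer, scoring="neg_root_mean_squared_error")

print("adaptive RMSEP:", -adaptive.mean(), "fixed-branch RMSEP:", -fixed.mean())
```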
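On the statistics promised in the second response, the Wilcoxon signed-rank test and a percentile bootstrap for the median are straightforward once the per-dataset RMSEP ratios are in hand. A minimal sketch follows; the `summarize_ratios` helper and the synthetic demo values are illustrative, not numbers from the paper.

```python
import numpy as np
from scipy.stats import wilcoxon

def summarize_ratios(rmsep_ratios, n_boot=10_000, seed=0):
    """Wilcoxon signed-rank test and bootstrap CI for per-dataset RMSEP ratios.

    A ratio below 1.0 means the operator-adaptive model beat the baseline on
    that dataset; the one-sided test asks whether ratios are shifted below 1.
    """
    ratios = np.asarray(rmsep_ratios, dtype=float)
    _, p_value = wilcoxon(ratios - 1.0, alternative="less")
    rng = np.random.default_rng(seed)
    boot_medians = np.median(
        rng.choice(ratios, size=(n_boot, ratios.size), replace=True), axis=1)
    lo, hi = np.percentile(boot_medians, [2.5, 97.5])
    return {"median": float(np.median(ratios)),
            "wins": int((ratios < 1.0).sum()),
            "wilcoxon_p": float(p_value),
            "median_95ci": (float(lo), float(hi))}

# Illustrative call on synthetic ratios (not the paper's numbers):
demo_ratios = np.random.default_rng(1).normal(loc=0.96, scale=0.05, size=57)
print(summarize_ratios(demo_ratios))
```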

Circularity Check

0 steps flagged

No significant circularity; empirical benchmark relies on standard formulations

full rationale

The paper introduces operator-adaptive PLS and Ridge variants by encoding preprocessing as linear operators with fold-local branches for nonlinear corrections, then evaluates them via cross-validated RMSEP on 57+ NIRS datasets against baselines. The core derivations invoke standard covariance identities for PLS (NIPALS/SIMPLS) and dual kernels for Ridge to recover coefficients; these are not self-definitional or fitted-input predictions but established linear algebra results applied to the new operator encoding. No load-bearing self-citations, uniqueness theorems, or ansatzes are invoked to force the central claims. Performance ratios (e.g., median 0.960) are computed directly from held-out predictions rather than reducing to input parameters by construction. The framework is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The framework assumes that preprocessing operations can be faithfully represented as linear spectral operators and that fold-local branching for nonlinear corrections like SNV/MSC/ASLS prevents leakage without compromising the model's ability to select effective treatments. No explicit free parameters are detailed in the abstract beyond the implicit choices in branch handling and regularization.

pith-pipeline@v0.9.0 · 5604 in / 1323 out tokens · 50271 ms · 2026-05-14T18:23:07.895277+00:00 · methodology

