Reframing preprocessing selection as model-internal calibration in near-infrared spectroscopy: A large-scale benchmark of operator-adaptive PLS and Ridge models
Pith reviewed 2026-05-14 18:23 UTC · model grok-4.3
The pith
Operator-adaptive models that fold preprocessing selection inside calibration outperform standard PLS and Ridge on most NIRS datasets.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Compact operator-adaptive PLS models that include ASLS branch preprocessing reach a median RMSEP ratio of 0.960 relative to ordinary PLS and win on 42 of 57 datasets; a deployable AOM-Ridge selector improves over tuned Ridge by a median 2.22 percent with 35 wins on 52 datasets. The framework encodes candidate linear preprocessing steps as spectral operators, uses covariance identities to keep PLS variants fast while retaining original-wavelength coefficients, and employs operator-adaptive kernels for a dual Ridge formulation whose coefficients remain recoverable in the original space.
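The dual-Ridge identity this claim leans on can be sketched numerically: if a candidate preprocessing step is a linear operator P, ridge fitted in the dual on the operator kernel K = X Pᵀ P Xᵀ has coefficients that map back to the original wavelength axis via Pᵀ. A minimal sketch under assumed notation (toy data, a first-difference operator; not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 40, 120                       # samples x wavelengths (toy sizes)
X = rng.normal(size=(n, p))
y = rng.normal(size=n)
lam = 1.0                            # ridge penalty (illustrative)

# Toy linear "preprocessing operator": first difference along wavelengths.
P = np.eye(p)[1:] - np.eye(p)[:-1]   # shape (p-1, p)

Z = X @ P.T                          # operator-preprocessed spectra
K = Z @ Z.T                          # operator-adaptive kernel K = X P^T P X^T

# Dual ridge solution, then recovery of original-wavelength coefficients.
alpha = np.linalg.solve(K + lam * np.eye(n), y)
beta_orig = P.T @ (Z.T @ alpha)      # coefficients over original wavelengths

# Predicting from raw spectra with the recovered coefficients matches the
# dual model's predictions on preprocessed spectra.
pred_dual = K @ alpha
pred_orig = X @ beta_orig
assert np.allclose(pred_dual, pred_orig)
```

The equivalence is just X βₒᵣᵢ𝓰 = X Pᵀ Zᵀ α = Z Zᵀ α = K α, which is what makes original-space coefficients recoverable from the dual fit.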
What carries the argument
Operator-adaptive calibration, which treats candidate preprocessing steps as selectable linear spectral operators and confines nonlinear or sample-adaptive corrections to fold-local branches.
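The "linear spectral operator" idea can be made concrete with a Savitzky-Golay derivative: because the filter is linear in the data, its action on a spectrum is exactly a matrix product, and candidate treatments then compose by matrix multiplication. An illustrative sketch (window lengths and orders chosen arbitrarily, not taken from the paper):

```python
import numpy as np
from scipy.signal import savgol_filter

p = 60
# Materialize a Savitzky-Golay first derivative (window 11, order 2) as an
# explicit p x p operator: column j is the filter applied to basis vector e_j.
D = savgol_filter(np.eye(p), window_length=11, polyorder=2, deriv=1, axis=0)

# Linearity means applying the matrix equals applying the filter directly.
x = np.sin(np.linspace(0.0, 3.0, p))
assert np.allclose(D @ x, savgol_filter(x, 11, 2, deriv=1))

# Operators compose by matrix product: smooth first, then differentiate.
S = savgol_filter(np.eye(p), window_length=7, polyorder=3, axis=0)
combined = D @ S
assert np.allclose(combined @ x,
                   savgol_filter(savgol_filter(x, 7, 3), 11, 2, deriv=1))
```

Encoding each treatment this way is what lets a calibration model select among operators while keeping its coefficients expressed at the original wavelengths.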
Load-bearing premise
That fold-local branches for nonlinear corrections fully prevent information leakage and avoid introducing bias or overfitting to the particular cross-validation splits.
What would settle it
Running the operator-adaptive models on a fresh collection of NIRS datasets and finding that their median performance falls below that of standard external preprocessing searches, or detecting leakage when the branches are examined for dependence on held-out samples.
Figures
Original abstract
Near-infrared spectroscopy (NIRS) is rapid and non-destructive, but reliable calibration still depends heavily on spectral preprocessing. In routine practice, preprocessing is often selected by large external pipeline searches that are costly, unstable on small calibration sets, and difficult to audit. We introduce operator-adaptive calibration, a framework that moves linear preprocessing selection inside the calibration model. Candidate treatments are encoded as linear spectral operators, while nonlinear or sample-adaptive corrections such as SNV, MSC, and ASLS are handled as fold-local branches to prevent leakage. We instantiate the framework for PLS and Ridge regression. For PLS, covariance identities enable fast NIPALS and SIMPLS variants while preserving original-wavelength coefficients. For Ridge, operator-adaptive kernels yield a dual formulation with recoverable original-space coefficients. The approach was evaluated on more than 50 heterogeneous NIRS datasets against conventional PLS, Ridge, CatBoost, and CNN baselines under documented search budgets. Compact operator-adaptive PLS with ASLS branch preprocessing achieved a median RMSEP/PLS ratio of 0.960 with 42 wins on 57 datasets, while a deployable AOM-Ridge selector improved over tuned Ridge by a median 2.22% with 35 wins on 52 datasets. The proposed models reduce dependence on large preprocessing-HPO campaigns, produce traceable operator choices, retain interpretable coefficients, and fit in seconds for compact AOM-PLS. Operator-adaptive calibration therefore offers a practical route to faster, more robust, and more auditable NIRS method development.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces an operator-adaptive calibration framework for NIRS that integrates the selection of spectral preprocessing operators directly into PLS and Ridge models. Linear operators use covariance identities for efficient computation in PLS and dual kernels for Ridge, while nonlinear corrections (SNV, MSC, ASLS) are treated as fold-local branches. On a benchmark of 57 datasets, the compact AOM-PLS with ASLS achieves a median RMSEP/PLS ratio of 0.960 with 42 wins, and AOM-Ridge shows a 2.22% median improvement with 35 wins on 52 datasets, compared to standard PLS, Ridge, CatBoost, and CNN.
Significance. If the no-leakage claim holds, the framework offers a practical, auditable alternative to external preprocessing searches in NIRS, reducing computational cost while retaining interpretable coefficients and traceable operator choices. The large-scale empirical benchmark across heterogeneous datasets is a strength, providing evidence of robustness over baselines. The use of standard covariance identities and dual formulations avoids circularity, but overall significance depends on confirming that fold-local branches do not introduce selection bias.
major comments (2)
- [Methods (branch handling)] Methods section on branch handling: The assertion that encoding nonlinear corrections as fold-local branches fully prevents information leakage is load-bearing for the headline performance claims (median RMSEP/PLS ratio 0.960 with 42/57 wins; AOM-Ridge +2.22% with 35/52 wins). Without a nested-CV ablation that isolates the adaptive selector's contribution from the base regressor, it remains possible that per-fold choices overfit to partition-specific noise on heterogeneous NIRS sets.
- [Results (win counts)] Results section (win counts and ratios): The reported median improvements and win counts are presented without statistical significance tests, confidence intervals, or sensitivity analysis to dataset splits; this weakens the cross-dataset robustness claim and requires clarification on how the 57 vs. 52 dataset counts were determined.
minor comments (2)
- [Abstract] Abstract: Clarify the exact exclusion criteria that produce 57 datasets for PLS but only 52 for Ridge to ensure the benchmark is fully reproducible.
- [Notation] Notation: Ensure the abbreviation 'AOM' (operator-adaptive model) is defined at first use and used consistently when referring to the PLS and Ridge variants.
Simulated Author's Rebuttal
We thank the referee for their thorough and constructive comments on our manuscript. We provide point-by-point responses to the major comments below and outline the revisions we will make to address the concerns raised.
Point-by-point responses
Referee: [Methods (branch handling)] Methods section on branch handling: The assertion that encoding nonlinear corrections as fold-local branches fully prevents information leakage is load-bearing for the headline performance claims (median RMSEP/PLS ratio 0.960 with 42/57 wins; AOM-Ridge +2.22% with 35/52 wins). Without a nested-CV ablation that isolates the adaptive selector's contribution from the base regressor, it remains possible that per-fold choices overfit to partition-specific noise on heterogeneous NIRS sets.
Authors: We appreciate the referee's emphasis on this critical aspect. The fold-local branches are designed to avoid leakage by performing operator selection (such as choosing ASLS parameters) strictly within the training portion of each CV fold, using only the training data for any internal optimization or selection. The chosen operator is then fixed and applied to the test portion of the fold. This ensures that test data does not influence the preprocessing choice, consistent with best practices for preventing data leakage in cross-validation. To further address the potential for partition-specific overfitting, we will incorporate an additional ablation study using nested cross-validation in the revised manuscript. This will compare the full adaptive model against a version where the branch is fixed globally (selected on the entire training set) to isolate the benefit of the per-fold adaptation. revision: yes
Referee: [Results (win counts)] Results section (win counts and ratios): The reported median improvements and win counts are presented without statistical significance tests, confidence intervals, or sensitivity analysis to dataset splits; this weakens the cross-dataset robustness claim and requires clarification on how the 57 vs. 52 dataset counts were determined.
Authors: We agree that the presentation of results can be strengthened with statistical analysis. In the revision, we will add Wilcoxon signed-rank tests to assess the significance of the median improvements and report p-values for the win counts. We will also include bootstrap-derived confidence intervals for the median RMSEP ratios. For the dataset counts, the full set of 57 datasets was used for PLS models as they are applicable across all, while the 52 datasets for Ridge models exclude 5 cases where the kernel matrix was ill-conditioned due to high multicollinearity or insufficient samples for stable dual formulation; this will be explicitly documented in the Methods section. Furthermore, we will perform a sensitivity analysis by repeating the experiments with multiple random seeds for the data partitioning and reporting the variability in win counts. revision: yes
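The promised statistics assemble directly from per-dataset RMSEP pairs: a Wilcoxon signed-rank test for paired improvement, a win count, and a bootstrap confidence interval for the median ratio. A sketch on synthetic numbers (the real inputs would be the benchmark's held-out RMSEPs):

```python
import numpy as np
from scipy.stats import bootstrap, wilcoxon

# Hypothetical per-dataset RMSEPs for baseline PLS and an adaptive model.
rng = np.random.default_rng(42)
rmsep_pls = rng.uniform(0.5, 2.0, size=57)
rmsep_aom = rmsep_pls * rng.normal(loc=0.96, scale=0.08, size=57)

ratios = rmsep_aom / rmsep_pls
wins = int((ratios < 1.0).sum())          # datasets where the model wins

# Paired one-sided test: are per-dataset RMSEPs systematically lower?
stat, pvalue = wilcoxon(rmsep_aom, rmsep_pls, alternative="less")

# Bootstrap CI for the median RMSEP ratio across datasets.
ci = bootstrap((ratios,), np.median, confidence_level=0.95,
               n_resamples=2000, random_state=0).confidence_interval
```

Reporting `pvalue`, `wins`, and `(ci.low, ci.high)` alongside the median ratio would substantiate the cross-dataset robustness claim.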
Circularity Check
No significant circularity; empirical benchmark relies on standard formulations
Full rationale
The paper introduces operator-adaptive PLS and Ridge variants by encoding preprocessing as linear operators with fold-local branches for nonlinear corrections, then evaluates them via cross-validated RMSEP on 57+ NIRS datasets against baselines. The core derivations invoke standard covariance identities for PLS (NIPALS/SIMPLS) and dual kernels for Ridge to recover coefficients; these are not self-definitional or fitted-input predictions but established linear algebra results applied to the new operator encoding. No load-bearing self-citations, uniqueness theorems, or ansatzes are invoked to force the central claims. Performance ratios (e.g., median 0.960) are computed directly from held-out predictions rather than reducing to input parameters by construction. The framework is therefore self-contained against external benchmarks.
Reference graph
Works this paper leans on
- [1] Paul Geladi and Bruce R. Kowalski. Partial least-squares regression: a tutorial. Analytica Chimica Acta, 185:1-17, 1986. doi:10.1016/0003-2670(86)80028-9
- [2] Svante Wold, Michael Sjöström, and Lennart Eriksson. PLS-regression: a basic tool of chemometrics. Chemometrics and Intelligent Laboratory Systems, 58(2):109-130, 2001. doi:10.1016/S0169-7439(01)00155-1
- [3] Sijmen de Jong. SIMPLS: an alternative approach to partial least squares regression. Chemometrics and Intelligent Laboratory Systems, 18(3):251-263, 1993.
- [4] Arthur E. Hoerl and Robert W. Kennard. Ridge regression: biased estimation for nonorthogonal problems. Technometrics, 12(1):55-67, 1970. doi:10.1080/00401706.1970.10488634
- [5] Åsmund Rinnan, Frans van den Berg, and Søren Balling Engelsen. Review of the most common pre-processing techniques for near-infrared spectra. TrAC Trends in Analytical Chemistry, 28(10):1201-1222, 2009. doi:10.1016/j.trac.2009.07.007
- [6] Abraham Savitzky and Marcel J. E. Golay. Smoothing and differentiation of data by simplified least squares procedures. Analytical Chemistry, 36(8):1627-1639, 1964. doi:10.1021/ac60214a047
- [7] R. J. Barnes, M. S. Dhanoa, and Susan J. Lister. Standard normal variate transformation and de-trending of near-infrared diffuse reflectance spectra. Applied Spectroscopy, 43(5):772-777, 1989.
- [8] Paul Geladi, Donald MacDougall, and Harald Martens. Linearization and scatter-correction for near-infrared reflectance spectra of meat. Applied Spectroscopy, 39(3):491-500, 1985. doi:10.1366/0003702854248656
- [9] Donald A. Burns and Emil W. Ciurczak. Handbook of Near-Infrared Analysis. CRC Press, 3rd edition, 2007.
- [10] Celio Pasquini. Near infrared spectroscopy: A mature analytical technique with new perspectives. Analytica Chimica Acta, 1026:8-36, 2018. doi:10.1016/j.aca.2018.04.004
- [11] Bernhard Schölkopf and Alexander J. Smola. Learning with Kernels. MIT Press, 2002.
- [12] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning. Springer, 2nd edition, 2009. doi:10.1007/978-0-387-84858-7
- [13] Karl H. Norris and Phil C. Williams. Influence of moisture content on the reflective behavior of grain. Cereal Chemistry, 53(6):794-805, 1976.
- [14] Harald Martens and Edward Stark. Extended multiplicative signal correction and spectral interference subtraction. Journal of Pharmaceutical and Biomedical Analysis, 9(8):625-635, 1991.
- [15] Paul H. C. Eilers. A perfect smoother. Analytical Chemistry, 75(14):3631-3636, 2003. doi:10.1021/ac034173t
- [16] Zhi-Min Zhang, Shan Chen, and Yi-Zeng Liang. Baseline correction using adaptive iteratively reweighted penalized least squares. Analyst, 135(5):1138-1146, 2010.
- [17] Liudmila Prokhorenkova, Gleb Gusev, Aleksandr Vorobev, Anna Veronika Dorogush, and Andrey Gulin. CatBoost: unbiased boosting with categorical features. In Advances in Neural Information Processing Systems, volume 31, 2018.
- [18] Chenhao Cui and Tom Fearn. Modern practical convolutional neural networks for multivariate regression: Applications to NIR calibration. Chemometrics and Intelligent Laboratory Systems, 182:9-20, 2018. doi:10.1016/j.chemolab.2018.07.008
- [19] Puneet Mishra, Dario Passos, Federico Marini, Junli Xu, Jose M. Amigo, Aoife A. Gowen, Jeroen J. Jansen, Arnout Bouwmans, and Jean-Michel Roger. Deep learning for near-infrared spectral data modelling: Hypes and benefits. TrAC Trends in Analytical Chemistry, 157:116804, 2022.
- [20] François Vasseur, Denis Cornet, Grégory Beurier, Julie Messier, Lauriane Rouan, et al. A Perspective on Plant Phenomics: Coupling Deep Learning and Near-Infrared Spectroscopy. Frontiers in Plant Science, 2022.
- [21] Mahugnon E. Houngbo, Lucienne Desfontaines, Jean-Louis Diman, Gemma Arnau, Christian Mestres, Fabrice Davrieux, Lauriane Rouan, Grégory Beurier, Carine Marie-Magdeleine, Karima Meghar, Emmanuel O. Alamu, Bolanle O. Otegbayo, and Denis Cornet. Convolutional neural network allows amylose content prediction in yam (Dioscorea alata L.) flour using near i… Journal of the Science of Food and Agriculture, 2024.
- [22] Axel Vaillant, Grégory Beurier, Denis Cornet, Lauriane Rouan, Denis Vile, Cyrille Violle, and François Vasseur. NIRSpredict: a platform for predicting plant traits from near infra-red spectroscopy. BMC Plant Biology, 24:1100, 2024. doi:10.1186/s12870-024-05776-0
- [23] R. W. Kennard and L. A. Stone. Computer aided design of experiments. Technometrics, 11(1):137-148, 1969.
- [24] Roberto K. H. Galvão, Mario C. U. Araujo, Gledson E. Jose, Marcio J. C. Pontes, Edvan C. Silva, and Teresa C. B. Saldanha. A method for calibration and validation subset partitioning. Talanta, 67(4):736-740, 2005. doi:10.1016/j.talanta.2005.03.025
- [25] Grégory Beurier, Denis Cornet, and Lauriane Rouan. nirs4all: Open spectroscopy for everyone. https://github.com/GBeurier/nirs4all, 2026a.
- [26] Grégory Beurier, Denis Cornet, and Lauriane Rouan. nirs4all Studio: Desktop workflows for near-infrared spectroscopy. https://github.com/GBeurier/nirs4all-webapp, 2026b.