The Infinite-Dimensional Nature of Spectroscopy and Why Models Succeed, Fail, and Mislead

Francesca Venturini; Umberto Michelucci

arXiv: 2604.04717 · v1 · submitted 2026-04-06 · cs.LG · cond-mat.mtrl-sci · cs.AI · stat.ML

Pith reviewed 2026-05-10 18:53 UTC · model grok-4.3

keywords: spectroscopy · machine learning · high dimensionality · concentration of measure · fluorescence spectra · model interpretability · distributional differences · Feldman-Hajek theorem

The pith

Spectroscopic data's infinite dimensionality lets ML models separate even infinitesimal noise differences with perfect accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that spectroscopic datasets behave as if they live in infinite-dimensional spaces, where tiny distributional shifts from noise, normalization, or instruments become linearly separable. This explains why models reach near-perfect classification accuracy on fluorescence spectra even when no chemical differences exist between classes. It also accounts for why feature-importance maps often flag spectrally irrelevant regions. The authors ground the argument in the Feldman-Hajek theorem and concentration of measure, then confirm the effect with controlled synthetic and real experiments. A sympathetic reader cares because this mechanism can produce apparently successful models that fail to generalize or reveal actual chemistry.

Core claim

Even infinitesimal distributional differences in spectral data, whether from noise, preprocessing, or artefacts, become perfectly separable in high-dimensional spaces according to the Feldman-Hajek theorem and concentration of measure; experiments on synthetic and real fluorescence spectra demonstrate that models achieve near-perfect accuracy without any underlying chemical distinctions, while feature-importance maps highlight regions unrelated to the spectra's chemical content.

What carries the argument

The effective infinite-dimensional character of spectral data, which amplifies minuscule distributional shifts into perfect separability via the Feldman-Hajek theorem and concentration of measure.
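The amplification described above can be illustrated with a small simulation. The sketch below borrows the two standard deviations (1.0 and 1.1) and the dimensionalities (2, 50, 500, 5000) from the Figure 1 caption and classifies each sample by thresholding its squared norm. The function name `norm_threshold_accuracy`, the sample count, and the threshold (derived here from the likelihood ratio of two zero-mean isotropic Gaussians) are assumptions of this sketch, not the paper's code.

```python
import numpy as np

def norm_threshold_accuracy(n, sigma1=1.0, sigma2=1.1, samples=500, seed=0):
    """Accuracy of a classifier that only thresholds the squared norm ||x||^2.

    Hypothetical helper: two zero-mean isotropic Gaussian classes in n dimensions.
    """
    rng = np.random.default_rng(seed)
    x1 = rng.normal(0.0, sigma1, size=(samples, n))
    x2 = rng.normal(0.0, sigma2, size=(samples, n))
    # Likelihood-ratio threshold on ||x||^2 (assumes sigma2 > sigma1).
    t = n * np.log(sigma2**2 / sigma1**2) / (1.0 / sigma1**2 - 1.0 / sigma2**2)
    correct1 = np.sum((x1**2).sum(axis=1) < t)   # class 1 should fall below t
    correct2 = np.sum((x2**2).sum(axis=1) >= t)  # class 2 should fall above t
    return (correct1 + correct2) / (2 * samples)

# Same variance gap, increasing dimension.
accuracies = {n: norm_threshold_accuracy(n) for n in (2, 50, 500, 5000)}
for n, acc in accuracies.items():
    print(f"n = {n:5d}  accuracy = {acc:.3f}")
```

Under these assumptions the accuracy should start close to chance for small n and approach 1 as n grows, although nothing about the two classes changes except the number of coordinates.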

If this is right

ML classifiers can reach high reported accuracy on spectroscopic tasks without learning chemically meaningful features.
Feature-importance methods will frequently highlight regions that carry no chemical information.
Standard cross-validation may not detect when separation relies on artefacts rather than chemistry.
Preprocessing steps such as normalization can create or amplify spurious separability.
Model reliability in spectroscopy requires explicit checks for whether separation persists under controlled chemical conditions.
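One way to make the last check concrete is a noise-region control: train only on pixels that carry no peak and see whether accuracy still climbs. The sketch below is a minimal illustration on synthetic data, assuming parameters taken from the figure captions (Lorentzian FWHM 7, additive noise with mean 0 vs. 0.01 and SD 0.01, noise region at pixels 0–50). The 300-point axis, the peak centre near pixel 200, the nearest-centroid classifier, and the helper names `make_spectra` and `noise_region_accuracy` are assumptions introduced here, not the authors' protocol.

```python
import numpy as np

def make_spectra(num, n_points, noise_mean, rng, fwhm=7.0, noise_sd=0.01):
    """Hypothetical generator: one Lorentzian peak plus class-specific noise."""
    x = np.arange(n_points)
    centres = rng.normal(200.0, 10.0, size=(num, 1))  # assumed peak location
    hwhm = fwhm / 2.0
    peak = hwhm**2 / ((x - centres) ** 2 + hwhm**2)
    return peak + rng.normal(noise_mean, noise_sd, size=(num, n_points))

def noise_region_accuracy(k, repeats=20, seed=0):
    """Mean test accuracy of a nearest-centroid rule using k pixels from 0..50."""
    rng = np.random.default_rng(seed)
    a = make_spectra(500, 300, 0.00, rng)  # class 0: identical peak, noise mean 0
    b = make_spectra(500, 300, 0.01, rng)  # class 1: identical peak, noise mean 0.01
    train, test = slice(0, 400), slice(400, 500)  # 80/20 split per class
    scores = []
    for _ in range(repeats):
        cols = rng.choice(51, size=k, replace=False)  # random noise-region pixels
        mu_a = a[train][:, cols].mean(axis=0)
        mu_b = b[train][:, cols].mean(axis=0)
        w = mu_b - mu_a                    # direction between class centroids
        midpoint = 0.5 * (mu_a + mu_b) @ w
        pred_a = a[test][:, cols] @ w > midpoint  # True means "predicted class 1"
        pred_b = b[test][:, cols] @ w > midpoint
        scores.append((np.sum(~pred_a) + np.sum(pred_b)) / 200)
    return float(np.mean(scores))

results = {k: noise_region_accuracy(k) for k in (1, 5, 20, 50)}
for k, acc in results.items():
    print(f"k = {k:2d} noise pixels  accuracy = {acc:.3f}")
```

If a model scores well on such a peak-free region, its reported accuracy cannot be taken as evidence that it learned chemistry.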

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Practitioners may need dimensionality-reduction or representation-learning steps that explicitly preserve chemical invariants before applying classifiers.
Similar high-dimensional separation effects could appear in other scientific domains that produce dense spectral or waveform data.
Interpretability techniques beyond feature importance, such as counterfactual generation under chemical constraints, become necessary to distinguish artefact-driven from chemistry-driven decisions.

Load-bearing premise

Real spectroscopic measurements behave enough like infinite-dimensional random variables whose differences are purely distributional rather than tied to underlying chemical structure.

What would settle it

A controlled experiment on fluorescence spectra where class labels are assigned randomly or by non-chemical criteria yet models trained after standard preprocessing still fail to reach near-perfect accuracy.

Figures

Figures reproduced from arXiv: 2604.04717 by Francesca Venturini, Umberto Michelucci.

**Figure 1.** Illustration of the concentration of measure for multivariate Gaussian distributions. Shown are the empirical distributions of $\|x\|_2$ for samples drawn from $\mathcal{N}(0, 1.0^2 I_n)$ (light blue) and $\mathcal{N}(0, 1.1^2 I_n)$ (yellow), for increasing dimensionalities $n = 2, 50, 500,$ and $5000$ (panels A–D). In low dimensions the two distributions overlap substantially, but as $n$ increases the probability mass concentrates sharply a…

**Figure 2.** Ten representative synthetic one-peak spectra per class used in the synthetic-spectra experiments. Each curve is a Lorentzian profile sampled on an $n = 100$-point axis, with peak centre jittered as $c \sim \mathcal{N}(50, 10^2)$. Class 0 (blue) and Class 1 (orange) differ only through the FWHM $\xi_1 = 7$ vs. $\xi_2 = 9$, illustrating that the two classes are visually difficult to distinguish despite being statistically separable …

**Figure 3.** Fluorescence spectra of Spanish olive oil samples classified as Extra Virgin (EVOO), Virgin (VOO), and Lampante (LOO). The region 380–420 nm indicates the Rayleigh scattering peak from the excitation LED. The black line indicates the average spectrum for each class.

… evaluated the LDA decision boundary $T$ for this setting, $T = n \log\left(\frac{\sigma_2^2}{\sigma_1^2}\right) \frac{1}{\frac{1}{\sigma_2^2} - \frac{1}{\sigma_1^2}}$ (5), which classifies a sample by threshol…

**Figure 4.** Results from experiment N1. Classification accuracy of QDA (regularisation parameter equal to 0.4) as a function of the standard-deviation gap $\Delta\sigma$ between two white-noise classes with equal mean $\mu = 1$ and baseline $\sigma_1 = 1$. Each curve corresponds to a different number of points per array ($n \in [5, 10, 50, 500]$); at each $\Delta\sigma$, $N$ arrays per class are generated and split 80/20 into train/test. The dashed line at 1.…

**Figure 5.** [Image only: figures/full_fig_p009_5.png; no caption recovered.]

**Figure 6.** Results from experiment N3. Test accuracy of QDA (reg_param = 0.4) versus the number of points per array $n$ for two Gaussian white-noise classes with common mean $\mu$ and variances $\sigma_1^2 I_n$ vs. $\sigma_2^2 I_n$. Each curve corresponds to a different standard-deviation gap $\Delta\sigma = \sigma_2 - \sigma_1$. Datasets contain $N$ arrays per class and are split 80/20 into train/test. The dashed horizontal line at 1.0 indicates perfect accuracy.…

**Figure 7.** Results from experiment N4. Cross-validated classification accuracy for four models (from top to bottom: Random Forest, kNN, Decision Tree, Logistic Regression) on synthetic data drawn from a skew-normal distribution in dimension $n = 50$. Each column sweeps two parameters while holding the third at its baseline: (A) $\Delta\mu/\mu_1$ vs. $\Delta\sigma/\sigma_1$, (B) $\Delta\mu/\mu_1$ vs. $\Delta\gamma/\gamma_1$, (C) $\Delta\sigma/\sigma_1$ vs. $\Delta\gamma/\gamma_1$, where $\gamma$ is the skew (shape) paramet…

**Figure 8.** Results from experiment S2. Model validation accuracy (mean ±1 SD over 5-fold CV) versus spectrum length $n$ (log scale) for four classifiers: logistic regression, k-NN, decision tree (max depth 5), and random forest (100 trees). Data consist of synthetic one-peak Lorentzian spectra: the two classes differ only in width ($\xi_1 = 7$ vs. $\xi_2 = 9$) while peak centres are jittered $c \sim \mathcal{N}(50, 10^2)$; 500 spectra per clas…

**Figure 9.** Results from experiment S3. Validation accuracy (mean ±1 SD over 5-fold CV) versus spectrum length $n$ (log scale) for four classifiers on synthetic one-peak spectra with identical signal distributions but class-specific additive noise. Each class contains 500 spectra with Lorentzian FWHM $\xi = 7$ and centres jittered $c \sim \mathcal{N}(50, 10^2)$; noise is i.i.d. Gaussian with mean 0 (class 0) or 0.01 (class 1) and SD 0.01.…

**Figure 10.** Results for experiment Ra3/Rb3. LOO-CV classification accuracy as a function of the number of randomly selected pixels ($k$) from the spectral noise region (pixels 0–50). Panel (A) shows the results for EVOO vs. LOO, and Panel (B) for EVOO vs. VOO. Each data point represents the mean accuracy over 20 independent random subsets, with error bars indicating the standard deviation. The rapid climb to accuracies…

**Figure 11.** Empirical covariance matrices of the fluorescence spectra for the two olive oil classes (EXTRA and LAMPANTE). Bright red areas correspond to regions of strong inter-wavelength covariances, notably around the main fluorescence peak and stray-light regions. Such covariance mismatches are sufficient, in high-dimensional space, to enable nearly perfect classification even when chemically meaningful informati…

**Figure 12.** Results for experiment Ra4/Rb4. LOO-CV classification accuracy is mapped across the fluorescence spectrum (gray line) using non-overlapping windows of increasing size $W$. Left Column (Panels A, C, E, G): Experiment EVOO vs. LOO. Right Column (Panels B, D, F, H): Experiment EVOO vs. VOO. The spectral region between 380–420 nm was explicitly removed to eliminate the Rayleigh scattering peak as a trivial di…

**Figure 13.** Results for experiment Ra5/Rb5. Regional Feature Attribution Map. Mean absolute SHAP values are presented in arbitrary units (a.u.) to facilitate the comparison of relative feature importance across varying window sizes ($W$). Left Column (Panels A, C, E, G): Experiment EVOO vs. LOO. Right Column (Panels B, D, F, H): Experiment EVOO vs. VOO. The spectral region between 380–420 nm was explicitly removed to e…

Original abstract

Machine learning (ML) models have achieved strikingly high accuracies in spectroscopic classification tasks, often without a clear proof that those models used chemically meaningful features. Existing studies have linked these results to data preprocessing choices, noise sensitivity, and model complexity, but no unifying explanation is available so far. In this work, we show that these phenomena arise naturally from the intrinsic high dimensionality of spectral data. Using a theoretical analysis grounded in the Feldman-Hajek theorem and the concentration of measure, we show that even infinitesimal distributional differences, caused by noise, normalisation, or instrumental artefacts, may become perfectly separable in high-dimensional spaces. Through a series of specific experiments on synthetic and real fluorescence spectra, we illustrate how models can achieve near-perfect accuracy even when chemical distinctions are absent, and why feature-importance maps may highlight spectrally irrelevant regions. We provide a rigorous theoretical framework, confirm the effect experimentally, and conclude with practical recommendations for building and interpreting ML models in spectroscopy.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper uses Feldman-Hajek and concentration of measure to explain why ML models hit high accuracy on spectra without learning chemistry, but the infinite-dimensional framing does not map cleanly onto finite spectral vectors.

The main point is that this paper offers a measure-theoretic account for the common observation that ML classifiers on spectroscopic data reach near-perfect accuracy while relying on artifacts rather than chemical signals. It ties this to the Feldman-Hajek theorem on singularity of Gaussian measures and to concentration of measure, claiming that even tiny distributional shifts from noise, normalization, or instruments become perfectly separable in high dimensions. The experiments on synthetic and real fluorescence spectra are meant to show models succeeding when no real chemical difference exists and feature maps lighting up irrelevant regions. That framing is new for this application area and the experiments give a concrete illustration of the practical warning. The authors also close with usable recommendations on validation practices.

The soft spot is exactly the one flagged in the stress-test note. Spectra are finite-length vectors in R^D with D typically a few hundred to a couple thousand, not measures on infinite-dimensional Hilbert space. The paper does not show that the observed perturbations place the shift outside the Cameron-Martin space or change the covariance operator in the precise way needed for mutual singularity under Feldman-Hajek. Concentration arguments can produce easy separation in large but finite D without requiring the measures to be singular. The central claim therefore rests on an assumption that is not verified for the actual data. The experiments still demonstrate a real phenomenon worth worrying about.

This paper is for applied researchers who train or review ML models on spectra and want a theoretical lens on why standard validation can mislead. It deserves a serious referee because the practical concern is important and the experiments are reproducible; the theoretical grounding simply needs tightening to match the finite-dimensional setting.
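The distinction between easy separation in finite D and true singularity can be made concrete with a short worked calculation. The sketch below assumes the simplest setting, two zero-mean isotropic Gaussians $\mathcal{N}(0, \sigma_1^2 I_D)$ and $\mathcal{N}(0, \sigma_2^2 I_D)$ with equal priors, and uses the Bhattacharyya coefficient $\mathrm{BC}_D$ as the overlap measure; this choice of measure and the symbols are assumptions made here for illustration, not taken from the paper.

```latex
% Overlap between p_1 = N(0, \sigma_1^2 I_D) and p_2 = N(0, \sigma_2^2 I_D)
\mathrm{BC}_D
  = \int_{\mathbb{R}^D} \sqrt{p_1(x)\,p_2(x)}\,dx
  = \left( \frac{2\sigma_1\sigma_2}{\sigma_1^2 + \sigma_2^2} \right)^{D/2},
\qquad
P_{\mathrm{err}}^{\mathrm{Bayes}} \le \tfrac{1}{2}\,\mathrm{BC}_D .

% Example with \sigma_1 = 1.0, \sigma_2 = 1.1:
\frac{2\sigma_1\sigma_2}{\sigma_1^2 + \sigma_2^2} = \frac{2.2}{2.21} \approx 0.9955,
\qquad
\mathrm{BC}_{1000} \approx 0.10,
\qquad
\mathrm{BC}_{5000} \approx 1.2 \times 10^{-5}.
```

For every finite D the coefficient is strictly positive and the two measures remain equivalent, yet the bound on the Bayes error decays exponentially in D. Singularity only appears in the limit D → ∞, where the coefficient reaches zero.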

Referee Report

2 major / 2 minor

Summary. The paper claims that the strikingly high accuracies of ML models on spectroscopic classification arise from the intrinsic high (effectively infinite) dimensionality of spectral data. Grounded in the Feldman-Hajek theorem and concentration-of-measure phenomena, it argues that even infinitesimal distributional shifts induced by noise, normalization, or instrumental artifacts become perfectly separable, allowing models to succeed without using chemically meaningful features; this is illustrated via experiments on synthetic and real fluorescence spectra, with accompanying practical recommendations for model building and interpretation.

Significance. If the central theoretical link holds, the work supplies a unifying explanation for otherwise puzzling ML behaviors in spectroscopy, including sensitivity to preprocessing and misleading feature attributions. It applies standard infinite-dimensional probability results to a concrete domain and backs the argument with targeted experiments, which could usefully inform best practices and caution against over-interpreting black-box models on spectral data.

major comments (2)

[Theoretical analysis] Theoretical analysis section: the claim that infinitesimal distributional differences become perfectly separable rests on the Feldman-Hajek theorem for mutually singular Gaussian measures on Hilbert space. Real fluorescence spectra, however, are finite-dimensional vectors in R^D (D typically 500–2000 after discretization). The manuscript must explicitly verify that the observed perturbations (noise, normalization) place the mean shift outside the Cameron-Martin space or alter the covariance operator in the manner required for singularity; otherwise the “perfect separability” conclusion does not follow from the cited theorem.
[Experiments] Experimental section on synthetic and real spectra: the reported near-perfect accuracies are consistent with concentration-of-measure effects in large but finite D, yet the experiments do not include a controlled comparison that isolates the infinite-dimensional singularity mechanism (e.g., by varying D while holding chemical content fixed, or by checking equivalence vs. singularity of the induced measures). Without such controls, it remains possible that the observed separability arises from ordinary finite-dimensional geometry rather than the Feldman-Hajek phenomenon.

minor comments (2)

[Abstract and introduction] The abstract states that the work provides “a rigorous theoretical framework,” but the main text should include the precise statement of the Feldman-Hajek conditions that are being invoked and the mapping from spectral preprocessing steps to those conditions.
[Notation and definitions] Notation for the spectral measures and the covariance operators should be introduced once and used consistently; several passages refer to “distributional differences” without specifying whether these are in total variation, Hellinger, or another metric.
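For orientation, the conditions the referee asks for can be written out in their usual textbook form. The statement below is a sketch in notation chosen here ($m_i$, $C_i$, $H$, $H_0$), not a quotation from the paper, and it assumes Gaussian measures with trace-class covariance operators on a separable Hilbert space.

```latex
% Feldman-Hajek dichotomy, standard form (notation assumed here)
\mu = \mathcal{N}(m_1, C_1), \quad \nu = \mathcal{N}(m_2, C_2)
\quad \text{on a separable Hilbert space } H .

% \mu and \nu are either equivalent or mutually singular.
% They are equivalent if and only if all three conditions hold:
\begin{aligned}
&\text{(i)}   && C_1^{1/2}(H) = C_2^{1/2}(H) =: H_0
               \quad \text{(same Cameron-Martin space)}, \\
&\text{(ii)}  && m_1 - m_2 \in H_0, \\
&\text{(iii)} && \bigl(C_1^{-1/2} C_2^{1/2}\bigr)\bigl(C_1^{-1/2} C_2^{1/2}\bigr)^{*} - I
               \ \text{ is Hilbert-Schmidt on } \overline{H_0}.
\end{aligned}

% Example: pure rescaling, C_2 = c\,C_1 with c > 0, c \neq 1, m_1 = m_2.
% The operator in (iii) equals (c - 1) I, which is Hilbert-Schmidt
% only when \dim H < \infty, so \mu \perp \nu in infinite dimensions.
```

In R^D with non-degenerate covariances all three conditions hold automatically, so the two measures are always equivalent. This is the gap the first major comment points at: singularity is a statement about the infinite-dimensional limit, not about any fixed D.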

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the scope and limitations of our theoretical framework. We respond to each major comment below and indicate the revisions we will incorporate.

Referee: [Theoretical analysis] Theoretical analysis section: the claim that infinitesimal distributional differences become perfectly separable rests on the Feldman-Hajek theorem for mutually singular Gaussian measures on Hilbert space. Real fluorescence spectra, however, are finite-dimensional vectors in R^D (D typically 500–2000 after discretization). The manuscript must explicitly verify that the observed perturbations (noise, normalization) place the mean shift outside the Cameron-Martin space or alter the covariance operator in the manner required for singularity; otherwise the “perfect separability” conclusion does not follow from the cited theorem.

Authors: We agree that the Feldman-Hajek theorem applies strictly in infinite-dimensional Hilbert spaces and that discretized spectra live in finite-dimensional R^D. Our argument treats the high-D regime (D ≫ 1) as an effective approximation to the infinite-dimensional case, where concentration-of-measure effects make even small shifts in mean or covariance produce near-singular measures. We will revise the theoretical section to (i) state the finite-D limitation explicitly, (ii) recall the precise Cameron-Martin condition for Gaussian singularity, and (iii) provide a brief calculation showing that typical normalization and additive-noise perturbations in fluorescence spectra satisfy the required shift outside the Cameron-Martin space for the covariance operators we consider. This will make the link between the theorem and the observed separability rigorous within the finite-D setting. revision: yes
Referee: [Experiments] Experimental section on synthetic and real spectra: the reported near-perfect accuracies are consistent with concentration-of-measure effects in large but finite D, yet the experiments do not include a controlled comparison that isolates the infinite-dimensional singularity mechanism (e.g., by varying D while holding chemical content fixed, or by checking equivalence vs. singularity of the induced measures). Without such controls, it remains possible that the observed separability arises from ordinary finite-dimensional geometry rather than the Feldman-Hajek phenomenon.

Authors: The synthetic-data experiments already allow D to be varied while keeping the underlying chemical signal fixed; we will add an explicit panel (or supplementary figure) that plots classification accuracy against increasing D for fixed noise and normalization levels. This will demonstrate the transition toward perfect separability as D grows, consistent with the concentration-of-measure and Feldman-Hajek limits. For the real spectra we will include a short discussion noting that the observed D (≈ 1000) already places the data in the regime where the finite-dimensional geometry approximates the infinite-dimensional singularity. These additions directly address the request for a controlled isolation of the mechanism. revision: yes

Circularity Check

0 steps flagged

No significant circularity; central claims rest on external theorems

The paper derives its main result from the Feldman-Hajek theorem on equivalence/singularity of Gaussian measures on Hilbert space and the concentration-of-measure phenomenon, both standard external mathematical results independent of the present work. The abstract and description explicitly ground the separability claim in these theorems rather than in any self-defined quantity, fitted parameter, or prior self-citation. Experiments on synthetic and real spectra serve only as illustration, not as definitional inputs that are then re-predicted. No load-bearing step reduces by construction to the paper's own outputs or to a self-citation chain; the derivation chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on two standard mathematical results with no free parameters or new entities introduced.

axioms (2)

standard math · Feldman-Hajek theorem
Invoked to establish when two measures on infinite-dimensional spaces are mutually singular and thus perfectly separable.
standard math · Concentration of measure phenomenon
Used to argue that small distributional perturbations become large separations in high dimensions.
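A short worked example shows how the second axiom operates in the isotropic Gaussian case. The ratio $z_n$ below is a rough signal-to-noise measure defined here for illustration only; it is not a quantity from the paper.

```latex
x \sim \mathcal{N}(0, \sigma^2 I_n)
\;\Longrightarrow\;
\mathbb{E}\!\left[\tfrac{1}{n}\|x\|_2^2\right] = \sigma^2,
\qquad
\operatorname{Var}\!\left[\tfrac{1}{n}\|x\|_2^2\right] = \frac{2\sigma^4}{n}.

% Gap between the two class means, in units of the summed standard deviations:
z_n = \frac{\sigma_2^2 - \sigma_1^2}{\sqrt{2/n}\,\bigl(\sigma_1^2 + \sigma_2^2\bigr)}
    = \sqrt{\frac{n}{2}}\;\frac{\sigma_2^2 - \sigma_1^2}{\sigma_1^2 + \sigma_2^2}.

% With \sigma_1 = 1.0, \sigma_2 = 1.1:
z_2 \approx 0.10, \qquad z_{5000} \approx 4.8 .
```

The class means of the normalised squared norm stay fixed while their spread shrinks like $1/\sqrt{n}$, so the same variance gap that is invisible at n = 2 becomes several standard deviations wide at n = 5000.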

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking (D=3 forcing) · unclear
Paper passage: Using a theoretical analysis grounded in the Feldman-Hájek theorem and the concentration of measure, we show that even infinitesimal distributional differences... may become perfectly separable in high-dimensional spaces.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel (J-cost uniqueness) · unclear
Paper passage: The Feldman-Hájek theorem... in infinite (or with a good approximation in very high) dimensions, even the smallest difference in mean or covariance makes the two distributions mutually singular.

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · LogicNat recovery / embed_strictMono · unclear
Paper passage: Gaussian mixture dichotomy theorem... in the limit of infinite dimensions we can expect an almost perfect classification for basically every dataset.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

22 extracted references · 22 canonical work pages

[1] Howard Mark and Jerry Workman Jr. Chemometrics in spectroscopy. Elsevier, 2010.
[2] Camilo L. M. Morais, Kássio M. G. Lima, Maneesh Singh, and Francis L. Martin. Tutorial: multivariate classification for vibrational spectroscopy in biological samples. Nature Protocols, 15(7):2143–2162, July 2020. Publisher: Nature Publishing Group.
[3] Shuxia Guo, Jürgen Popp, and Thomas Bocklitz. Chemometric analysis in raman spectroscopy from experimental design to machine learning–based modeling. Nature protocols, 16(12):5426–5459, 2021.
[4] Jerome Workman Jr. Demystifying the Black Box: Making Machine Learning Models Explainable in Spectroscopy. https://www.spectroscopyonline.com/view/demystifying-the-black-box-making-machine-learning-models-explainable-in-spectroscopy, September 2025. Last accessed on 13th Oct. 2025.
[5] Antonios Mamalakis, Elizabeth A. Barnes, and Imme Ebert-Uphoff. Carefully choose the baseline: Lessons learned from applying XAI attribution methods for regression tasks in geoscience, August 2022. arXiv:2208.09473 [physics].
[6] Jhonatan Contreras and Thomas Bocklitz. Explainable artificial intelligence for spectroscopy data: a review. Pflügers Archiv - European Journal of Physiology, 477(4):603–615, April 2025.
[7] Mehrnaz Zehtabvar, Kazem Taghandiki, Nahid Madani, Dariush Sardari, and Bashir Bashiri. A Review on the Application of Machine Learning in Gamma Spectroscopy: Challenges and Opportunities. Spectroscopy Journal, 2(3):123–144, September 2024. Publisher: Multidisciplinary Digital Publishing Institute.
[8] David Steinmann, Felix Divo, Maurice Kraus, Antonia Wüst, Lukas Struppek, Felix Friedrich, and Kristian Kersting. Navigating Shortcuts, Spurious Correlations, and Confounders: From Origins via Detection to Mitigation, December 2024. arXiv:2412.05152 [cs].
[9] Laasya Samhita and Hans J Gross. The Clever Hans Phenomenon revisited. Communicative & Integrative Biology, 6(6):e27122, November 2013. Publisher: Taylor & Francis. https://doi.org/10.4161/cib.27122
[10] Rola Houhou and Thomas Bocklitz. Trends in artificial intelligence, machine learning, and chemometrics applied to chemical data. Analytical Science Advances, 2(3-4):128–141, 2021. https://chemistry-europe.onlinelibrary.wiley.com/doi/pdf/10.1002/ansa.202000162
[11] C. Th. J. Alkemade, W. Snelleman, G. D. Boutilier, B. D. Pollard, J. D. Winefordner, T. L. Chester, and N. Omenetto. A review and tutorial discussion of noise and signal-to-noise ratios in analytical spectrometry - I. Fundamental principles of signal-to-noise ratios. Spectrochimica Acta Part B: Atomic Spectroscopy, 33(8):383–399, January 1978.
[12] Jacob Feldman. Equivalence and perpendicularity of gaussian processes. Pacific Journal of Mathematics, 8(4):699–708, 1958.
[13] Jaroslav Hájek. On a property of normal distributions of any stochastic process. Czechoslovak Mathematical Journal, 8(4):610–618, 1958. Publisher: Institute of Mathematics, Academy of Sciences of the Czech Republic.
[14] Sheldon Axler. Measure, Integration & Real Analysis, volume 282 of Graduate Texts in Mathematics. Springer International Publishing, Cham, 2020.
[15] Umberto Michelucci. The Feldman-Hájek dichotomy for countable Gaussian mixtures and their asymptotic separability in high dimensions. https://arxiv.org/abs/2601.03911, January 2026. arXiv:2601.03911 [math.ST].
[16] John PA Ioannidis. Microarrays and molecular research: noise discovery? The Lancet, 365(9458):454–455, 2005.
[17] Robert Clarke, Habtom W Ressom, Antai Wang, Jianhua Xuan, Minetta C Liu, Edmund A Gehan, and Yue Wang. The properties of high-dimensional data spaces: implications for exploring gene and protein expression data. Nature reviews cancer, 8(1):37–49, 2008.
[18] Edward Vul, Christine Harris, Piotr Winkielman, and Harold Pashler. Voodoo correlations in social neuroscience. Perspectives on psychological Science, 4(3):274–290, 2009.
[19] A. Azzalini and A. Dalla Valle. The multivariate skew-normal distribution. Biometrika, 83(4):715–726, December 1996.
[20] Sagnik Mondal, Reinaldo B. Arellano-Valle, and Marc G. Genton. A multivariate modified skew-normal distribution. Statistical Papers, 65(2):511–555, April 2024.
[21] Francesca Venturini, Michela Sperti, Umberto Michelucci, Ivo Herzig, Michael Baumgartner, Josep Palau Caballero, Arturo Jimenez, and Marco Agostino Deriu. Exploration of spanish olive oil quality with a miniaturized low-cost fluorescence sensor and machine learning techniques. Foods, 10(5):1010, 2021.
[22] Scott M. Lundberg and Su-In Lee. A unified approach to interpreting model predictions. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 4765–4774. Curran Associates, Inc., 2017.

[1] [1]

Chemometrics in spectroscopy

Howard Mark and Jerry Workman Jr. Chemometrics in spectroscopy. Elsevier, 2010

work page 2010

[2] [2]

Camilo L. M. Morais, Kássio M. G. Lima, Maneesh Singh, and Francis L. Martin. Tutorial: multivariate clas- siﬁcation for vibrational spectroscopy in biological samples. Nature Protocols, 15(7):2143–2162, July 2020. Publisher: Nature Publishing Group

work page 2020

[3] [3]

Chemometric analysis in raman spectroscopy from experimen- tal design to machine learning–based modeling

Shuxia Guo, Jürgen Popp, and Thomas Bocklitz. Chemometric analysis in raman spectroscopy from experimen- tal design to machine learning–based modeling. Nature protocols, 16(12):5426–5459, 2021

work page 2021

[4] [4]

Demystifying the Black Box: Making Machine Learning Mod- els Explainable in Spectroscopy

Jerome Workman Jr. Demystifying the Black Box: Making Machine Learning Mod- els Explainable in Spectroscopy. https://www.spectroscopyonline.com/view/ demystifying-the-black-box-making-machine-learning-models-explainable-in-spectroscopy , September 2025. Last acccessed on 13th Oct. 2025

work page 2025

[5] [5]

Barnes, and Imme Ebert-Uphoff

Antonios Mamalakis, Elizabeth A. Barnes, and Imme Ebert-Uphoff. Carefully choose the baseline: Lessons learned from applying XAI attribution methods for regression tasks in geoscience, August 2022. arXiv:2208.09473 [physics]

work page arXiv 2022

[6] [6]

Explainable artiﬁcial intelligence for spectroscopy data: a review

Jhonatan Contreras and Thomas Bocklitz. Explainable artiﬁcial intelligence for spectroscopy data: a review. Pﬂügers Archiv - European Journal of Physiology , 477(4):603–615, April 2025

work page 2025

[7]

Mehrnaz Zehtabvar, Kazem Taghandiki, Nahid Madani, Dariush Sardari, and Bashir Bashiri. A Review on the Application of Machine Learning in Gamma Spectroscopy: Challenges and Opportunities. Spectroscopy Journal, 2(3):123–144, September 2024. Publisher: Multidisciplinary Digital Publishing Institute.

[8]

David Steinmann, Felix Divo, Maurice Kraus, Antonia Wüst, Lukas Struppek, Felix Friedrich, and Kristian Kersting. Navigating Shortcuts, Spurious Correlations, and Confounders: From Origins via Detection to Mitigation, December 2024. arXiv:2412.05152 [cs].

[9]

Laasya Samhita and Hans J Gross. The Clever Hans Phenomenon revisited. Communicative & Integrative Biology, 6(6):e27122, November 2013. Publisher: Taylor & Francis. https://doi.org/10.4161/cib.27122

[10]

Rola Houhou and Thomas Bocklitz. Trends in artificial intelligence, machine learning, and chemometrics applied to chemical data. Analytical Science Advances, 2(3-4):128–141, 2021. https://chemistry-europe.onlinelibrary.wiley.com/doi/pdf/10.1002/ansa.202000162

[11]

C. Th. J. Alkemade, W. Snelleman, G. D. Boutilier, B. D. Pollard, J. D. Winefordner, T. L. Chester, and N. Omenetto. A review and tutorial discussion of noise and signal-to-noise ratios in analytical spectrometry I. Fundamental principles of signal-to-noise ratios. Spectrochimica Acta Part B: Atomic Spectroscopy, 33(8):383–399, January 1978.

[12]

Jacob Feldman. Equivalence and perpendicularity of Gaussian processes. Pacific Journal of Mathematics, 8(4):699–708, 1958.

[13]

Jaroslav Hájek. On a property of normal distributions of any stochastic process. Czechoslovak Mathematical Journal, 8(4):610–618, 1958. Publisher: Institute of Mathematics, Academy of Sciences of the Czech Republic.

[14]

Sheldon Axler. Measure, Integration & Real Analysis, volume 282 of Graduate Texts in Mathematics. Springer International Publishing, Cham, 2020.

[15]

Umberto Michelucci. The Feldman-Hájek dichotomy for countable Gaussian mixtures and their asymptotic separability in high dimensions. https://arxiv.org/abs/2601.03911, January 2026. arXiv:2601.03911 [math.ST].

[16]

John PA Ioannidis. Microarrays and molecular research: noise discovery? The Lancet, 365(9458):454–455, 2005.

[17]

Robert Clarke, Habtom W Ressom, Antai Wang, Jianhua Xuan, Minetta C Liu, Edmund A Gehan, and Yue Wang. The properties of high-dimensional data spaces: implications for exploring gene and protein expression data. Nature Reviews Cancer, 8(1):37–49, 2008.

[18]

Edward Vul, Christine Harris, Piotr Winkielman, and Harold Pashler. Voodoo correlations in social neuroscience. Perspectives on Psychological Science, 4(3):274–290, 2009.

[19]

A. Azzalini and A. Dalla Valle. The multivariate skew-normal distribution. Biometrika, 83(4):715–726, December 1996.

[20]

Sagnik Mondal, Reinaldo B. Arellano-Valle, and Marc G. Genton. A multivariate modified skew-normal distribution. Statistical Papers, 65(2):511–555, April 2024.

[21]

Francesca Venturini, Michela Sperti, Umberto Michelucci, Ivo Herzig, Michael Baumgartner, Josep Palau Caballero, Arturo Jimenez, and Marco Agostino Deriu. Exploration of Spanish olive oil quality with a miniaturized low-cost fluorescence sensor and machine learning techniques. Foods, 10(5):1010, 2021.

[22]

Scott M. Lundberg and Su-In Lee. A unified approach to interpreting model predictions. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 4765–4774. Curran Associates, Inc., 2017.