
arXiv: 2604.04717 · v1 · submitted 2026-04-06 · cs.LG · cond-mat.mtrl-sci · cs.AI · stat.ML

The Infinite-Dimensional Nature of Spectroscopy and Why Models Succeed, Fail, and Mislead

Pith reviewed 2026-05-10 18:53 UTC · model grok-4.3

keywords: spectroscopy · machine learning · high dimensionality · concentration of measure · fluorescence spectra · model interpretability · distributional differences · Feldman-Hajek theorem

The pith

The effectively infinite dimensionality of spectroscopic data lets ML models separate even tiny noise-induced distributional differences with near-perfect accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that spectroscopic datasets behave as if they live in infinite-dimensional spaces, where tiny distributional shifts from noise, normalization, or instrumental artefacts can become almost perfectly separable. This explains why models reach near-perfect classification accuracy on fluorescence spectra even when no chemical differences exist between classes. It also accounts for why feature-importance maps often flag spectrally irrelevant regions. The authors ground the argument in the Feldman-Hajek theorem and concentration of measure, then illustrate the effect with controlled experiments on synthetic and real fluorescence spectra. This matters because the mechanism can produce apparently successful models that fail to generalize or to reveal actual chemistry.

Core claim

Even infinitesimal distributional differences in spectral data, whether from noise, preprocessing, or artefacts, can become perfectly separable in high-dimensional spaces, as follows from the Feldman-Hajek theorem and concentration of measure. Experiments on synthetic and real fluorescence spectra show models achieving near-perfect accuracy without any underlying chemical distinction, while feature-importance maps highlight regions unrelated to the spectra's chemical content.

What carries the argument

The effective infinite-dimensional character of spectral data, which amplifies minuscule distributional shifts into perfect separability via the Feldman-Hajek theorem and concentration of measure.
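The amplification can be sketched numerically. The example below is a minimal illustration and not the authors' code: it draws samples from two zero-mean isotropic Gaussians that differ only in standard deviation (1.0 vs. 1.1, the values quoted in the Figure 1 caption) and classifies each sample by thresholding its squared norm. The midpoint threshold, the sample counts, and the name `norm_threshold_accuracy` are assumptions made for illustration.

```python
import random


def norm_threshold_accuracy(n, sigma1=1.0, sigma2=1.1, samples=200, seed=0):
    """Accuracy of a squared-norm threshold on two zero-mean isotropic Gaussians.

    Assumptions: sigma2 > sigma1, and the threshold is the midpoint of the two
    expected squared norms, n * sigma1**2 and n * sigma2**2 (a simple choice,
    not an optimal one).
    """
    rng = random.Random(seed)
    threshold = n * (sigma1**2 + sigma2**2) / 2.0
    correct = 0
    for label, sigma in ((0, sigma1), (1, sigma2)):
        for _ in range(samples):
            # Squared Euclidean norm of one n-dimensional sample.
            sq_norm = sum(rng.gauss(0.0, sigma) ** 2 for _ in range(n))
            predicted = 1 if sq_norm > threshold else 0
            correct += predicted == label
    return correct / (2 * samples)
```

Evaluating the function for increasing n (for example 2, 50, 500, 2000) should show accuracy rising from close to chance toward 1, even though the per-coordinate difference between the classes never changes.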

If this is right

  • ML classifiers can reach high reported accuracy on spectroscopic tasks without learning chemically meaningful features.
  • Feature-importance methods will frequently highlight regions that carry no chemical information.
  • Standard cross-validation may not detect when separation relies on artefacts rather than chemistry.
  • Preprocessing steps such as normalization can create or amplify spurious separability.
  • Model reliability in spectroscopy requires explicit checks for whether separation persists under controlled chemical conditions.
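One hypothetical route by which normalization can leave a non-chemical difference in place is sketched below. The setup is an assumption made for illustration and is not taken from the paper: both classes share the same Lorentzian shape and the same additive noise, and differ only in overall intensity (1.0 vs. 1.1). Dividing each spectrum by its maximum makes the mean shapes coincide, but the baseline noise is rescaled by a different factor in each class, which leaves a variance gap.

```python
import random
import statistics


def baseline_sd_after_max_norm(scale, count=200, n=200, noise_sd=0.01, seed=0):
    """Pooled SD of baseline pixels (0-49) after dividing each spectrum by its maximum.

    Hypothetical setup: a unit-height Lorentzian (centre 150, FWHM 7) multiplied
    by an intensity `scale`, plus i.i.d. Gaussian noise with the same SD in
    every class.
    """
    rng = random.Random(seed)
    shape = [1.0 / (1.0 + ((x - 150) / 3.5) ** 2) for x in range(n)]
    baseline = []
    for _ in range(count):
        raw = [scale * s + rng.gauss(0.0, noise_sd) for s in shape]
        peak = max(raw)  # max-normalization constant for this spectrum
        baseline.extend(value / peak for value in raw[:50])
    return statistics.pstdev(baseline)


sd_low = baseline_sd_after_max_norm(scale=1.0, seed=0)
sd_high = baseline_sd_after_max_norm(scale=1.1, seed=1)
ratio = sd_low / sd_high  # close to the intensity ratio of the two classes
```

In this toy setting the intensity difference does not disappear under normalization; it reappears as a difference in baseline noise level, which is the kind of purely distributional gap the argument says becomes separable as the number of points grows.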

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Practitioners may need dimensionality-reduction or representation-learning steps that explicitly preserve chemical invariants before applying classifiers.
  • Similar high-dimensional separation effects could appear in other scientific domains that produce dense spectral or waveform data.
  • Interpretability techniques beyond feature importance, such as counterfactual generation under chemical constraints, become necessary to distinguish artefact-driven from chemistry-driven decisions.

Load-bearing premise

Real spectroscopic measurements behave sufficiently like infinite-dimensional random variables whose differences are purely distributional rather than tied to underlying chemical structure.

What would settle it

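A controlled experiment on fluorescence spectra in which the classes differ only by a non-chemical criterion (for example instrument, measurement session, or noise level), yet models trained after standard preprocessing still fail to reach near-perfect accuracy as the number of spectral points grows. Purely random label assignment would not serve as a test: both classes would then share one distribution, and the argument itself predicts chance-level accuracy.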
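One way such a control could be set up is to restrict the classifier to pixels from a region assumed to carry no chemical signal and to track accuracy as pixels are added, in the spirit of the Figure 10 caption (random subsets of pixels 0–50, 20 repeats, leave-one-out validation). The sketch below runs on synthetic noise, not on the olive-oil data. The nearest-centroid rule, the class offset of 0.01, the noise SD of 0.01, and all function names are assumptions made for illustration.

```python
import random


def make_noise_region(rng, count, offset, sd=0.01, width=51):
    """Synthetic stand-in for pixels 0-50: i.i.d. Gaussian noise plus a class offset."""
    return [[rng.gauss(offset, sd) for _ in range(width)] for _ in range(count)]


def loo_nearest_centroid(class0, class1, pixels):
    """Leave-one-out accuracy of a nearest-centroid rule on the chosen pixels."""
    data = [(row, 0) for row in class0] + [(row, 1) for row in class1]
    sums = [[sum(row[p] for row in cls) for p in pixels] for cls in (class0, class1)]
    counts = [len(class0), len(class1)]
    correct = 0
    for row, label in data:
        dists = []
        for c in (0, 1):
            held_out = 1 if c == label else 0
            # Centroid of class c, excluding the held-out sample if it belongs to c.
            centroid = [(s - held_out * row[p]) / (counts[c] - held_out)
                        for s, p in zip(sums[c], pixels)]
            dists.append(sum((row[p] - m) ** 2 for p, m in zip(pixels, centroid)))
        correct += (dists[1] < dists[0]) == bool(label)
    return correct / len(data)


def accuracy_vs_k(k, repeats=20, seed=0):
    """Mean LOO accuracy over `repeats` random k-pixel subsets of the noise region."""
    rng = random.Random(seed)
    class0 = make_noise_region(rng, 30, 0.0)
    class1 = make_noise_region(rng, 30, 0.01)
    scores = [loo_nearest_centroid(class0, class1, rng.sample(range(51), k))
              for _ in range(repeats)]
    return sum(scores) / repeats
```

Comparing `accuracy_vs_k(1)` with `accuracy_vs_k(50)` shows how quickly a signal-free region becomes discriminative in this toy setting. On measured spectra, high accuracy from such a region would point to artefact-driven separation; persistently low accuracy would count against the paper's mechanism.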

Figures

Figures reproduced from arXiv: 2604.04717 by Francesca Venturini, Umberto Michelucci.

Figure 1
Figure 1: Illustration of the concentration of measure for multivariate Gaussian distributions. Shown are the empirical distributions of $\|x\|_2$ for samples drawn from $\mathcal{N}(0, 1.0^2 I_n)$ (light blue) and $\mathcal{N}(0, 1.1^2 I_n)$ (yellow), for increasing dimensionalities n = 2, 50, 500, and 5000 (panels A–D). In low dimensions the two distributions overlap substantially, but as n increases the probability mass concentrates sharply a…
Figure 2
Figure 2: Ten representative synthetic one-peak spectra per class used in the synthetic-spectra experiments. Each curve is a Lorentzian profile sampled on an n = 100-point axis, with peak centre jittered as $c \sim \mathcal{N}(50, 10^2)$. Class 0 (blue) and Class 1 (orange) differ only through the FWHM, $\xi_1 = 7$ vs. $\xi_2 = 9$, illustrating that the two classes are visually difficult to distinguish despite being statistically separable …
Figure 3
Figure 3: Fluorescence spectra of Spanish olive oil samples classified as Extra Virgin (EVOO), Virgin (VOO), and Lampante (LOO). The region 380–420 nm indicates the Rayleigh scattering peak from the excitation LED. The black line indicates the average spectrum for each class.
… evaluated the LDA decision boundary $T$ for this setting, $T = n \log\left(\frac{\sigma_2^2}{\sigma_1^2}\right) \frac{1}{\frac{1}{\sigma_2^2} - \frac{1}{\sigma_1^2}}$ (5), which classifies a sample by threshol…
Figure 4
Figure 4: Results from experiment N1. Classification accuracy of QDA (regularisation parameter equal to 0.4) as a function of the standard-deviation gap $\Delta\sigma$ between two white-noise classes with equal mean $\mu = 1$ and baseline $\sigma_1 = 1$. Each curve corresponds to a different number of points per array ($n \in \{5, 10, 50, 500\}$); at each $\Delta\sigma$, N arrays per class are generated and split 80/20 into train/test. The dashed line at 1.…
Figure 5
Figure 5: [image only, no caption text]
Figure 6
Figure 6: Results from experiment N3. Test accuracy of QDA (reg_param = 0.4) versus the number of points per array n for two Gaussian white-noise classes with common mean $\mu$ and variances $\sigma_1^2 I_n$ vs. $\sigma_2^2 I_n$. Each curve corresponds to a different standard-deviation gap $\Delta\sigma = \sigma_2 - \sigma_1$. Datasets contain N arrays per class and are split 80/20 into train/test. The dashed horizontal line at 1.0 indicates perfect accuracy.…
Figure 7
Figure 7: Results from experiment N4. Cross-validated classification accuracy for four models (from top to bottom: Random Forest, kNN, Decision Tree, Logistic Regression) on synthetic data drawn from a skew-normal distribution in dimension n = 50. Each column sweeps two parameters while holding the third at its baseline: (A) $\Delta\mu/\mu_1$ vs. $\Delta\sigma/\sigma_1$, (B) $\Delta\mu/\mu_1$ vs. $\Delta\gamma/\gamma_1$, (C) $\Delta\sigma/\sigma_1$ vs. $\Delta\gamma/\gamma_1$, where $\gamma$ is the skew (shape) paramet…
Figure 8
Figure 8: Results from experiment S2. Model validation accuracy (mean ±1 SD over 5-fold CV) versus spectrum length n (log scale) for four classifiers: logistic regression, k-NN, decision tree (max depth 5), and random forest (100 trees). Data consist of synthetic one-peak Lorentzian spectra: the two classes differ only in width ($\xi_1 = 7$ vs. $\xi_2 = 9$) while peak centres are jittered as $c \sim \mathcal{N}(50, 10^2)$; 500 spectra per clas…
Figure 9
Figure 9: Results from experiment S3. Validation accuracy (mean ±1 SD over 5-fold CV) versus spectrum length n (log scale) for four classifiers on synthetic one-peak spectra with identical signal distributions but class-specific additive noise. Each class contains 500 spectra with Lorentzian FWHM $\xi = 7$ and centres jittered as $c \sim \mathcal{N}(50, 10^2)$; noise is i.i.d. Gaussian with mean 0 (class 0) or 0.01 (class 1) and SD 0.01.…
Figure 10
Figure 10: Results for experiment Ra3/Rb3. LOO-CV classification accuracy as a function of the number of randomly selected pixels (k) from the spectral noise region (pixels 0–50). Panel (A) shows the results for EVOO vs. LOO, and Panel (B) for EVOO vs. VOO. Each data point represents the mean accuracy over 20 independent random subsets, with error bars indicating the standard deviation. The rapid climb to accuracies…
Figure 11
Figure 11: Empirical covariance matrices of the fluorescence spectra for the two olive oil classes (EXTRA and LAMPANTE). Bright red areas correspond to regions of strong inter-wavelength covariances, notably around the main fluorescence peak and stray-light regions. Such covariance mismatches are sufficient, in high-dimensional space, to enable nearly perfect classification even when chemically meaningful informati…
Figure 12
Figure 12: Results for experiment Ra4/Rb4. LOO-CV classification accuracy is mapped across the fluorescence spectrum (gray line) using non-overlapping windows of increasing size W. Left Column (Panels A, C, E, G): Experiment EVOO vs. LOO. Right Column (Panels B, D, F, H): Experiment EVOO vs. VOO. The spectral region between 380–420 nm was explicitly removed to eliminate the Rayleigh scattering peak as a trivial di…
Figure 13
Figure 13: Results for experiment Ra5/Rb5. Regional Feature Attribution Map. Mean absolute SHAP values are presented in arbitrary units (a.u.) to facilitate the comparison of relative feature importance across varying window sizes (W). Left Column (Panels A, C, E, G): Experiment EVOO vs. LOO. Right Column (Panels B, D, F, H): Experiment EVOO vs. VOO. The spectral region between 380–420 nm was explicitly removed to e…
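The synthetic spectra described in the Figure 2 caption (Lorentzian profile on a 100-point axis, centre jittered around 50 with SD 10, FWHM 7 vs. 9) could be generated along the following lines. This is a sketch under stated assumptions: unit peak height, integer pixel positions, the random seeds, and the function names are not given in the caption and are chosen here for illustration.

```python
import random


def lorentzian_spectrum(n, centre, fwhm):
    """Unit-height Lorentzian sampled at integer positions 0..n-1 (height is an assumption)."""
    half_width = fwhm / 2.0
    return [1.0 / (1.0 + ((x - centre) / half_width) ** 2) for x in range(n)]


def make_class(fwhm, count=10, n=100, centre_mean=50.0, centre_sd=10.0, seed=0):
    """Draw `count` spectra with peak centre jittered as N(centre_mean, centre_sd**2)."""
    rng = random.Random(seed)
    return [lorentzian_spectrum(n, rng.gauss(centre_mean, centre_sd), fwhm)
            for _ in range(count)]


class0 = make_class(fwhm=7.0, seed=0)  # FWHM 7, as in the caption
class1 = make_class(fwhm=9.0, seed=1)  # FWHM 9, as in the caption
```

Because the centre jitter (SD 10) is larger than the width difference (2 points), individual curves from the two classes are hard to tell apart by eye, which is the point the caption makes.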
Abstract

Machine learning (ML) models have achieved strikingly high accuracies in spectroscopic classification tasks, often without a clear proof that those models used chemically meaningful features. Existing studies have linked these results to data preprocessing choices, noise sensitivity, and model complexity, but no unifying explanation is available so far. In this work, we show that these phenomena arise naturally from the intrinsic high dimensionality of spectral data. Using a theoretical analysis grounded in the Feldman-Hajek theorem and the concentration of measure, we show that even infinitesimal distributional differences, caused by noise, normalisation, or instrumental artefacts, may become perfectly separable in high-dimensional spaces. Through a series of specific experiments on synthetic and real fluorescence spectra, we illustrate how models can achieve near-perfect accuracy even when chemical distinctions are absent, and why feature-importance maps may highlight spectrally irrelevant regions. We provide a rigorous theoretical framework, confirm the effect experimentally, and conclude with practical recommendations for building and interpreting ML models in spectroscopy.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that the strikingly high accuracies of ML models on spectroscopic classification arise from the intrinsic high (effectively infinite) dimensionality of spectral data. Grounded in the Feldman-Hajek theorem and concentration-of-measure phenomena, it argues that even infinitesimal distributional shifts induced by noise, normalization, or instrumental artifacts become perfectly separable, allowing models to succeed without using chemically meaningful features; this is illustrated via experiments on synthetic and real fluorescence spectra, with accompanying practical recommendations for model building and interpretation.

Significance. If the central theoretical link holds, the work supplies a unifying explanation for otherwise puzzling ML behaviors in spectroscopy, including sensitivity to preprocessing and misleading feature attributions. It applies standard infinite-dimensional probability results to a concrete domain and backs the argument with targeted experiments, which could usefully inform best practices and caution against over-interpreting black-box models on spectral data.

major comments (2)
  1. [Theoretical analysis] Theoretical analysis section: the claim that infinitesimal distributional differences become perfectly separable rests on the Feldman-Hajek theorem for mutually singular Gaussian measures on Hilbert space. Real fluorescence spectra, however, are finite-dimensional vectors in R^D (D typically 500–2000 after discretization). The manuscript must explicitly verify that the observed perturbations (noise, normalization) place the mean shift outside the Cameron-Martin space or alter the covariance operator in the manner required for singularity; otherwise the “perfect separability” conclusion does not follow from the cited theorem.
  2. [Experiments] Experimental section on synthetic and real spectra: the reported near-perfect accuracies are consistent with concentration-of-measure effects in large but finite D, yet the experiments do not include a controlled comparison that isolates the infinite-dimensional singularity mechanism (e.g., by varying D while holding chemical content fixed, or by checking equivalence vs. singularity of the induced measures). Without such controls, it remains possible that the observed separability arises from ordinary finite-dimensional geometry rather than the Feldman-Hajek phenomenon.
minor comments (2)
  1. [Abstract and introduction] The abstract states that the work provides “a rigorous theoretical framework,” but the main text should include the precise statement of the Feldman-Hajek conditions that are being invoked and the mapping from spectral preprocessing steps to those conditions.
  2. [Notation and definitions] Notation for the spectral measures and the covariance operators should be introduced once and used consistently; several passages refer to “distributional differences” without specifying whether these are in total variation, Hellinger, or another metric.
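For reference, the dichotomy that minor comment 1 asks the authors to state is usually given in the following textbook form. This is a sketch of the standard statement and not a quotation from the paper; the symbols $m_i$, $C_i$, $H_0$ and the scaling constant $c$ are notation assumed here.

```latex
\textbf{Feldman--Hajek dichotomy.} Let $\mu_i = \mathcal{N}(m_i, C_i)$, $i = 1, 2$,
be Gaussian measures on a separable Hilbert space $H$. Then $\mu_1$ and $\mu_2$
are either equivalent or mutually singular. They are equivalent if and only if
\begin{enumerate}
  \item $C_1^{1/2}(H) = C_2^{1/2}(H) =: H_0$ (common Cameron--Martin space),
  \item $m_1 - m_2 \in H_0$,
  \item $\bigl(C_1^{-1/2} C_2^{1/2}\bigr)\bigl(C_1^{-1/2} C_2^{1/2}\bigr)^{*} - I$
        is a Hilbert--Schmidt operator on $\overline{H_0}$.
\end{enumerate}

\textbf{Example (pure variance gap).} Take $m_1 = m_2$ and $C_2 = c\,C_1$ with
$c = \sigma_2^2 / \sigma_1^2 \neq 1$. Conditions 1 and 2 hold, and the operator
in condition 3 equals $(c - 1) I$, so
\[
  \bigl\| (c - 1) I \bigr\|_{\mathrm{HS}}^2 = (c - 1)^2 \dim \overline{H_0}.
\]
This is finite in $\mathbb{R}^D$ (the measures are equivalent) and infinite when
$\dim \overline{H_0} = \infty$ (the measures are mutually singular).
```

The example makes the referee's first major comment concrete: for non-degenerate covariances in finite dimension the two measures are always equivalent, so "perfect separability" can only be read as a statement about the limit of growing dimension.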

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the scope and limitations of our theoretical framework. We respond to each major comment below and indicate the revisions we will incorporate.

Point-by-point responses
  1. Referee: [Theoretical analysis] Theoretical analysis section: the claim that infinitesimal distributional differences become perfectly separable rests on the Feldman-Hajek theorem for mutually singular Gaussian measures on Hilbert space. Real fluorescence spectra, however, are finite-dimensional vectors in R^D (D typically 500–2000 after discretization). The manuscript must explicitly verify that the observed perturbations (noise, normalization) place the mean shift outside the Cameron-Martin space or alter the covariance operator in the manner required for singularity; otherwise the “perfect separability” conclusion does not follow from the cited theorem.

    Authors: We agree that the Feldman-Hajek theorem applies strictly in infinite-dimensional Hilbert spaces and that discretized spectra live in finite-dimensional R^D. Our argument treats the high-D regime (D ≫ 1) as an effective approximation to the infinite-dimensional case, where concentration-of-measure effects make even small shifts in mean or covariance produce near-singular measures. We will revise the theoretical section to (i) state the finite-D limitation explicitly, (ii) recall the precise Cameron-Martin condition for Gaussian singularity, and (iii) provide a brief calculation showing that typical normalization and additive-noise perturbations in fluorescence spectra satisfy the required shift outside the Cameron-Martin space for the covariance operators we consider. This will make the link between the theorem and the observed separability rigorous within the finite-D setting. revision: yes

  2. Referee: [Experiments] Experimental section on synthetic and real spectra: the reported near-perfect accuracies are consistent with concentration-of-measure effects in large but finite D, yet the experiments do not include a controlled comparison that isolates the infinite-dimensional singularity mechanism (e.g., by varying D while holding chemical content fixed, or by checking equivalence vs. singularity of the induced measures). Without such controls, it remains possible that the observed separability arises from ordinary finite-dimensional geometry rather than the Feldman-Hajek phenomenon.

    Authors: The synthetic-data experiments already allow D to be varied while keeping the underlying chemical signal fixed; we will add an explicit panel (or supplementary figure) that plots classification accuracy against increasing D for fixed noise and normalization levels. This will demonstrate the transition toward perfect separability as D grows, consistent with the concentration-of-measure and Feldman-Hajek limits. For the real spectra we will include a short discussion noting that the observed D (≈ 1000) already places the data in the regime where the finite-dimensional geometry approximates the infinite-dimensional singularity. These additions directly address the request for a controlled isolation of the mechanism. revision: yes
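The accuracy-versus-D curve promised in the second response can be anticipated in closed form for the simplest case. The sketch below assumes pure i.i.d. Gaussian noise with the class means and SD quoted in the Figure 9 caption (0 vs. 0.01, SD 0.01), leaves out the Lorentzian signal, and classifies by thresholding the pixel average. The classifier and the function names are assumptions made for illustration, not the authors' setup.

```python
import math
import random


def mean_shift_accuracy(d, delta=0.01, sigma=0.01):
    """Closed-form accuracy of thresholding the pixel average at delta / 2.

    Classes: i.i.d. N(0, sigma^2) vs. N(delta, sigma^2) noise on d pixels with
    equal priors. The pixel average has SD sigma / sqrt(d), so the accuracy is
    Phi(delta * sqrt(d) / (2 * sigma)).
    """
    z = delta * math.sqrt(d) / (2.0 * sigma)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def simulated_accuracy(d, delta=0.01, sigma=0.01, samples=2000, seed=0):
    """Monte Carlo check of the same decision rule."""
    rng = random.Random(seed)
    correct = 0
    for label, mean in ((0, 0.0), (1, delta)):
        for _ in range(samples):
            average = sum(rng.gauss(mean, sigma) for _ in range(d)) / d
            correct += (average > delta / 2.0) == bool(label)
    return correct / (2 * samples)
```

In this setting accuracy depends on D only through delta * sqrt(D) / sigma, so any fixed offset, however small, is driven toward perfect separation by adding points. The same quantity is the size of the mean shift measured in the noise covariance, which grows without bound with D and ties the finite-D curve to the Cameron-Martin condition raised by the referee.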

Circularity Check

0 steps flagged

No significant circularity; central claims rest on external theorems

full rationale

The paper derives its main result from the Feldman-Hajek theorem on equivalence/singularity of Gaussian measures on Hilbert space and the concentration-of-measure phenomenon, both standard external mathematical results independent of the present work. The abstract and description explicitly ground the separability claim in these theorems rather than in any self-defined quantity, fitted parameter, or prior self-citation. Experiments on synthetic and real spectra serve only as illustration, not as definitional inputs that are then re-predicted. No load-bearing step reduces by construction to the paper's own outputs or to a self-citation chain; the derivation chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on two standard mathematical results with no free parameters or new entities introduced.

axioms (2)
  • [standard math] Feldman-Hajek theorem
    Invoked to establish when two Gaussian measures on an infinite-dimensional space are mutually singular and thus perfectly separable.
  • [standard math] Concentration of measure phenomenon
    Used to argue that small distributional perturbations become large separations in high dimensions.
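How the two axioms interact in finite dimension can be quantified in closed form for the simplest case. The sketch below assumes two isotropic Gaussians with a shared mean and different standard deviations, and uses the convention H^2 = 1 - BC for the squared Hellinger distance, where BC is the Bhattacharyya coefficient. Function names are illustrative.

```python
import math


def bhattacharyya_coefficient(d, sigma1, sigma2):
    """BC between N(mu, sigma1^2 I_d) and N(mu, sigma2^2 I_d) with a shared mean."""
    per_axis = 2.0 * sigma1 * sigma2 / (sigma1**2 + sigma2**2)
    # One-dimensional BC is sqrt(per_axis); independence across d axes gives the power d.
    return per_axis ** (d / 2.0)


def squared_hellinger(d, sigma1, sigma2):
    """Squared Hellinger distance under the convention H^2 = 1 - BC, in [0, 1]."""
    return 1.0 - bhattacharyya_coefficient(d, sigma1, sigma2)


def bayes_error_upper_bound(d, sigma1, sigma2):
    """Bhattacharyya bound on the Bayes error for equal class priors: BC / 2."""
    return 0.5 * bhattacharyya_coefficient(d, sigma1, sigma2)
```

For any fixed gap between the two standard deviations, BC decays geometrically in d. The squared Hellinger distance therefore approaches 1 and the error bound approaches 0, but neither limit is reached at finite d. That is the finite-dimensional counterpart of the singular case in the Feldman-Hajek dichotomy.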

pith-pipeline@v0.9.0 · 5473 in / 1329 out tokens · 36468 ms · 2026-05-10T18:53:47.675354+00:00 · methodology


