FlexMS is a flexible framework for benchmarking deep learning-based mass spectrum prediction tools in metabolomics
Pith reviewed 2026-05-15 19:18 UTC · model grok-4.3
The pith
FlexMS creates a benchmark framework for constructing and evaluating diverse deep learning architectures to predict mass spectra.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
FlexMS is a benchmark framework for mass spectrum prediction. It supports the dynamic construction of many distinct combinations of model architectures and assesses their performance on preprocessed public datasets using several metrics. The resulting experiments give insight into factors that influence performance: the structural diversity of datasets, hyperparameters, pretraining effects, metadata ablation settings, and cross-domain transfer learning.
What carries the argument
The FlexMS framework itself, which dynamically constructs combinations of model architectures and evaluates their performance on datasets using multiple metrics.
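To make the idea of dynamically constructing architecture combinations concrete, here is a minimal Python sketch. It is not FlexMS's actual API: the `ModelSpec` class, the `enumerate_specs` helper, and the component names (`gcn`, `gat`, `fingerprint`, `mlp`, `transformer`) are hypothetical and chosen only for illustration. It shows how a small set of interchangeable parts expands into a grid of candidate models.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class ModelSpec:
    """One point in the architecture grid (all field values are illustrative)."""
    encoder: str
    predictor: str
    use_metadata: bool

    @property
    def name(self) -> str:
        suffix = "meta" if self.use_metadata else "nometa"
        return f"{self.encoder}+{self.predictor}+{suffix}"


def enumerate_specs(encoders, predictors, metadata_options=(True, False)):
    """Return every encoder/predictor/metadata combination as a ModelSpec."""
    return [
        ModelSpec(encoder, predictor, use_metadata)
        for encoder, predictor, use_metadata in product(
            encoders, predictors, metadata_options
        )
    ]


# Hypothetical component families; a real framework would map each name
# to a constructor for the corresponding network module.
specs = enumerate_specs(
    encoders=["gcn", "gat", "fingerprint"],
    predictors=["mlp", "transformer"],
)
for spec in specs:
    print(spec.name)
```

Three encoders, two predictors, and two metadata settings already give twelve configurations, which is why a shared evaluation harness matters more as the grid grows.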
If this is right
- Insights emerge on how dataset structural diversity, learning rate, data sparsity, pretraining, metadata ablation, and transfer learning affect model performance.
- The framework supplies practical guidance for selecting suitable models for mass spectrum tasks.
- Retrieval benchmarks simulate real identification scenarios by scoring matches against predicted spectra.
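The retrieval scenario in the last point can be sketched as follows. This is an illustrative assumption, not the paper's implementation: spectra are represented as equal-length binned intensity vectors, cosine similarity is assumed as the scoring function, and the candidate IDs and the `rank_candidates` and `top_k_hit` helpers are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length intensity vectors."""
    if len(u) != len(v):
        raise ValueError("spectra must share the same binning")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty spectrum matches nothing
    return dot / (norm_u * norm_v)


def rank_candidates(query_spectrum, predicted_spectra):
    """Sort candidate IDs by similarity of their predicted spectrum to the query."""
    scored = [
        (candidate_id, cosine_similarity(query_spectrum, spectrum))
        for candidate_id, spectrum in predicted_spectra.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def top_k_hit(ranking, true_id, k):
    """True if the correct candidate appears among the k best-scoring entries."""
    return true_id in [candidate_id for candidate_id, _ in ranking[:k]]


# Toy data: one measured query spectrum and three predicted candidate spectra.
query = [0.0, 1.0, 0.5, 0.0]
candidates = {
    "cand_a": [0.0, 0.9, 0.6, 0.0],
    "cand_b": [1.0, 0.0, 0.0, 0.2],
    "cand_c": [0.0, 0.2, 1.0, 0.1],
}
ranking = rank_candidates(query, candidates)
print(ranking)
print(top_k_hit(ranking, "cand_a", k=1))
```

Averaging `top_k_hit` over many queries gives a top-k retrieval accuracy, one common way to summarize how useful predicted spectra are for identification.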
Where Pith is reading between the lines
- The modular design could support rapid testing of hybrid architectures that combine elements from multiple existing approaches.
- Standardized benchmarks like this might reduce redundant experiments when new prediction models appear in the literature.
- Extending the framework to incorporate private or proprietary spectra collections could reveal whether public-data insights generalize to laboratory settings.
Load-bearing premise
That the preprocessed public datasets and chosen evaluation metrics sufficiently represent real-world metabolomics identification challenges, and that dynamic model combinations will yield practically useful performance insights.
What would settle it
Applying FlexMS to a fresh collection of experimental spectra from an unseen metabolomics source and observing no statistically significant performance gaps across tested model combinations would undermine the claimed value of the flexible benchmarking approach.
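One way to check for a statistically significant performance gap between two model combinations is a paired permutation test on per-spectrum scores. The sketch below is an assumption about how such a check could be run, not a procedure taken from the paper; the function name and the toy scores are hypothetical.

```python
from itertools import product


def paired_sign_flip_pvalue(scores_a, scores_b):
    """Exact two-sided sign-flip permutation test on paired score differences."""
    if len(scores_a) != len(scores_b):
        raise ValueError("score lists must be paired and equally long")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))
    total = 0
    extreme = 0
    # Enumerate every sign assignment; only feasible for small samples,
    # larger samples would need random sampling of sign flips instead.
    for signs in product((1, -1), repeat=len(diffs)):
        stat = abs(sum(s * d for s, d in zip(signs, diffs)))
        total += 1
        if stat >= observed - 1e-12:  # tolerance for floating-point ties
            extreme += 1
    return extreme / total


# Toy per-spectrum similarity scores for two hypothetical model combinations.
model_a = [0.82, 0.75, 0.91, 0.68, 0.88, 0.79]
model_b = [0.80, 0.77, 0.85, 0.70, 0.83, 0.74]
print(paired_sign_flip_pvalue(model_a, model_b))
```

With more than two combinations, pairwise p-values would need a multiple-comparison correction before concluding that no gap exists.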
Original abstract
The identification and property prediction of chemical molecules is of central importance in the advancement of drug discovery and material science, where the tandem mass spectrometry technology gives valuable fragmentation cues in the form of mass-to-charge ratio peaks. However, the lack of experimental spectra hinders the attachment of each molecular identification, and thus urges the establishment of prediction approaches for computational models. Deep learning models appear promising for predicting molecular structure spectra, but overall assessment remains challenging as a result of the heterogeneity in methods and the lack of well-defined benchmarks. To address this, our contribution is the creation of benchmark framework FlexMS for constructing and evaluating diverse model architectures in mass spectrum prediction. With its easy-to-use flexibility, FlexMS supports the dynamic construction of numerous distinct combinations of model architectures, while assessing their performance on preprocessed public datasets using different metrics. In this paper, we provide insights into factors influencing performance, including the structural diversity of datasets, hyperparameters like learning rate and data sparsity, pretraining effects, metadata ablation settings and cross-domain transfer learning analysis. This provides practical guidance in choosing suitable models. Moreover, retrieval benchmarks simulate practical identification scenarios and score potential matches based on predicted spectra.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces FlexMS, a flexible benchmark framework for constructing and evaluating diverse deep learning model architectures for mass spectrum prediction in metabolomics. It supports dynamic combinations of architectures, evaluates them on preprocessed public datasets using multiple metrics, and reports insights on factors including learning rate, data sparsity, pretraining, metadata ablation, cross-domain transfer learning, and retrieval benchmarks for identification scenarios.
Significance. If the framework is robustly implemented with reproducible code and the reported insights are backed by concrete, validated results, FlexMS could help standardize benchmarking in a heterogeneous field and provide practical guidance for model selection. The emphasis on flexibility in architecture combinations and retrieval benchmarks is a positive contribution toward addressing the lack of well-defined evaluation standards.
major comments (2)
- [Abstract] Abstract: the claim that the framework 'provides insights into factors influencing performance' and 'practical guidance in choosing suitable models' cannot be assessed because the abstract (and available text) supplies no concrete quantitative results, error bars, specific findings, or validation details on any of the listed factors.
- [Framework evaluation sections] Framework evaluation sections: the practical guidance on hyperparameters, pretraining, and transfer learning is load-bearing only if the preprocessed public datasets preserve key real-world difficulties (instrument-specific noise, adduct distributions, lab-varying sparsity). If preprocessing steps systematically reduce these challenges, performance differences across model combinations may be benchmark artifacts rather than generalizable signals.
minor comments (2)
- [Abstract] Abstract: the title and opening sentence could more clearly distinguish the contribution as a benchmarking framework rather than a new prediction method.
- [Throughout] Notation and terminology: ensure consistent use of terms such as 'metadata ablation' and 'cross-domain transfer' with explicit definitions or references on first use.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript introducing FlexMS. We address each major comment point by point below and indicate the revisions we will make.
Point-by-point responses
- Referee: [Abstract] Abstract: the claim that the framework 'provides insights into factors influencing performance' and 'practical guidance in choosing suitable models' cannot be assessed because the abstract (and available text) supplies no concrete quantitative results, error bars, specific findings, or validation details on any of the listed factors.
  Authors: We agree that the abstract should include concrete quantitative results to make the claims directly assessable. Although the full manuscript reports specific findings on factors such as learning rate effects, data sparsity, pretraining, metadata ablation, and retrieval accuracies in the evaluation sections, the abstract is currently high-level. In the revised manuscript we will update the abstract to summarize key quantitative results, including performance metrics with error bars from multiple runs and specific insights on the listed factors.
  Revision: yes
- Referee: [Framework evaluation sections] Framework evaluation sections: the practical guidance on hyperparameters, pretraining, and transfer learning is load-bearing only if the preprocessed public datasets preserve key real-world difficulties (instrument-specific noise, adduct distributions, lab-varying sparsity). If preprocessing steps systematically reduce these challenges, performance differences across model combinations may be benchmark artifacts rather than generalizable signals.
  Authors: We acknowledge the importance of ensuring the preprocessed datasets retain real-world characteristics. The preprocessing follows standard metabolomics practices on public datasets (e.g., GNPS) and is intended to retain instrument noise patterns, adduct distributions, and sparsity variations. In the revision we will expand the methods and evaluation sections with explicit statistics comparing pre- and post-preprocessing distributions for adducts and sparsity, plus a brief discussion of limitations. We will not add entirely new raw-data experiments in this revision as they fall outside the current scope, but the added details will clarify the benchmark's fidelity.
  Revision: partial
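The promised comparison of distributions before and after preprocessing could, for instance, be summarized with a divergence between adduct frequency distributions. The following sketch assumes Jensen-Shannon divergence (base 2) as the summary statistic, which is a choice made here for illustration; the authors do not state which statistic they will report, and the adduct labels and counts are invented toy data.

```python
import math
from collections import Counter


def to_distribution(labels):
    """Turn a list of categorical labels into relative frequencies."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def kl_divergence(p, m, keys):
    """KL(p || m) in bits; terms with zero probability in p contribute nothing."""
    return sum(
        p.get(k, 0.0) * math.log2(p.get(k, 0.0) / m[k])
        for k in keys
        if p.get(k, 0.0) > 0.0
    )


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2): 0 for identical, 1 for disjoint."""
    keys = set(p) | set(q)
    mixture = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}
    return 0.5 * kl_divergence(p, mixture, keys) + 0.5 * kl_divergence(q, mixture, keys)


# Invented toy data: adduct annotations before and after a filtering step
# that happens to drop all sodium adducts.
raw = ["[M+H]+"] * 6 + ["[M+Na]+"] * 3 + ["[M-H]-"] * 1
filtered = ["[M+H]+"] * 6 + ["[M-H]-"] * 1

print(js_divergence(to_distribution(raw), to_distribution(filtered)))
```

A value near zero would suggest that preprocessing left the adduct mix largely intact, while a larger value would flag the kind of distribution shift the referee is concerned about.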
Circularity Check
No circularity: benchmark framework paper introduces tool without self-referential derivations
Full rationale
The manuscript describes the creation of the FlexMS software framework for constructing and evaluating mass spectrum prediction models on preprocessed public datasets. Its central contribution is the tool's flexibility in supporting dynamic model combinations and reporting observational insights on hyperparameters, pretraining, and transfer learning. No equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the derivation chain. The work is self-contained as an engineering contribution; performance claims rest on external public datasets and standard metrics rather than reducing to the framework's own inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- [standard math] Standard supervised deep learning training procedures and evaluation metrics apply to mass spectrum prediction.