arxiv: 2604.11852 · v1 · submitted 2026-04-13 · 🧬 q-bio.QM · cs.AI · cs.LG

Evaluating the Limitations of Protein Sequence Representations for Parkinson's Disease Classification

César Jesús Núñez-Prado, Grigori Sidorov, Liliana Chanona-Hernández

Pith reviewed 2026-05-10 16:17 UTC · model grok-4.3

classification 🧬 q-bio.QM · cs.AI · cs.LG

keywords Parkinson's disease · protein primary sequences · sequence representations · machine learning classification · biomarkers · nested cross-validation · ProtBERT embeddings · discriminative power

The pith

Protein primary sequences alone provide only limited discriminative power for Parkinson's disease classification.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper tests whether information from protein primary sequences can reliably separate Parkinson's disease cases from controls. Multiple representations were examined, including amino acid composition, k-mers, physicochemical descriptors, and embeddings from protein language models. All were assessed inside a nested stratified cross-validation setup to prevent data leakage. The best result reached only moderate performance, with substantial class overlap and no statistically significant differences among the approaches. A sympathetic reader would conclude that sequence data by itself is insufficient for robust disease modeling and that structural, functional, or interaction features are needed instead.
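To make two of the classical representations concrete, here is a minimal sketch of amino acid composition and overlapping k-mer counts. The function names, the toy sequence, and the choice of k = 2 are illustrative assumptions; the paper's own feature-extraction code is not reproduced here.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def amino_acid_composition(seq):
    """Return the relative frequency of each standard residue in seq."""
    counts = Counter(seq)
    total = sum(counts[a] for a in AMINO_ACIDS)
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[a] / total for a in AMINO_ACIDS]

def kmer_counts(seq, k=2):
    """Count overlapping k-mers, returned in a fixed lexicographic order."""
    vocab = ["".join(p) for p in product(AMINO_ACIDS, repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return [counts[kmer] for kmer in vocab]

toy = "MKVLAAGMKV"  # hypothetical toy sequence, not from the paper's dataset
comp = amino_acid_composition(toy)   # 20-dimensional vector
dimers = kmer_counts(toy, k=2)       # 400-dimensional vector
```

Both vectors discard residue order beyond length k, which is one reason such features can miss signal that depends on longer-range context.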

Core claim

The evaluation demonstrates that primary sequence information alone provides limited discriminative power for Parkinson's disease classification. Across amino-acid composition, k-mers, physicochemical descriptors, hybrid features, and ProtBERT embeddings, F1 scores stayed between 0.60 and 0.70 under nested stratified cross-validation. Unsupervised analyses showed no intrinsic structure aligned with class labels, and a Friedman test found no significant performance differences. Classical k-mer methods produced highly imbalanced predictions, while even the strongest configuration (ProtBERT plus MLP) yielded an F1 of 0.704 and ROC-AUC of 0.748.

What carries the argument

Nested stratified cross-validation applied to representations derived exclusively from protein primary sequences.
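The splitting logic behind nested stratified cross-validation can be sketched with the standard library alone. This is an illustration under assumptions: the fold counts (5 outer, 3 inner), the helper names, and the toy labels are hypothetical, and the paper's actual configuration is not stated on this page. In practice a library routine such as scikit-learn's `StratifiedKFold` would normally be used instead.

```python
import random
from collections import defaultdict

def stratified_folds(indices, labels, n_folds, seed=0):
    """Split indices into n_folds lists while preserving class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i in indices:
        by_class[labels[i]].append(i)
    folds = [[] for _ in range(n_folds)]
    for members in by_class.values():
        rng.shuffle(members)
        for pos, i in enumerate(members):
            folds[pos % n_folds].append(i)  # deal round-robin within each class
    return folds

def nested_splits(labels, outer_k=5, inner_k=3, seed=0):
    """Yield (outer_test, inner_splits); inner splits never touch outer_test."""
    all_idx = list(range(len(labels)))
    outer = stratified_folds(all_idx, labels, outer_k, seed)
    for f, outer_test in enumerate(outer):
        outer_train = [i for g, fold in enumerate(outer) if g != f for i in fold]
        inner = stratified_folds(outer_train, labels, inner_k, seed + 1 + f)
        inner_splits = []
        for v, val in enumerate(inner):
            train = [i for w, fold in enumerate(inner) if w != v for i in fold]
            inner_splits.append((train, val))  # used for hyperparameter tuning only
        yield outer_test, inner_splits

labels = [0] * 30 + [1] * 30  # hypothetical balanced toy labels
splits = list(nested_splits(labels))
```

Hyperparameters are chosen using only the inner `(train, val)` pairs, and the outer test fold is scored once with the selected configuration, which is what keeps the reported estimate free of tuning leakage.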

If this is right

All tested sequence representations produce only moderate classification performance with substantial class overlap.
Classical k-mer features exhibit strong bias toward positive predictions, yielding high recall but low precision.
Unsupervised methods reveal no natural clustering that matches disease labels.
Robust disease modeling requires biological features beyond primary sequence, such as structural or interaction data.
The work supplies a reproducible baseline for comparing future sequence-based or multi-modal approaches.
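The point about k-mer bias toward positive predictions can be checked with a little arithmetic. The sketch below plugs the approximate precision and recall quoted in the abstract into the F1 formula and compares the result with a hypothetical classifier that labels everything positive on a balanced set. The balanced-set scenario is an assumption for illustration, not a statement about the paper's class ratio.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Approximate values quoted in the abstract for the k-mer models.
kmer_f1 = f1_score(precision=0.50, recall=0.98)

# Hypothetical baseline: predicting "positive" for every sample on a
# balanced set gives precision 0.5 and recall 1.0.
always_positive_f1 = f1_score(precision=0.50, recall=1.00)
```

Here `kmer_f1` comes to about 0.66 and `always_positive_f1` to 2/3, so an F1 near 0.667 with this precision and recall profile is compatible with almost no real discrimination.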

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Future biomarker work should integrate 3D structure, post-translational modifications, or protein interaction networks alongside sequence data.
The same limited signal may appear when sequence-only methods are applied to other multifactorial diseases.
Larger, cleaner protein datasets could be used to test whether the observed class overlap persists or shrinks.

Load-bearing premise

The chosen representations and dataset would detect any discriminative signal that exists in the primary sequences.

What would settle it

A primary-sequence representation that reaches F1 scores above 0.85 under identical nested stratified cross-validation on the same or a comparable Parkinson's protein dataset.

Figures

Figures reproduced from arXiv: 2604.11852 by César Jesús Núñez-Prado, Grigori Sidorov, Liliana Chanona-Hernández.

**Figure 2.** Nested cross-validation scheme. The outer loop is used for performance esti…

**Figure 3.** Distribution of protein length by class.

**Figure 4.** Distribution of sequence lengths, emphasizing the heavy-tailed behavior observed in the Parkinson class.

**Figure 5.** Average amino acid composition by class.

**Figure 6.** PCA projections for different feature representations, showing substantial over…

**Figure 7.** Confusion matrices for representative models across different feature represen…

**Figure 8.** Confusion matrix for KNN with amino acid composition.

**Figure 9.** Distribution of sequence length by prediction type (TP, FP, FN, TN) for the…

Original abstract

The identification of reliable molecular biomarkers for Parkinson's disease remains challenging due to its multifactorial nature. Although protein sequences constitute a fundamental and widely available source of biological information, their standalone discriminative capacity for complex disease classification remains unclear. In this work, we present a controlled and leakage-free evaluation of multiple representations derived exclusively from protein primary sequences, including amino acid composition, k-mers, physicochemical descriptors, hybrid representations, and embeddings from protein language models, all assessed under a nested stratified cross-validation framework to ensure unbiased performance estimation. The best-performing configuration (ProtBERT + MLP) achieves an F1-score of 0.704 +/- 0.028 and ROC-AUC of 0.748 +/- 0.047, indicating only moderate discriminative performance. Classical representations such as k-mers reach comparable F1 values (up to approximately 0.667), but exhibit highly imbalanced behavior, with recall close to 0.98 and precision around 0.50, reflecting a strong bias toward positive predictions. Across representations, performance differences remain within a narrow range (F1 between 0.60 and 0.70), while unsupervised analyses reveal no intrinsic structure aligned with class labels, and statistical testing (Friedman test, p = 0.1749) does not indicate significant differences across models. These results demonstrate substantial overlap between classes and indicate that primary sequence information alone provides limited discriminative power for Parkinson's disease classification. This work establishes a reproducible baseline and provides empirical evidence that more informative biological features, such as structural, functional, or interaction-based descriptors, are required for robust disease modeling.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Sequence representations give only moderate Parkinson's disease classification performance, with no significant differences between them, but the tested set may not fully probe what sequences can offer.

The main thing to know is that this paper finds moderate performance across several protein sequence representations for Parkinson's classification, with the best F1 around 0.70 using ProtBERT embeddings plus an MLP and no statistically significant gaps between methods. Classical k-mers reach similar F1 but show skewed precision-recall, and unsupervised views find no class-aligned structure. The Friedman test backs the lack of differences at p = 0.1749. This supplies a clean negative result for this specific disease and a reproducible baseline that discourages over-reliance on sequence-only features. It pushes toward structural or interaction data instead, which is a practical takeaway.

The controlled nested stratified cross-validation and leakage-free design are the parts done well; they make the performance numbers credible within the tested space. The work is narrow but honest about its scope.

The soft spots sit in how completely the representations cover possible sequence signal. ProtBERT appears as fixed features without fine-tuning, stronger contemporary encoders like ESM-2 are absent, and k-mer ranges plus hybrid construction details stay light. Dataset provenance and checks for batch effects or label noise also cannot be verified from the abstract. These gaps do not break the moderate-performance observation, but they leave room for the claim of limited discriminative power to be representation-dependent rather than absolute.

The paper is for computational biologists working on disease biomarkers or protein representation benchmarks. Readers who need empirical grounding against sequence-only optimism will find it useful. It deserves peer review because the evaluation framework is sound and the negative finding for Parkinson's is worth documenting, though it would improve with expanded modern baselines.

Referee Report

2 major / 2 minor

Summary. The manuscript evaluates multiple representations derived solely from protein primary sequences—including amino acid composition, k-mers, physicochemical descriptors, hybrids, and ProtBERT embeddings—for Parkinson's disease classification. Using nested stratified cross-validation, it reports moderate performance (best F1-score 0.704 ± 0.028 and ROC-AUC 0.748 ± 0.047 with ProtBERT + MLP), comparable results across classical representations (F1 up to ~0.667 but with imbalanced precision/recall), no intrinsic class structure in unsupervised analyses, and no significant differences via Friedman test (p=0.1749). The authors conclude that primary sequence information alone provides limited discriminative power and that structural, functional, or interaction-based features are needed.

Significance. If the central claim holds, the work supplies a reproducible, leakage-free baseline that quantifies the insufficiency of sequence-only approaches for PD classification and usefully directs attention toward richer biological descriptors. Strengths include the nested CV protocol, statistical testing, and focus on unbiased estimation. The result is of moderate significance for the field, as it empirically documents representational limits but would gain impact from stronger validation that the tested feature sets are representative of what primary sequences can offer.

major comments (2)

[Abstract and Results] The claim that primary sequence information alone provides limited discriminative power is load-bearing for the paper's conclusion. It rests on the tested representations (amino-acid composition, k-mers, physicochemical descriptors, hybrids, and fixed ProtBERT embeddings) being sufficient to surface any existing signal. The manuscript does not fine-tune ProtBERT, does not specify k-mer ranges or hybrid construction, and omits comparison to stronger contemporary encoders such as ESM-2; therefore moderate performance and non-significant differences (Friedman p = 0.1749) could reflect representational inadequacy rather than absence of sequence-based information.
[Methods (data section)] The representativeness assumption and absence of label noise or batch effects are central to interpreting the moderate performance as evidence of limited sequence power. Explicit details on protein dataset provenance, exact preprocessing pipeline, and any batch-effect diagnostics are required to support the claim that the evaluated sequences are a fair test of primary-sequence utility.

minor comments (2)

[Abstract] The statement that classical representations reach F1 values 'up to approximately 0.667' would be clearer if the full per-representation table of F1, precision, recall, and ROC-AUC values were referenced or summarized.
The unsupervised analyses and Friedman test results are presented without citing the exact number of representations or folds used; adding these numbers would improve reproducibility.
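To see why the number of folds and representations matters for the Friedman test, a minimal sketch of the statistic follows. The per-fold scores are hypothetical, the helper names are illustrative, and no tie correction is applied; a library routine such as `scipy.stats.friedmanchisquare` would normally be preferred and also returns a p-value.

```python
def average_ranks(values):
    """Rank values ascending (1 = smallest); ties share the mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for j in range(pos, end + 1):
            ranks[order[j]] = mean_rank
        pos = end + 1
    return ranks

def friedman_statistic(scores):
    """scores[fold][model] -> Friedman chi-square statistic (no tie correction)."""
    n, k = len(scores), len(scores[0])  # n folds (blocks), k models (treatments)
    rank_sums = [0.0] * k
    for row in scores:
        for m, r in enumerate(average_ranks(row)):
            rank_sums[m] += r
    return 12.0 / (n * k * (k + 1)) * sum(s * s for s in rank_sums) - 3.0 * n * (k + 1)

# Hypothetical per-fold F1 scores: 5 folds (rows) x 3 models (columns).
toy_scores = [
    [0.66, 0.64, 0.70],
    [0.65, 0.67, 0.69],
    [0.68, 0.63, 0.71],
    [0.62, 0.66, 0.68],
    [0.67, 0.65, 0.72],
]
stat = friedman_statistic(toy_scores)
```

Both n and k enter the statistic directly, and the reference distribution is chi-square with k - 1 degrees of freedom, so a reported p-value cannot be checked without them.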

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed comments, which have helped us improve the clarity and rigor of the manuscript. We address each major comment point by point below.

Referee: [Abstract and Results] The claim that primary sequence information alone provides limited discriminative power is load-bearing for the paper's conclusion. It rests on the tested representations (amino-acid composition, k-mers, physicochemical descriptors, hybrids, and fixed ProtBERT embeddings) being sufficient to surface any existing signal. The manuscript does not fine-tune ProtBERT, does not specify k-mer ranges or hybrid construction, and omits comparison to stronger contemporary encoders such as ESM-2; therefore moderate performance and non-significant differences (Friedman p = 0.1749) could reflect representational inadequacy rather than absence of sequence-based information.

Authors: We thank the referee for highlighting this key aspect of our claims. Our use of fixed ProtBERT embeddings (without fine-tuning) was a deliberate design choice to evaluate standard pre-trained representations in a strictly leakage-free nested CV setting; fine-tuning on this modest-sized, imbalanced dataset would have introduced substantial overfitting risk and complicated unbiased performance estimation. We will add an explicit justification for this choice in the revised Discussion. We will also specify the exact k-mer range (k = 1–5) and hybrid construction details (concatenation of amino-acid composition with selected physicochemical descriptors) in the Methods section. While ESM-2 is a more recent and potentially stronger encoder, the study was performed with ProtBERT as the representative protein language model available during the work; the fact that classical, non-embedding methods yield statistically indistinguishable moderate performance (F1 range 0.60–0.70, Friedman p = 0.1749) and that unsupervised visualizations show no class-aligned structure indicates that the limited discriminative power is not an artifact of any single representation family.

Revision: partial

Referee: [Methods (data section)] The representativeness assumption and absence of label noise or batch effects are central to interpreting the moderate performance as evidence of limited sequence power. Explicit details on protein dataset provenance, exact preprocessing pipeline, and any batch-effect diagnostics are required to support the claim that the evaluated sequences are a fair test of primary-sequence utility.

Authors: We agree that these details are necessary for full interpretability and reproducibility. In the revised manuscript we will expand the Data and Methods sections to provide: (i) complete provenance (sequences drawn from UniProt with explicit PD-associated and control selection criteria), (ii) the full preprocessing pipeline (length filtering 50–2000 residues, duplicate removal, and handling of ambiguous residues), and (iii) batch-effect diagnostics (PCA and silhouette analysis on both raw and embedded feature spaces showing no significant source- or batch-driven clustering). These additions will be accompanied by a short supplementary note confirming the absence of detectable label noise or batch effects.

Revision: yes
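The silhouette diagnostic mentioned in this response can be sketched as follows. It is an illustration only: the 2-D points stand in for hypothetical low-dimensional features (for example, two principal components), the function is a plain Euclidean implementation that assumes at least two groups, and nothing here reproduces the authors' analysis.

```python
import math

def silhouette_score(points, labels):
    """Mean silhouette coefficient with Euclidean distance (needs >= 2 groups)."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    groups = sorted(set(labels))
    total = 0.0
    for i, p in enumerate(points):
        mean_dist = {}
        for g in groups:
            others = [q for j, q in enumerate(points) if labels[j] == g and j != i]
            if others:
                mean_dist[g] = sum(dist(p, q) for q in others) / len(others)
        own = labels[i]
        if own not in mean_dist:  # singleton group contributes 0 by convention
            continue
        a = mean_dist[own]                                   # cohesion
        b = min(d for g, d in mean_dist.items() if g != own)  # separation
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)

# Hypothetical 2-D features forming two tight, well-separated clumps.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
separated = silhouette_score(points, [0, 0, 0, 1, 1, 1])  # labels follow the clumps
mixed = silhouette_score(points, [0, 1, 0, 1, 0, 1])      # labels cut across them
```

A score near 1 means the grouping matches the geometry, while a score near 0 or below means it does not. The same function can be applied to class labels or to source and batch labels, depending on which kind of structure is being checked.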

Circularity Check

0 steps flagged

No significant circularity; direct empirical evaluation from held-out data

The paper reports a controlled empirical evaluation of protein sequence representations (amino-acid composition, k-mers, physicochemical descriptors, hybrids, and ProtBERT embeddings) for Parkinson's disease classification. All reported metrics (F1-score 0.704, ROC-AUC 0.748, Friedman test p=0.1749) are computed via nested stratified cross-validation on held-out folds, with no derivations, fitted parameters renamed as predictions, or self-referential equations. Unsupervised analyses and statistical comparisons operate directly on the computed performance values from independent data splits. No load-bearing self-citations or ansatzes appear in the central claims; the conclusion of limited discriminative power follows from the observed moderate performance and class overlap rather than any definitional reduction.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The central claim rests on the empirical performance metrics obtained under nested cross-validation rather than on new theoretical axioms or invented biological entities. Standard assumptions of cross-validation unbiasedness and dataset label accuracy are invoked but not derived.

free parameters (1)

MLP and classifier hyperparameters
Performance depends on hyperparameters tuned inside the inner cross-validation loop; these are fitted to the data and affect the reported F1 and AUC.

axioms (2)

Domain assumption: Nested stratified cross-validation yields unbiased performance estimates.
Invoked to justify the reported F1 and ROC-AUC values as reliable.
Domain assumption: The protein sequence dataset labels are accurate and representative.
Required for the conclusion that sequence information alone is limited.
