Recognition: no theorem link
Better Protein Function Prediction by Modeling Survivorship Bias
Pith reviewed 2026-05-11 01:30 UTC · model grok-4.3
The pith
Modeling nucleotide mutation rates accounts for survivorship bias in protein sequence data and improves function prediction.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Evo-PU is a positive-unlabeled learning framework that uses a scientific model of nucleotide mutation rates to account for survivorship bias in single-organism protein sequence data. It distinguishes missing sequences that are likely non-functional (because they are one mutation away from observed variants) from those missing because they are unlikely to arise through mutation. When applied to well-surveilled organisms, this approach yields better predictions of protein functionality than methods that ignore the evolutionary process generating the observed data.
What carries the argument
Evo-PU framework, which models the probability that an unobserved sequence would have arisen and been observed if functional, using nucleotide mutation rates to adjust treatment of unlabeled examples in PU learning.
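As a rough illustration of how observability could reweight the unlabeled term of a PU objective (a minimal sketch under assumed interfaces, not the paper's actual formulation; `p_observed_if_functional` is a hypothetical stand-in for the mutation-rate model):

```python
# Minimal sketch of observability-weighted PU learning (illustrative only).
# Standard PU learning treats every unlabeled example uniformly; here each
# unlabeled sequence s gets a weight w(s) = P(s observed | s functional),
# so sequences that "should" have been seen count more strongly as negatives.

def pu_risk(positives, unlabeled, loss_pos, loss_neg, p_observed_if_functional):
    """Toy risk: observed positives pull toward 'functional'; unlabeled
    sequences pull toward 'non-functional' in proportion to how likely
    they would have been observed had they been functional."""
    risk = sum(loss_pos(s) for s in positives) / max(len(positives), 1)
    weights = [p_observed_if_functional(s) for s in unlabeled]
    total = sum(weights) or 1.0  # guard against an all-zero weight vector
    risk += sum(w * loss_neg(s) for s, w in zip(unlabeled, weights)) / total
    return risk
```

A sequence one mutation from a common observed variant would receive a weight near 1 and act almost like a labeled negative; a sequence unreachable by plausible mutation would receive a weight near 0 and contribute almost nothing.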
Load-bearing premise
That a scientific model of nucleotide mutation rates can reliably distinguish missing sequences due to non-functionality from those missing because they never arose through mutation.
What would settle it
If Evo-PU shows no accuracy gain over standard PU learning when predicting held-out functional versus non-functional variants in influenza or RSV mutagenesis data, the benefit of the mutation-based bias model would be refuted.
Original abstract
Protein sequence data from nature exhibits survivorship bias: we only observe data from those organisms that survive and reproduce, while non-functional protein mutations are eliminated by natural selection. Thus, predicting whether a protein sequence is functional often requires learning from positive examples alone. While positive-unlabeled (PU) learning frameworks offer a generic solution to this problem, existing PU methods ignore the evolutionary processes that shape sequence observability and cause survivorship bias. Consider a sequence that is one mutation away from a commonly-observed protein variant in a well-surveilled organism. If the sequence were functional, it would likely be observed. If it is not observed, this suggests non-functionality. In contrast, sequences that are unlikely to arise through mutation may be missing simply because they never arose. Thus, these two kinds of missing sequences should be treated differently when training models. In this work, we propose Evo-PU, a PU learning framework that uses a scientific understanding of nucleotide mutation to model survivorship bias for well-surveilled single-organism sequence data. On three prediction tasks using single-organism uniform-coverage surveillance data -- predicting results from held-out influenza and respiratory syncytial virus (RSV) mutagenesis studies, and predicting future SARS-CoV-2 variants -- Evo-PU outperforms standard PU learning, one-class classification (OCC), and protein language models (PLMs). On prediction tasks from multi-organism ProteinGym datasets with more heterogeneous surveillance coverage, we identify opportunities to generalize our approach.
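The one-mutation-away reasoning in the abstract can be sketched concretely (illustrative substitution rates only, not the paper's fitted mutation model):

```python
# Sketch: probability mass for an unobserved nucleotide sequence emerging
# via a single point mutation from any observed variant. A candidate with
# high emergence probability that is nevertheless never observed is
# suggestive evidence of non-functionality; a candidate with near-zero
# emergence probability may simply never have arisen.

MUT_RATE = {  # toy per-site substitution rates (transitions > transversions)
    ('A', 'G'): 1e-4, ('G', 'A'): 1e-4, ('C', 'T'): 1e-4, ('T', 'C'): 1e-4,
}
TRANSVERSION_RATE = 2.5e-5  # all other single-base substitutions

def single_mutation_prob(src, dst):
    """Rate of mutating src -> dst if they differ at exactly one site."""
    if len(src) != len(dst):
        return 0.0
    diffs = [(a, b) for a, b in zip(src, dst) if a != b]
    if len(diffs) != 1:
        return 0.0
    return MUT_RATE.get(diffs[0], TRANSVERSION_RATE)

def emergence_prob(candidate, observed_variants):
    """Total single-step probability of the candidate arising from the
    observed data-generating population."""
    return sum(single_mutation_prob(s, candidate) for s in observed_variants)
```

For example, a candidate reachable by transitions from two observed variants accumulates both contributions, while a candidate differing from every observed variant at several sites scores zero under this one-step model.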
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes Evo-PU, a positive-unlabeled (PU) learning framework for protein function prediction that incorporates a nucleotide mutation model to account for survivorship bias in single-organism surveillance data. It claims superior performance over standard PU learning, one-class classification, and protein language models on three tasks: predicting held-out mutagenesis results for influenza and RSV, and future SARS-CoV-2 variants. For multi-organism data, it discusses generalization opportunities.
Significance. If the central claim holds, the work offers a principled way to integrate evolutionary mutation knowledge into PU learning for biased sequence data, potentially improving functional predictions for well-surveilled viral proteins. The approach is notable for using a scientific model of mutation rather than purely data-driven heuristics, which could generalize to other evolutionary settings if the assumptions are validated.
major comments (2)
- [Methods] Methods section (mutation model and reweighting procedure): The central distinction between non-functional missing sequences (one mutation from observed variants) and those missing due to low mutation probability relies on the nucleotide mutation model accurately reflecting the data-generating process. No validation, calibration, or sensitivity analysis of this model against the influenza, RSV, or SARS-CoV-2 surveillance datasets is reported, which is load-bearing for the claim that Evo-PU improves upon standard PU learning rather than reducing to it with an unverified heuristic.
- [Results] Results section (performance on three tasks): The outperformance claims on held-out mutagenesis prediction and future variant prediction are presented without reported statistical significance tests, confidence intervals, or ablation studies isolating the contribution of the mutation-based reweighting versus other modeling choices. This makes it difficult to assess whether the gains are robust or attributable to the survivorship bias modeling.
minor comments (2)
- [Abstract] The abstract and introduction could more clearly specify the exact form of the mutation model (e.g., substitution matrix, context dependence) and how unlabeled examples are reweighted in the PU objective.
- [Methods] Notation for the PU risk or reweighting function should be defined explicitly with reference to standard PU formulations to aid readability.
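For context, the standard unbiased PU risk estimator that such notation is usually anchored to (du Plessis et al., 2015), with class prior $\pi$, positive sample $\mathcal{P}$, unlabeled sample $\mathcal{U}$, and loss $\ell$ (these are the standard symbols, not necessarily the paper's):

```latex
% Unbiased PU risk: treat unlabeled data as negative, then subtract the
% positive contamination of the unlabeled set, weighted by the class prior.
\hat{R}_{\mathrm{pu}}(f)
  = \pi \,\hat{\mathbb{E}}_{x \sim \mathcal{P}}\!\left[\ell(f(x), +1)\right]
  + \hat{\mathbb{E}}_{x \sim \mathcal{U}}\!\left[\ell(f(x), -1)\right]
  - \pi \,\hat{\mathbb{E}}_{x \sim \mathcal{P}}\!\left[\ell(f(x), -1)\right]
```

Presumably Evo-PU modifies the unlabeled term with mutation-derived weights; stating the reweighting relative to this baseline would make the departure from standard PU learning explicit.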
Simulated Author's Rebuttal
We thank the referee for their constructive and insightful comments on our manuscript. We have carefully reviewed the major concerns and provide point-by-point responses below. Both issues raised are valid and can be addressed through targeted revisions and additional analyses, which we will incorporate in the next version of the paper.
Point-by-point responses
-
Referee: [Methods] Methods section (mutation model and reweighting procedure): The central distinction between non-functional missing sequences (one mutation from observed variants) and those missing due to low mutation probability relies on the nucleotide mutation model accurately reflecting the data-generating process. No validation, calibration, or sensitivity analysis of this model against the influenza, RSV, or SARS-CoV-2 surveillance datasets is reported, which is load-bearing for the claim that Evo-PU improves upon standard PU learning rather than reducing to it with an unverified heuristic.
Authors: We agree that the manuscript would be strengthened by explicit validation and sensitivity analysis of the nucleotide mutation model on the specific datasets. The model parameters are drawn from established literature on viral nucleotide substitution rates, but we did not report dataset-specific calibration or robustness checks. In the revision, we will add a dedicated subsection to the Methods describing the mutation model in detail, include a calibration analysis comparing predicted mutation probabilities against observed variant frequencies in the surveillance data, and perform sensitivity analyses by varying key parameters (such as transition/transversion bias and overall mutation rate) while reporting the resulting changes in downstream performance. This will demonstrate that the reported gains are robust and not artifacts of unverified heuristics. revision: yes
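The calibration analysis promised here could take a shape like the following sketch (hypothetical data and binning choices; the real analysis would use the surveillance datasets):

```python
# Sketch of a calibration check: compare model-predicted emergence
# probabilities against empirical variant frequencies, binned by
# predicted probability. A well-calibrated mutation model yields mean
# observed frequencies that increase roughly monotonically across bins.

def calibration_table(predicted, observed_freq, n_bins=4):
    """Group variants into equal-size bins of predicted probability and
    report (mean predicted, mean observed frequency) per bin."""
    pairs = sorted(zip(predicted, observed_freq))
    size = max(len(pairs) // n_bins, 1)
    table = []
    for i in range(0, len(pairs), size):
        chunk = pairs[i:i + size]
        mean_pred = sum(p for p, _ in chunk) / len(chunk)
        mean_obs = sum(f for _, f in chunk) / len(chunk)
        table.append((mean_pred, mean_obs))
    return table
```

A flat or non-monotone table would flag exactly the failure mode the referee worries about: mutation-model weights that do not track the actual data-generating process.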
-
Referee: [Results] Results section (performance on three tasks): The outperformance claims on held-out mutagenesis prediction and future variant prediction are presented without reported statistical significance tests, confidence intervals, or ablation studies isolating the contribution of the mutation-based reweighting versus other modeling choices. This makes it difficult to assess whether the gains are robust or attributable to the survivorship bias modeling.
Authors: The referee is correct that the current Results section lacks statistical tests, confidence intervals, and ablations isolating the mutation reweighting component. We will revise the Results to include bootstrap confidence intervals for all performance metrics on the three tasks, paired statistical significance tests (e.g., McNemar's test for classification tasks or Wilcoxon signed-rank tests) comparing Evo-PU against each baseline, and ablation experiments that remove the mutation-based reweighting (reducing to standard PU learning) while keeping all other modeling choices fixed. These will be added to the tables and text, allowing clear assessment of whether the survivorship bias modeling drives the observed improvements. revision: yes
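The promised bootstrap confidence intervals could be sketched as follows (toy paired metrics; a real analysis would use per-task results alongside the paired significance tests the authors name):

```python
import random

# Sketch of a paired bootstrap confidence interval for the metric
# difference between Evo-PU and a baseline, computed over matched
# evaluation units (e.g., folds or test subsets).

def bootstrap_ci(evo_scores, base_scores, n_boot=10_000, alpha=0.05, seed=0):
    """95% CI (default alpha) for mean(evo - baseline) via paired bootstrap.
    If the interval excludes 0, the improvement is unlikely to be noise."""
    rng = random.Random(seed)
    diffs = [e - b for e, b in zip(evo_scores, base_scores)]
    means = []
    for _ in range(n_boot):
        sample = [rng.choice(diffs) for _ in diffs]
        means.append(sum(sample) / len(sample))
    means.sort()
    lo = means[int(alpha / 2 * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi
```

Pairing the resamples by evaluation unit is what isolates the method effect from unit-to-unit difficulty, which is the referee's robustness concern.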
Circularity Check
No significant circularity; method uses external mutation model as independent input
full rationale
The paper's core proposal is Evo-PU, which incorporates a pre-existing scientific model of nucleotide mutation rates to reweight unlabeled sequences in positive-unlabeled learning for protein function prediction. This mutation model is described as drawn from established biological understanding rather than fitted or derived from the surveillance or held-out prediction datasets. The evaluation on held-out influenza/RSV mutagenesis results and future SARS-CoV-2 variants constitutes an independent test, with no equations or steps in the abstract reducing the claimed performance gains to a tautology or self-fit. No self-citations, ansatzes smuggled via prior work, or renaming of known results appear as load-bearing elements. The derivation chain remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Nucleotide mutation rates provide a reliable prior for sequence observability in well-surveilled organisms.
Reference graph
Works this paper leans on
-
[1]
Harish Nair, W Abdullah Brooks, Mark Katz, Anna Roca, James A Berkley, Shabir A Madhi, James Mark Simmerman, Aubree Gordon, Masatoki Sato, Stephen Howie, et al. Global burden of respiratory infections due to seasonal influenza in young children: a systematic review and meta-analysis. The Lancet, 378(9807):1917–1930, 2011.
-
[2]
Pramuditha Perera, Poojan Oza, and Vishal M Patel. One-class classification: A survey. arXiv preprint arXiv:2101.03064, 2021.
-
[3]
Adam Thomas, Benjamin D Evans, Mark van der Giezen, and Nicholas J Harmer. Survivor bias drives overestimation of stability in reconstructed ancestral proteins. bioRxiv, 2022.
-
[10]
R Chalapathy. Anomaly detection using one-class neural networks. arXiv preprint arXiv:1802.06360, 2018.
-
[11]
Chesner Désir, Simon Bernard, Caroline Petitjean, and Laurent Heutte. A random forest based approach for one class classification in medical imaging. In Machine Learning in Medical Imaging: Third International Workshop, MLMI 2012, Held in Conjunction with MICCAI 2012, Nice, France, October 1, 2012, Revised Selected Papers 3, pages 250–257. Springer, 2012.
-
[12]
Zahra Ghafoori and Christopher Leckie. Deep multi-sphere support vector data description. In Proceedings of the 2020 SIAM International Conference on Data Mining, pages 109–117. SIAM, 2020.
-
[13]
Xiao-Li Li, Bing Liu, and See Kiong Ng. Negative training data can be harmful to text classification. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 218–228, 2010.
-
[14]
NCBI. Viral Surveillance and Subtyping Interface (VSSI). https://www.ncbi.nlm.nih.gov/labs/virus/vssi/#/ [Accessed: 2025-05-15]. NCBI. SARS-CoV-2 Data Hub.
-
[15]
NCBI SARS-CoV-2 data hub query: https://www.ncbi.nlm.nih.gov/labs/virus/vssi/#/virus?SeqType_s=Nucleotide&VirusLineage_ss=Severe%20acute%20respiratory%20syndrome%20coronavirus%202,%20taxid:2697049&CollectionDate_dr=2020-01-01T00:00:00.00Z%20TO%202026-01-21T23:59:59.00Z&HostLineage_ss=Homo%20sapiens%20(human),%20taxid:9606
Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James B...
-
[16]
Andrew Skabar. Single-class classifier learning using neural networks: An application to the prediction of mineral deposits. In Proceedings of the 2003 International Conference on Machine Learning and Cybernetics (IEEE Cat. No. 03EX693), volume 4, pages 2127–2132. IEEE, 2003.