Recognition: no theorem link
Better Protein Function Prediction by Modeling Survivorship Bias
Pith reviewed 2026-05-11 01:30 UTC · model grok-4.3
The pith
Modeling nucleotide mutation rates accounts for survivorship bias in protein sequence data and improves function prediction.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Evo-PU is a positive-unlabeled learning framework that uses a scientific model of nucleotide mutation rates to account for survivorship bias in single-organism protein sequence data. It distinguishes missing sequences that are likely non-functional (because they are one mutation away from observed variants) from those missing because they are unlikely to arise through mutation. When applied to well-surveilled organisms, this approach yields better predictions of protein functionality than methods that ignore the evolutionary process generating the observed data.
What carries the argument
Evo-PU framework, which models the probability that an unobserved sequence would have arisen and been observed if functional, using nucleotide mutation rates to adjust treatment of unlabeled examples in PU learning.
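As a rough illustration of how observability could reweight the unlabeled term of a PU objective (a minimal sketch under assumed interfaces, not the paper's actual formulation; `p_observed_if_functional` is a hypothetical stand-in for the mutation-rate model):

```python
# Minimal sketch of observability-weighted PU learning (illustrative only).
# Standard PU learning treats every unlabeled example uniformly; here each
# unlabeled sequence s gets a weight w(s) = P(s observed | s functional),
# so sequences that "should" have been seen count more strongly as negatives.

def pu_risk(positives, unlabeled, loss_pos, loss_neg, p_observed_if_functional):
    """Toy risk: observed positives pull toward 'functional'; unlabeled
    sequences pull toward 'non-functional' in proportion to how likely
    they would have been observed had they been functional."""
    risk = sum(loss_pos(s) for s in positives) / max(len(positives), 1)
    weights = [p_observed_if_functional(s) for s in unlabeled]
    total = sum(weights) or 1.0  # guard against an all-zero weight vector
    risk += sum(w * loss_neg(s) for s, w in zip(unlabeled, weights)) / total
    return risk
```

A sequence one mutation from a common observed variant would receive a weight near 1 and act almost like a labeled negative; a sequence unreachable by plausible mutation would receive a weight near 0 and contribute almost nothing.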
Load-bearing premise
That a scientific model of nucleotide mutation rates can reliably distinguish missing sequences due to non-functionality from those missing because they never arose through mutation.
What would settle it
If Evo-PU shows no accuracy gain over standard PU learning when predicting held-out functional versus non-functional variants in influenza or RSV mutagenesis data, the benefit of the mutation-based bias model would be refuted.
Original abstract
Protein sequence data from nature exhibits survivorship bias: we only observe data from those organisms that survive and reproduce, while non-functional protein mutations are eliminated by natural selection. Thus, predicting whether a protein sequence is functional often requires learning from positive examples alone. While positive-unlabeled (PU) learning frameworks offer a generic solution to this problem, existing PU methods ignore the evolutionary processes that shape sequence observability and cause survivorship bias. Consider a sequence that is one mutation away from a commonly-observed protein variant in a well-surveilled organism. If the sequence were functional, it would likely be observed. If it is not observed, this suggests non-functionality. In contrast, sequences that are unlikely to arise through mutation may be missing simply because they never arose. Thus, these two kinds of missing sequences should be treated differently when training models. In this work, we propose Evo-PU, a PU learning framework that uses a scientific understanding of nucleotide mutation to model survivorship bias for well-surveilled single-organism sequence data. On three prediction tasks using single-organism uniform-coverage surveillance data -- predicting results from held-out influenza and respiratory syncytial virus (RSV) mutagenesis studies, and predicting future SARS-CoV-2 variants -- Evo-PU outperforms standard PU learning, one-class classification (OCC), and protein language models (PLMs). On prediction tasks from multi-organism ProteinGym datasets with more heterogeneous surveillance coverage, we identify opportunities to generalize our approach.
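The one-mutation-away reasoning in the abstract can be sketched concretely (illustrative substitution rates only, not the paper's fitted mutation model):

```python
# Sketch: probability mass for an unobserved nucleotide sequence emerging
# via a single point mutation from any observed variant. A candidate with
# high emergence probability that is nevertheless never observed is
# suggestive evidence of non-functionality; a candidate with near-zero
# emergence probability may simply never have arisen.

MUT_RATE = {  # toy per-site substitution rates (transitions > transversions)
    ('A', 'G'): 1e-4, ('G', 'A'): 1e-4, ('C', 'T'): 1e-4, ('T', 'C'): 1e-4,
}
TRANSVERSION_RATE = 2.5e-5  # all other single-base substitutions

def single_mutation_prob(src, dst):
    """Rate of mutating src -> dst if they differ at exactly one site."""
    if len(src) != len(dst):
        return 0.0
    diffs = [(a, b) for a, b in zip(src, dst) if a != b]
    if len(diffs) != 1:
        return 0.0
    return MUT_RATE.get(diffs[0], TRANSVERSION_RATE)

def emergence_prob(candidate, observed_variants):
    """Total single-step probability of the candidate arising from the
    observed data-generating population."""
    return sum(single_mutation_prob(s, candidate) for s in observed_variants)
```

For example, a candidate reachable by transitions from two observed variants accumulates both contributions, while a candidate differing from every observed variant at several sites scores zero under this one-step model.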
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes Evo-PU, a positive-unlabeled (PU) learning framework for protein function prediction that incorporates a nucleotide mutation model to account for survivorship bias in single-organism surveillance data. It claims superior performance over standard PU learning, one-class classification, and protein language models on three tasks: predicting held-out mutagenesis results for influenza and RSV, and future SARS-CoV-2 variants. For multi-organism data, it discusses generalization opportunities.
Significance. If the central claim holds, the work offers a principled way to integrate evolutionary mutation knowledge into PU learning for biased sequence data, potentially improving functional predictions for well-surveilled viral proteins. The approach is notable for using a scientific model of mutation rather than purely data-driven heuristics, which could generalize to other evolutionary settings if the assumptions are validated.
major comments (2)
- [Methods] Methods section (mutation model and reweighting procedure): The central distinction between non-functional missing sequences (one mutation from observed variants) and those missing due to low mutation probability relies on the nucleotide mutation model accurately reflecting the data-generating process. No validation, calibration, or sensitivity analysis of this model against the influenza, RSV, or SARS-CoV-2 surveillance datasets is reported, which is load-bearing for the claim that Evo-PU improves upon standard PU learning rather than reducing to it with an unverified heuristic.
- [Results] Results section (performance on three tasks): The outperformance claims on held-out mutagenesis prediction and future variant prediction are presented without reported statistical significance tests, confidence intervals, or ablation studies isolating the contribution of the mutation-based reweighting versus other modeling choices. This makes it difficult to assess whether the gains are robust or attributable to the survivorship bias modeling.
minor comments (2)
- [Abstract] The abstract and introduction could more clearly specify the exact form of the mutation model (e.g., substitution matrix, context dependence) and how unlabeled examples are reweighted in the PU objective.
- [Methods] Notation for the PU risk or reweighting function should be defined explicitly with reference to standard PU formulations to aid readability.
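For context, the standard unbiased PU risk estimator that such notation is usually anchored to (du Plessis et al., 2015), with class prior $\pi$, positive sample $\mathcal{P}$, unlabeled sample $\mathcal{U}$, and loss $\ell$ (these are the standard symbols, not necessarily the paper's):

```latex
% Unbiased PU risk: treat unlabeled data as negative, then subtract the
% positive contamination of the unlabeled set, weighted by the class prior.
\hat{R}_{\mathrm{pu}}(f)
  = \pi \,\hat{\mathbb{E}}_{x \sim \mathcal{P}}\!\left[\ell(f(x), +1)\right]
  + \hat{\mathbb{E}}_{x \sim \mathcal{U}}\!\left[\ell(f(x), -1)\right]
  - \pi \,\hat{\mathbb{E}}_{x \sim \mathcal{P}}\!\left[\ell(f(x), -1)\right]
```

Presumably Evo-PU modifies the unlabeled term with mutation-derived weights; stating the reweighting relative to this baseline would make the departure from standard PU learning explicit.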
Simulated Author's Rebuttal
We thank the referee for their constructive and insightful comments on our manuscript. We have carefully reviewed the major concerns and provide point-by-point responses below. Both issues raised are valid and can be addressed through targeted revisions and additional analyses, which we will incorporate in the next version of the paper.
Point-by-point responses
-
Referee: [Methods] Methods section (mutation model and reweighting procedure): The central distinction between non-functional missing sequences (one mutation from observed variants) and those missing due to low mutation probability relies on the nucleotide mutation model accurately reflecting the data-generating process. No validation, calibration, or sensitivity analysis of this model against the influenza, RSV, or SARS-CoV-2 surveillance datasets is reported, which is load-bearing for the claim that Evo-PU improves upon standard PU learning rather than reducing to it with an unverified heuristic.
Authors: We agree that the manuscript would be strengthened by explicit validation and sensitivity analysis of the nucleotide mutation model on the specific datasets. The model parameters are drawn from established literature on viral nucleotide substitution rates, but we did not report dataset-specific calibration or robustness checks. In the revision, we will add a dedicated subsection to the Methods describing the mutation model in detail, include a calibration analysis comparing predicted mutation probabilities against observed variant frequencies in the surveillance data, and perform sensitivity analyses by varying key parameters (such as transition/transversion bias and overall mutation rate) while reporting the resulting changes in downstream performance. This will demonstrate that the reported gains are robust and not artifacts of unverified heuristics. revision: yes
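The calibration analysis promised here could take a shape like the following sketch (hypothetical data and binning choices; the real analysis would use the surveillance datasets):

```python
# Sketch of a calibration check: compare model-predicted emergence
# probabilities against empirical variant frequencies, binned by
# predicted probability. A well-calibrated mutation model yields mean
# observed frequencies that increase roughly monotonically across bins.

def calibration_table(predicted, observed_freq, n_bins=4):
    """Group variants into equal-size bins of predicted probability and
    report (mean predicted, mean observed frequency) per bin."""
    pairs = sorted(zip(predicted, observed_freq))
    size = max(len(pairs) // n_bins, 1)
    table = []
    for i in range(0, len(pairs), size):
        chunk = pairs[i:i + size]
        mean_pred = sum(p for p, _ in chunk) / len(chunk)
        mean_obs = sum(f for _, f in chunk) / len(chunk)
        table.append((mean_pred, mean_obs))
    return table
```

A flat or non-monotone table would flag exactly the failure mode the referee worries about: mutation-model weights that do not track the actual data-generating process.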
-
Referee: [Results] Results section (performance on three tasks): The outperformance claims on held-out mutagenesis prediction and future variant prediction are presented without reported statistical significance tests, confidence intervals, or ablation studies isolating the contribution of the mutation-based reweighting versus other modeling choices. This makes it difficult to assess whether the gains are robust or attributable to the survivorship bias modeling.
Authors: The referee is correct that the current Results section lacks statistical tests, confidence intervals, and ablations isolating the mutation reweighting component. We will revise the Results to include bootstrap confidence intervals for all performance metrics on the three tasks, paired statistical significance tests (e.g., McNemar's test for classification tasks or Wilcoxon signed-rank tests) comparing Evo-PU against each baseline, and ablation experiments that remove the mutation-based reweighting (reducing to standard PU learning) while keeping all other modeling choices fixed. These will be added to the tables and text, allowing clear assessment of whether the survivorship bias modeling drives the observed improvements. revision: yes
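The promised bootstrap confidence intervals could be sketched as follows (toy paired metrics; a real analysis would use per-task results alongside the paired significance tests the authors name):

```python
import random

# Sketch of a paired bootstrap confidence interval for the metric
# difference between Evo-PU and a baseline, computed over matched
# evaluation units (e.g., folds or test subsets).

def bootstrap_ci(evo_scores, base_scores, n_boot=10_000, alpha=0.05, seed=0):
    """95% CI (default alpha) for mean(evo - baseline) via paired bootstrap.
    If the interval excludes 0, the improvement is unlikely to be noise."""
    rng = random.Random(seed)
    diffs = [e - b for e, b in zip(evo_scores, base_scores)]
    means = []
    for _ in range(n_boot):
        sample = [rng.choice(diffs) for _ in diffs]
        means.append(sum(sample) / len(sample))
    means.sort()
    lo = means[int(alpha / 2 * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi
```

Pairing the resamples by evaluation unit is what isolates the method effect from unit-to-unit difficulty, which is the referee's robustness concern.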
Circularity Check
No significant circularity; method uses external mutation model as independent input
full rationale
The paper's core proposal is Evo-PU, which incorporates a pre-existing scientific model of nucleotide mutation rates to reweight unlabeled sequences in positive-unlabeled learning for protein function prediction. This mutation model is described as drawn from established biological understanding rather than fitted or derived from the surveillance or held-out prediction datasets. The evaluation on held-out influenza/RSV mutagenesis results and future SARS-CoV-2 variants constitutes an independent test, with no equations or steps in the abstract reducing the claimed performance gains to a tautology or self-fit. No self-citations, ansatzes smuggled via prior work, or renaming of known results appear as load-bearing elements. The derivation chain remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Nucleotide mutation rates provide a reliable prior for sequence observability in well-surveilled organisms.
Reference graph
Works this paper leans on
-
[1]
Harish Nair, W Abdullah Brooks, Mark Katz, Anna Roca, James A Berkley, Shabir A Madhi, James Mark Simmerman, Aubree Gordon, Masatoki Sato, Stephen Howie, et al. Global burden of respiratory infections due to seasonal influenza in young children: a systematic review and meta-analysis. The Lancet, 378(9807):1917–1930, 2011.
-
[2]
Pramuditha Perera, Poojan Oza, and Vishal M Patel. One-class classification: A survey. arXiv preprint arXiv:2101.03064, 2021.
-
[3]
Adam Thomas, Benjamin D Evans, Mark van der Giezen, and Nicholas J Harmer. Survivor bias drives overestimation of stability in reconstructed ancestral proteins. bioRxiv, 2022.
-
[10]
R Chalapathy. Anomaly detection using one-class neural networks. arXiv preprint arXiv:1802.06360, 2018.
-
[11]
Chesner Désir, Simon Bernard, Caroline Petitjean, and Laurent Heutte. A random forest based approach for one class classification in medical imaging. In Machine Learning in Medical Imaging: Third International Workshop, MLMI 2012, Held in Conjunction with MICCAI 2012, Nice, France, October 1, 2012, Revised Selected Papers 3, pages 250–257. Springer, 2012.
-
[12]
Zahra Ghafoori and Christopher Leckie. Deep multi-sphere support vector data description. In Proceedings of the 2020 SIAM International Conference on Data Mining, pages 109–117. SIAM, 2020.
-
[13]
Xiao-Li Li, Bing Liu, and See Kiong Ng. Negative training data can be harmful to text classification. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 218–228, 2010.
-
[14]
NCBI. Viral Surveillance and Subtyping Interface (VSSI). https://www.ncbi.nlm.nih.gov/labs/virus/vssi/#/ [Accessed: 2025-05-15]. NCBI. SARS-CoV-2 Data Hub.
-
[15]
NCBI SARS-CoV-2 data hub query: https://www.ncbi.nlm.nih.gov/labs/virus/vssi/#/virus?SeqType_s=Nucleotide&VirusLineage_ss=Severe%20acute%20respiratory%20syndrome%20coronavirus%202,%20taxid:2697049&CollectionDate_dr=2020-01-01T00:00:00.00Z%20TO%202026-01-21T23:59:59.00Z&HostLineage_ss=Homo%20sapiens%20(human),%20taxid:9606
Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James B...
-
[16]
Andrew Skabar. Single-class classifier learning using neural networks: An application to the prediction of mineral deposits. In Proceedings of the 2003 International Conference on Machine Learning and Cybernetics (IEEE Cat. No. 03EX693), volume 4, pages 2127–2132. IEEE, 2003.