Recognition: 2 Lean theorem links
Exploring the limits of pre-trained embeddings in machine-guided protein design: a case study on predicting AAV vector viability
Pith reviewed 2026-05-15 21:47 UTC · model grok-4.3
The pith
Pre-trained protein embeddings require task-specific fine-tuning to reach optimal performance in predicting AAV capsid viability; once fine-tuned, sequence-level representations outperform amino-acid-level ones.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Optimal performance in predicting AAV vector viability from pre-trained embeddings occurs exclusively after fine-tuning with task-specific labels, at which stage sequence-level representations deliver the strongest results. Prior to fine-tuning, amino-acid-level embeddings prove superior for supervised prediction while sequence-level embeddings suit unsupervised tasks better. The degree of sequence variation required to induce meaningful changes in the representations surpasses the localized mutations typically tested in bioengineering, indicating that fine-tuning is essential when datasets feature sparse or regionally concentrated changes.
What carries the argument
Comparison of ProtBERT and ESM2 embedding variants as sequence representations for AAV capsid viability, evaluated before and after task-specific fine-tuning while distinguishing amino-acid-level from sequence-level outputs.
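The distinction between the two embedding granularities is easiest to see in code. A minimal sketch, using random vectors as stand-ins for real ProtBERT or ESM2 outputs, and mean pooling as one common (but not the only) aggregation strategy:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a pLM output: one vector per residue.
# A real model (ProtBERT, ESM2) would compute these from the sequence.
seq_len, dim = 5, 8
per_residue = rng.normal(size=(seq_len, dim))   # amino-acid-level embedding

# Sequence-level embedding via mean pooling over residue positions.
sequence_level = per_residue.mean(axis=0)

print(per_residue.shape)     # (5, 8): one vector per amino acid
print(sequence_level.shape)  # (8,): one vector per sequence
```

On this reading, the pre-fine-tuning supervised models consume something like `per_residue`, while the unsupervised analyses start from a pooled vector like `sequence_level`.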
If this is right
- Fine-tuning with task labels becomes necessary to extract usable signals from embeddings when mutations are sparse or localized.
- Sequence-level representations gain an advantage over amino-acid-level ones once fine-tuning occurs.
- Pre-trained embeddings alone cannot reliably capture functional effects from the small sequence changes typical in bioengineering.
- Bioengineering experiments may need to introduce larger sequence variations to observe representation shifts without fine-tuning.
Where Pith is reading between the lines
- Similar fine-tuning steps could improve results in other protein design tasks that rely on localized mutations rather than broad sequence changes.
- Hybrid approaches combining frozen pre-trained layers with light task-specific adaptation layers might balance performance and data efficiency.
- New benchmarks focused on mutation sensitivity could help quantify how much sequence change is needed before pre-trained representations shift meaningfully.
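The hybrid idea in the second bullet can be made concrete with a minimal sketch: features stay frozen (random vectors stand in for pre-trained embeddings here) and only a small logistic head is fit to task labels. The data, dimensions, and training loop are illustrative assumptions, not the paper's setup:

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in for frozen pre-trained embeddings: fixed feature vectors.
n, dim = 200, 16
X = rng.normal(size=(n, dim))
w_true = rng.normal(size=dim)
y = (X @ w_true > 0).astype(float)   # synthetic viable / non-viable labels

# Light adaptation layer: a logistic head trained by gradient descent,
# leaving the "backbone" features X untouched.
w, b, lr = np.zeros(dim), 0.0, 0.1
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    w -= lr * (X.T @ (p - y)) / n
    b -= lr * (p - y).mean()

acc = (((X @ w + b) > 0) == (y > 0.5)).mean()
print(f"training accuracy of the light head: {acc:.2f}")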
Load-bearing premise
The AAV capsid viability dataset and the supervised or unsupervised tasks selected represent general protein bioengineering problems that involve sparse or localized mutations.
What would settle it
Demonstration on a new dataset with similarly localized mutations that pre-trained embeddings achieve equal or superior predictive accuracy without any task-specific fine-tuning would falsify the central claim.
Figures
Original abstract
Effective representations of protein sequences are widely recognized as a cornerstone of machine learning-based protein design. Yet, protein bioengineering poses unique challenges for sequence representation, as experimental datasets typically feature few mutations, which are either sparsely distributed across the entire sequence or densely concentrated within localized regions. This limits the ability of sequence-level representations to extract functionally meaningful signals. In addition, comprehensive comparative studies remain scarce, despite their crucial role in clarifying which representations best encode relevant information and ultimately support superior predictive performance. In this study, we systematically evaluate multiple ProtBERT and ESM2 embedding variants as sequence representations, using the adeno-associated virus capsid as a case study and prototypical example of bioengineering, where functional optimization is targeted through highly localized sequence variation within an otherwise large protein. Our results reveal that, prior to fine-tuning, amino acid-level embeddings outperform sequence-level representations in supervised predictive tasks, whereas the latter tend to be more effective in unsupervised settings. However, optimal performance is only achieved when embeddings are fine-tuned with task-specific labels, with sequence-level representations providing the best performance. Moreover, our findings indicate that the extent of sequence variation required to produce notable shifts in sequence representations exceeds what is typically explored in bioengineering studies, showing the need for fine-tuning in datasets characterized by sparse or highly localized mutations.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper evaluates multiple variants of ProtBERT and ESM2 pre-trained embeddings as representations for predicting AAV capsid viability, using this as a case study for protein bioengineering with sparse or localized mutations. Key findings are that amino-acid-level embeddings outperform sequence-level ones in supervised tasks before fine-tuning (while the reverse holds for unsupervised settings), but optimal performance requires task-specific fine-tuning, after which sequence-level representations are best; additionally, the degree of sequence variation needed to induce notable shifts in representations exceeds levels typical in bioengineering studies.
Significance. If the empirical patterns hold under broader validation, the work usefully documents the practical limits of frozen pre-trained embeddings for downstream tasks with limited mutational diversity, and supplies concrete evidence that fine-tuning is often required. This could help steer the field away from over-reliance on off-the-shelf representations in protein-design pipelines.
major comments (2)
- [Abstract / Results] Abstract and Results: The central claim that 'the extent of sequence variation required to produce notable shifts in sequence representations exceeds what is typically explored in bioengineering studies' is load-bearing for the recommendation to fine-tune, yet the manuscript provides no quantitative comparison (e.g., mean or distribution of Hamming distances or number of mutated positions) between the AAV dataset and representative literature values for other localized-mutation engineering campaigns.
- [Methods] Methods / Experimental Setup: The abstract states clear comparative results, but the absence of reported statistical tests, cross-validation details, negative controls, or explicit checks against post-hoc model selection makes it impossible to determine whether the reported performance gaps (pre- vs. post-fine-tuning, amino-acid vs. sequence level) are robust or could be driven by dataset-specific artifacts.
minor comments (1)
- [Methods] Notation for embedding variants (ProtBERT vs. ESM2 sizes, pooling strategies) should be defined once in a table or methods subsection rather than repeated inline.
Simulated Author's Rebuttal
We thank the referee for their thoughtful and constructive comments, which have identified important opportunities to strengthen the manuscript. We address each major comment below and outline the revisions we will make.
Point-by-point responses
- Referee: [Abstract / Results] Abstract and Results: The central claim that 'the extent of sequence variation required to produce notable shifts in sequence representations exceeds what is typically explored in bioengineering studies' is load-bearing for the recommendation to fine-tune, yet the manuscript provides no quantitative comparison (e.g., mean or distribution of Hamming distances or number of mutated positions) between the AAV dataset and representative literature values for other localized-mutation engineering campaigns.
Authors: We agree that a direct quantitative comparison would make the claim more robust and actionable. In the revised manuscript we will add a new paragraph (and accompanying table) in the Results section that reports the mean and distribution of Hamming distances and the number of mutated positions in the AAV viability dataset. We will then compare these statistics to representative values drawn from the literature on localized-mutation campaigns (e.g., enzyme active-site engineering and antibody affinity maturation). This addition will provide concrete support for the statement that the mutational diversity in our case study is typical of the bioengineering settings we discuss. revision: yes
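The proposed table boils down to a few lines of code. A toy sketch with an invented wild-type and variant set (the real computation would run over the Bryant et al. AAV capsid sequences):

```python
from collections import Counter

# Illustrative toy data only; the proposed table would report these
# statistics for the actual AAV viability dataset.
wild_type = "MKTAYIA"
variants = ["MKTAYIA", "MKSAYIA", "MKSAYLA", "AKSAYIV"]

def hamming(a, b):
    """Number of mismatched positions between two equal-length sequences."""
    assert len(a) == len(b)
    return sum(x != y for x, y in zip(a, b))

dists = [hamming(wild_type, v) for v in variants]
mean_dist = sum(dists) / len(dists)

print("distances:", dists)                  # [0, 1, 2, 3]
print("mean Hamming distance:", mean_dist)  # 1.5
print("distribution:", dict(Counter(dists)))
```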
- Referee: [Methods] Methods / Experimental Setup: The abstract states clear comparative results, but the absence of reported statistical tests, cross-validation details, negative controls, or explicit checks against post-hoc model selection makes it impossible to determine whether the reported performance gaps (pre- vs. post-fine-tuning, amino-acid vs. sequence level) are robust or could be driven by dataset-specific artifacts.
Authors: We acknowledge that the current Methods section lacks several elements needed to fully demonstrate statistical robustness. In the revision we will (i) expand the description of the cross-validation procedure (including the number of folds and how train/validation/test splits were constructed), (ii) report the results of paired statistical tests (e.g., Wilcoxon signed-rank or t-tests with appropriate multiple-testing correction) on the performance differences, (iii) include negative-control baselines such as random embeddings and label-shuffled controls, and (iv) clarify the model-selection protocol to show that hyper-parameter choices were fixed prior to final evaluation. These additions will be placed in a dedicated subsection of Methods and referenced in the Results. revision: yes
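For point (ii), one distribution-free option alongside the Wilcoxon or t-tests the authors mention is a paired sign-flip permutation test. The per-fold scores below are invented for illustration only:

```python
import random

random.seed(0)

# Hypothetical paired scores: the same CV folds evaluated with two
# representations (values are illustrative, not taken from the paper).
seq_level = [0.91, 0.89, 0.93, 0.90, 0.92, 0.88, 0.94, 0.90]
aa_level  = [0.87, 0.86, 0.90, 0.88, 0.89, 0.85, 0.91, 0.87]
diffs = [a - b for a, b in zip(seq_level, aa_level)]

# Under H0 (no real difference), each fold's difference is equally
# likely to be positive or negative, so we randomly flip signs.
observed = sum(diffs) / len(diffs)
n_perm, extreme = 10000, 0
for _ in range(n_perm):
    perm = sum(d * random.choice((1, -1)) for d in diffs) / len(diffs)
    if abs(perm) >= abs(observed):
        extreme += 1
p_value = extreme / n_perm

print(f"mean paired difference: {observed:.3f}, p ~ {p_value:.4f}")
```

With eight folds all favoring one representation, the test rejects at conventional thresholds; label-shuffled controls (point iii) would be run through the same machinery.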
Circularity Check
No circularity: empirical evaluation of pre-trained embeddings on external AAV dataset
Full rationale
The paper performs a systematic empirical comparison of ProtBERT and ESM2 embedding variants on the AAV capsid viability dataset. All performance claims (pre- vs post-fine-tuning, sequence- vs amino-acid-level representations, variation thresholds) are derived directly from supervised/unsupervised task results on held-out data. No derivations, equations, or first-principles results are presented that reduce to fitted parameters or self-referential definitions. Self-citations, if present, support background on embeddings or datasets but are not load-bearing for the central empirical findings. The study is self-contained against external benchmarks and does not invoke uniqueness theorems or ansatzes from prior author work.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Pre-trained protein language models capture transferable sequence information that can be adapted via fine-tuning to localized mutation tasks
- domain assumption The AAV capsid dataset exemplifies typical bioengineering constraints of sparse or highly localized sequence variation
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/BranchSelection.lean · branch_selection · unclear?
unclear: Relation between the paper passage and the cited Recognition theorem.
"optimal performance is only achieved when embeddings are fine-tuned with task-specific labels, with sequence-level representations providing the best performance. Moreover, the extent of sequence variation required to produce notable shifts in sequence representations exceeds what is typically explored in bioengineering studies."
-
IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · LogicNat recovery · unclear?
unclear: Relation between the paper passage and the cited Recognition theorem.
"We systematically evaluate multiple ProtBERT and ESM2 embedding variants as sequence representations, using the adeno-associated virus capsid as a case study"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Introduction Machine learning (ML)-based protein design has become a powerful strategy in modern protein engineering [1]. A critical aspect of this approach is selecting an appropriate format to represent the protein sequence as input for the ML model. The optimal representation depends on factors such as the specific task to be learnt and dataset charact...
work page 2023
-
[2]
Results This study investigates how pre-trained embeddings perform as representations of protein sequences in ML tasks relevant to protein bioengineering, particularly when datasets contain only small and localized sequence changes, a common scenario in this field. We examine different embedding variants that can be generated directly by ProtBERT and ES...
work page 2021
-
[3]
Discussion Embeddings generated with pLMs are currently a leading choice as ML-friendly formats to represent protein sequences due to their ability to capture rich, high-dimensional information about sequence context and short- to long-range interactions between amino acids [11]. However, pre-trained embeddings are inherently general-purpose representation...
-
[4]
Methods 4.1. Dataset, data preprocessing, and mutation landscape analysis The data used in this work is part of a dataset published by Bryant et al. (2021) [5], which reports a comprehensive study for machine-guided AAV2 capsid diversification. The set comprises 296,968 variants of the AAV2 capsid protein, both viable (153,691) and non-viable (143,278), a...
work page 2021
-
[5]
Kouba, P. et al. Machine Learning-Guided Protein Engineering. ACS Catal. 13, 13863–13895 (2023)
work page 2023
-
[6]
Yue, Z.-X. et al. A systematic review on the state-of-the-art strategies for protein representation. Computers in Biology and Medicine 152, 106440 (2023)
work page 2023
-
[7]
Harding-Larsen, D. et al. Protein representations: Encoding biological information for machine learning in biocatalysis. Biotechnology Advances 77, 108459 (2024)
work page 2024
-
[8]
Ogden, P. J., Kelsic, E. D., Sinai, S. & Church, G. M. Comprehensive AAV capsid fitness landscape reveals a viral gene and enables machine-guided design. Science 366, 1139–1143 (2019)
work page 2019
-
[9]
Bryant, D. H. et al. Deep diversification of an AAV capsid protein by machine learning. Nat Biotechnol 39, 691–696 (2021)
work page 2021
-
[10]
Marques, A. D. et al. Applying machine learning to predict viral assembly for adeno-associated virus capsid libraries. Molecular Therapy - Methods & Clinical Development 20, 276–286 (2021)
work page 2021
-
[11]
Han, Z. et al. Computer-Aided Directed Evolution Generates Novel AAV Variants with High Transduction Efficiency. Viruses 15, 848 (2023)
work page 2023
-
[12]
Mazurenko, S., Prokop, Z. & Damborsky, J. Machine Learning in Enzyme Engineering. ACS Catal. 10, 1210–1223 (2020)
work page 2020
-
[13]
Rezaee, K. & Eslami, H. Bridging machine learning and peptide design for cancer treatment: a comprehensive review. Artif Intell Rev 58, 156 (2025)
work page 2025
-
[14]
Tian, P., Louis, J. M., Baber, J. L., Aniana, A. & Best, R. B. Co-Evolutionary Fitness Landscapes for Sequence Design. Angew Chem Int Ed 57, 5674–5678 (2018)
work page 2018
- [16]
-
[18]
Zhao, Z., Alzubaidi, L., Zhang, J., Duan, Y. & Gu, Y. A comparison review of transfer learning and self-supervised learning: Definitions, applications, advantages and limitations. Expert Systems with Applications 242, 122807 (2024)
work page 2024
- [19]
-
[21]
Niu, Z., Zhong, G. & Yu, H. A review on the attention mechanism of deep learning. Neurocomputing 452, 48–62 (2021)
work page 2021
-
[22]
Rao, R. et al. Evaluating Protein Transfer Learning with TAPE. Adv Neural Inf Process Syst 32, 9689–9701 (2019)
work page 2019
-
[23]
Capel, H. et al. ProteinGLUE multi-task benchmark suite for self-supervised protein modeling. Sci Rep 12, 16047 (2022)
work page 2022
-
[24]
Kulmanov, M. et al. Protein function prediction as approximate semantic entailment. Nat Mach Intell 6, 220–228 (2024)
work page 2024
-
[25]
Vu, T. T. D. & Jung, J. Gene Ontology based protein functional annotation using pretrained embeddings. in 2022 IEEE International Conference on Bioinformatics and Biomedicine (BIBM) 3893–3895 (IEEE, Las Vegas, NV, USA, 2022). doi:10.1109/BIBM55620.2022.9995108
-
[26]
Villegas-Morcillo, A., Gomez, A. M. & Sanchez, V. An analysis of protein language model embeddings for fold prediction. Briefings in Bioinformatics 23, bbac142 (2022)
work page 2022
-
[27]
Wu, F., Jing, X., Luo, X. & Xu, J. Improving protein structure prediction using templates and sequence embedding. Bioinformatics 39, btac723 (2023)
work page 2023
-
[28]
Gao, Q., Zhang, C., Li, M. & Yu, T. Protein–Protein Interaction Prediction Model Based on ProtBert-BiGRU-Attention. Journal of Computational Biology 31, 797–814 (2024)
work page 2024
- [29]
-
[30]
Rives, A. et al. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proc. Natl. Acad. Sci. U.S.A. 118, e2016239118 (2021)
work page 2021
-
[31]
Lin, Z. et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science 379, 1123–1130 (2023)
work page 2023
-
[32]
Hayes, T. et al. Simulating 500 million years of evolution with a language model. Science 387, 850–858 (2025)
work page 2025
-
[33]
Elnaggar, A. et al. ProtTrans: Toward Understanding the Language of Life Through Self-Supervised Learning. IEEE Trans. Pattern Anal. Mach. Intell. 44, 7112–7127 (2022)
work page 2022
-
[34]
Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Preprint at https://doi.org/10.48550/arXiv.1810.04805 (2019)
work page arXiv doi:10.48550/arXiv.1810.04805 2019
-
[36]
Villegas-Morcillo, A. et al. Unsupervised protein embeddings outperform hand-crafted sequence and structure features at predicting molecular function. Bioinformatics 37, 162–170 (2021)
work page 2021
-
[37]
ElAbd, H. et al. Amino acid encoding for deep learning applications. BMC Bioinformatics 21, (2020)
work page 2020
-
[38]
Ginn, S. L., Mandwie, M., Alexander, I. E., Edelstein, M. & Abedi, M. R. Gene therapy clinical trials worldwide to 2023—an update. The Journal of Gene Medicine 26, e3721 (2024)
work page 2023
-
[39]
Becker, J., Fakhiri, J. & Grimm, D. Fantastic AAV Gene Therapy Vectors and How to Find Them—Random Diversification, Rational Design and Machine Learning. Pathogens 11, 756 (2022)
work page 2022
-
[40]
Vu Hong, A. et al. An engineered AAV targeting integrin alpha V beta 6 presents improved myotropism across species. Nat Commun 15, 7965 (2024)
work page 2024
-
[41]
King, S. H. et al. Generative design of novel bacteriophages with genome language models. Preprint at https://doi.org/10.1101/2025.09.12.675911 (2025)
-
[42]
Wu, J. et al. Prediction of Adeno-Associated Virus Fitness with a Protein Language-Based Machine Learning Model. Human Gene Therapy 36, 823–829 (2025)
work page 2025
-
[43]
Eid, F.-E. et al. Systematic multi-trait AAV capsid engineering for efficient gene delivery. Nat Commun 15, 6602 (2024)
work page 2024
-
[44]
Wu, P. et al. Mutational Analysis of the Adeno-Associated Virus Type 2 (AAV2) Capsid Gene and Construction of AAV2 Vectors with Altered Tropism. J Virol 74, 8635–8647 (2000)
work page 2000
-
[45]
van der Maaten, L. & Hinton, G. Visualizing data using t-SNE. Journal of Machine Learning Research 9, 2579–2605 (2008)
work page 2008
-
[46]
Berkhin, P. A Survey of Clustering Data Mining Techniques. in Grouping Multidimensional Data 25–71 (Springer-Verlag, Berlin/Heidelberg). doi:10.1007/3-540-28349-8_2
-
[47]
Starr, T. N. et al. Shifting mutational constraints in the SARS-CoV-2 receptor-binding domain during viral evolution. Science 377, 420–424 (2022)
work page 2022
-
[48]
Cheng, P. et al. Zero-shot prediction of mutation effects with multimodal deep representation learning guides protein engineering. Cell Res 34, 630–647 (2024)
work page 2024
-
[49]
Hoang, M. & Singh, M. Locality-aware pooling enhances protein language model performance across varied applications. Bioinformatics 41, i217–i226 (2025)
work page 2025
-
[50]
Ofer, D., Brandes, N. & Linial, M. The language of proteins: NLP, machine learning & protein sequences. Computational and Structural Biotechnology Journal 19, 1750–1758 (2021)
work page 2021
- [51]
-
[52]
Su, J. et al. SaProt: Protein language modeling with structure-aware vocabulary. bioRxiv 2023.10.01.560349 (2023)
work page 2023
-
[53]
Li, M. et al. ProSST: Protein language modeling with quantized structure and disentangled attention. Advances in Neural Information Processing Systems 37, 35700–35726 (2024)
work page 2024
-
[54]
Dickson, A. & Mofrad, M. R. K. Fine-tuning protein embeddings for functional similarity evaluation. Bioinformatics 40, btae445 (2024)
work page 2024
-
[55]
Kang, H. et al. Fine-tuning of BERT Model to Accurately Predict Drug–Target Interactions. Pharmaceutics 14, 1710 (2022)
work page 2022
- [56]
-
[57]
NaderiAlizadeh, N. & Singh, R. Aggregating residue-level protein language model embeddings with optimal transport. Bioinformatics Advances 5, vbaf060 (2024)
work page 2024
-
[58]
Anowar, F., Sadaoui, S. & Selim, B. Conceptual and empirical comparison of dimensionality reduction algorithms (PCA, KPCA, LDA, MDS, SVD, LLE, ISOMAP, LE, ICA, t-SNE). Computer Science Review 40, 100378 (2021)
work page 2021
-
[59]
Steinegger, M. & Söding, J. Clustering huge protein sequence sets in linear time. Nat Commun 9, 2542 (2018)
work page 2018
-
[60]
The Hugging Face Community. Transformers: State-of-the-Art Natural Language Processing. (2019)
work page 2019
-
[61]
Pedregosa, F. et al. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)
work page 2011
-
[62]
Kornblith, S., Norouzi, M., Lee, H. & Hinton, G. Similarity of Neural Network Representations Revisited. Preprint at https://doi.org/10.48550/arXiv.1905.00414 (2019)
work page arXiv doi:10.48550/arXiv.1905.00414 2019
-
[63]
Schönemann, P. H. A Generalized Solution of the Orthogonal Procrustes Problem. Psychometrika 31, 1–10 (1966)
Funding: This work was supported by FCT - Fundação para a Ciência e Tecnologia, I.P. under the LASIGE Research Unit, ref. UID/00408/2025, DOI https://doi.org/10.54499/UID/00408/2025, and partially supported by project 41, HfPT: Health from Portu...