pith. machine review for the scientific record.

arxiv: 2602.14828 · v1 · submitted 2026-02-16 · 🧬 q-bio.QM · cs.LG

Recognition: 2 theorem links


Exploring the limits of pre-trained embeddings in machine-guided protein design: a case study on predicting AAV vector viability

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 21:47 UTC · model grok-4.3

classification: 🧬 q-bio.QM · cs.LG
keywords: protein embeddings · pre-trained models · AAV capsid · machine learning · protein design · fine-tuning · sequence representations · viability prediction

The pith

Pre-trained protein embeddings require task-specific fine-tuning to reach optimal performance in predicting AAV capsid viability; once fine-tuned, sequence-level representations outperform amino-acid-level ones.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper tests multiple variants of ProtBERT and ESM2 embeddings as representations for protein sequences in the context of AAV capsid viability prediction, a typical bioengineering case with highly localized mutations. It shows that amino-acid-level embeddings lead in supervised tasks before any tuning, while sequence-level ones work better in unsupervised settings. Peak accuracy arrives only after fine-tuning on task labels, at which point sequence-level representations take the lead. The work also demonstrates that the amount of sequence change needed to produce clear shifts in these representations exceeds the sparse variations common in protein engineering experiments, establishing the necessity of fine-tuning for such datasets.

Core claim

Optimal performance in predicting AAV vector viability from pre-trained embeddings occurs exclusively after fine-tuning with task-specific labels, at which stage sequence-level representations deliver the strongest results. Prior to fine-tuning, amino-acid-level embeddings prove superior for supervised prediction while sequence-level embeddings suit unsupervised tasks better. The degree of sequence variation required to induce meaningful changes in the representations surpasses the localized mutations typically tested in bioengineering, indicating that fine-tuning is essential when datasets feature sparse or regionally concentrated changes.

What carries the argument

Comparison of ProtBERT and ESM2 embedding variants as sequence representations for AAV capsid viability, evaluated before and after task-specific fine-tuning while distinguishing amino-acid-level from sequence-level outputs.
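To make the amino-acid-level vs. sequence-level distinction concrete, here is a minimal sketch (not the paper's exact pipeline) of how both kinds of embeddings are typically extracted from a pre-trained protein language model. It assumes HuggingFace transformers, the small facebook/esm2_t6_8M_UR50D checkpoint, and mean pooling as the sequence-level variant; the paper's checkpoints and pooling strategies may differ.

```python
# Minimal sketch (not the paper's exact pipeline): extracting amino-acid-level
# and sequence-level embeddings from a pre-trained protein language model.
# Assumptions: HuggingFace transformers, the small facebook/esm2_t6_8M_UR50D
# checkpoint, and mean pooling as the sequence-level variant.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "facebook/esm2_t6_8M_UR50D"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME).eval()

# Placeholder protein sequence (not an actual AAV capsid segment).
sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVKALPDAQFEVVHSLAK"

inputs = tokenizer(sequence, return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # (1, L+2, d), incl. special tokens

# Amino-acid-level representation: one vector per residue
# (strip the BOS/EOS special tokens added by the tokenizer).
per_residue = hidden[0, 1:-1]           # shape (L, d)

# Sequence-level representation: mean pooling over residue positions.
per_sequence = per_residue.mean(dim=0)  # shape (d,)

print(per_residue.shape, per_sequence.shape)
```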

If this is right

  • Fine-tuning with task labels becomes necessary to extract usable signals from embeddings when mutations are sparse or localized.
  • Sequence-level representations gain an advantage over amino-acid-level ones once fine-tuning occurs.
  • Pre-trained embeddings alone cannot reliably capture functional effects from the small sequence changes typical in bioengineering.
  • Bioengineering experiments may need to introduce larger sequence variations to observe representation shifts without fine-tuning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar fine-tuning steps could improve results in other protein design tasks that rely on localized mutations rather than broad sequence changes.
  • Hybrid approaches combining frozen pre-trained layers with light task-specific adaptation layers might balance performance and data efficiency.
  • New benchmarks focused on mutation sensitivity could help quantify how much sequence change is needed before pre-trained representations shift meaningfully; a minimal probe of this kind is sketched below.
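A minimal probe of that kind, assuming the same HuggingFace stack and small ESM2 checkpoint as the sketch above, might measure how far a mean-pooled embedding drifts as random substitutions accumulate relative to a reference. The mutation model, checkpoint, and distance metric are illustrative assumptions, not the paper's protocol.

```python
# Hedged sketch of a mutation-sensitivity probe (not from the paper): how far
# does a mean-pooled ESM2 embedding drift as random substitutions accumulate
# relative to a reference sequence? Checkpoint, mutation model, and distance
# metric are illustrative assumptions.
import random

import torch
from transformers import AutoModel, AutoTokenizer

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
MODEL_NAME = "facebook/esm2_t6_8M_UR50D"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME).eval()


def embed_mean(seq: str) -> torch.Tensor:
    """Mean-pooled sequence-level embedding, excluding special tokens."""
    inputs = tokenizer(seq, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0, 1:-1]
    return hidden.mean(dim=0)


def mutate(seq: str, n_mut: int, rng: random.Random) -> str:
    """Apply n_mut random substitutions at distinct positions."""
    positions = rng.sample(range(len(seq)), n_mut)
    chars = list(seq)
    for p in positions:
        chars[p] = rng.choice([a for a in AMINO_ACIDS if a != chars[p]])
    return "".join(chars)


rng = random.Random(0)
# Placeholder reference sequence, not an AAV capsid segment.
reference = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVKALPDAQFEVVHSLAK"
ref_emb = embed_mean(reference)

for n_mut in (1, 2, 5, 10, 20):
    variant = mutate(reference, n_mut, rng)
    shift = 1 - torch.cosine_similarity(ref_emb, embed_mean(variant), dim=0)
    print(f"{n_mut:2d} substitutions -> cosine distance {shift.item():.4f}")
```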

Load-bearing premise

The AAV capsid viability dataset and the supervised or unsupervised tasks selected represent general protein bioengineering problems that involve sparse or localized mutations.

What would settle it

A demonstration, on a new dataset with similarly localized mutations, that pre-trained embeddings achieve equal or superior predictive accuracy without any task-specific fine-tuning would falsify the central claim.

Figures

Figures reproduced from arXiv: 2602.14828 by Ana F. Rodrigues, Catia Pesquita, Laura Balbi, Lucas Ferraz, Pedro Giesteira Cotovio.

Figure 2. Hierarchical clustering of the different representation formats evaluated in this study, colored by viability (orange = viable, black = non-viable), and annotated by design strategy (ML- or non-ML-designed sequences). Note the scale differences between OHE and embedding-based representations.

Figure 3. t-SNE plots of the different representation formats, colored by viability (left panels) or by design strategy (right panels).

Figure 4. Composition of sequence groups. Groups of sequences sharing functional features (classification difficulty and correctness across all model-representation pairs featuring the specified representation format) are shown in terms of viability (left panel) and design strategy (right panel). To further investigate the origin of these outcomes, the mutational landscape of each group, i.e., the frequency and type…

Figure 5. Mutational landscape analysis. Distribution of mutation types (deletions, insertions, and substitutions) across amino acid positions 561 to 588 in the targeted region, expressed as the percentage of change relative to the reference sequence. Data are presented for the following sequence groups: non-ML-designed, ML-designed, viable, non-viable, easy and difficult sequences. For difficult sequences, we c…

Figure 7. t-SNE plots of the different ProtBERT and ESM2 embedding variants after fine-tuning, colored by viability (top panels) or by design strategy (bottom panels).

Figure 8. Cumulative variance explained by the top components of fine-tuning-induced changes in the embedding space for different embedding types. The dashed line indicates the 95% cumulative variance threshold.
Original abstract

Effective representations of protein sequences are widely recognized as a cornerstone of machine learning-based protein design. Yet, protein bioengineering poses unique challenges for sequence representation, as experimental datasets typically feature few mutations, which are either sparsely distributed across the entire sequence or densely concentrated within localized regions. This limits the ability of sequence-level representations to extract functionally meaningful signals. In addition, comprehensive comparative studies remain scarce, despite their crucial role in clarifying which representations best encode relevant information and ultimately support superior predictive performance. In this study, we systematically evaluate multiple ProtBERT and ESM2 embedding variants as sequence representations, using the adeno-associated virus capsid as a case study and prototypical example of bioengineering, where functional optimization is targeted through highly localized sequence variation within an otherwise large protein. Our results reveal that, prior to fine-tuning, amino acid-level embeddings outperform sequence-level representations in supervised predictive tasks, whereas the latter tend to be more effective in unsupervised settings. However, optimal performance is only achieved when embeddings are fine-tuned with task-specific labels, with sequence-level representations providing the best performance. Moreover, our findings indicate that the extent of sequence variation required to produce notable shifts in sequence representations exceeds what is typically explored in bioengineering studies, showing the need for fine-tuning in datasets characterized by sparse or highly localized mutations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper evaluates multiple variants of ProtBERT and ESM2 pre-trained embeddings as representations for predicting AAV capsid viability, using this as a case study for protein bioengineering with sparse or localized mutations. Key findings are that amino-acid-level embeddings outperform sequence-level ones in supervised tasks before fine-tuning (while the reverse holds for unsupervised settings), but optimal performance requires task-specific fine-tuning, after which sequence-level representations are best; additionally, the degree of sequence variation needed to induce notable shifts in representations exceeds levels typical in bioengineering studies.

Significance. If the empirical patterns hold under broader validation, the work usefully documents the practical limits of frozen pre-trained embeddings for downstream tasks with limited mutational diversity, and supplies concrete evidence that fine-tuning is often required. This could help steer the field away from over-reliance on off-the-shelf representations in protein-design pipelines.

major comments (2)
  1. [Abstract / Results] The central claim that 'the extent of sequence variation required to produce notable shifts in sequence representations exceeds what is typically explored in bioengineering studies' is load-bearing for the recommendation to fine-tune, yet the manuscript provides no quantitative comparison (e.g., mean or distribution of Hamming distances or number of mutated positions) between the AAV dataset and representative literature values for other localized-mutation engineering campaigns.
  2. [Methods] The abstract states clear comparative results, but the absence of reported statistical tests, cross-validation details, negative controls, or explicit checks against post-hoc model selection makes it impossible to determine whether the reported performance gaps (pre- vs. post-fine-tuning, amino-acid- vs. sequence-level) are robust or could be driven by dataset-specific artifacts.
minor comments (1)
  1. [Methods] Notation for embedding variants (ProtBERT vs. ESM2 sizes, pooling strategies) should be defined once in a table or methods subsection rather than repeated inline.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful and constructive comments, which have identified important opportunities to strengthen the manuscript. We address each major comment below and outline the revisions we will make.

Point-by-point responses
  1. Referee: [Abstract / Results] The central claim that 'the extent of sequence variation required to produce notable shifts in sequence representations exceeds what is typically explored in bioengineering studies' is load-bearing for the recommendation to fine-tune, yet the manuscript provides no quantitative comparison (e.g., mean or distribution of Hamming distances or number of mutated positions) between the AAV dataset and representative literature values for other localized-mutation engineering campaigns.

    Authors: We agree that a direct quantitative comparison would make the claim more robust and actionable. In the revised manuscript we will add a new paragraph (and accompanying table) in the Results section that reports the mean and distribution of Hamming distances and the number of mutated positions in the AAV viability dataset. We will then compare these statistics to representative values drawn from the literature on localized-mutation campaigns (e.g., enzyme active-site engineering and antibody affinity maturation). This addition will provide concrete support for the statement that the mutational diversity in our case study is typical of the bioengineering settings we discuss. An illustrative sketch of such Hamming-distance statistics appears after these responses. revision: yes

  2. Referee: [Methods] The abstract states clear comparative results, but the absence of reported statistical tests, cross-validation details, negative controls, or explicit checks against post-hoc model selection makes it impossible to determine whether the reported performance gaps (pre- vs. post-fine-tuning, amino-acid- vs. sequence-level) are robust or could be driven by dataset-specific artifacts.

    Authors: We acknowledge that the current Methods section lacks several elements needed to fully demonstrate statistical robustness. In the revision we will (i) expand the description of the cross-validation procedure (including the number of folds and how train/validation/test splits were constructed), (ii) report the results of paired statistical tests (e.g., Wilcoxon signed-rank or t-tests with appropriate multiple-testing correction) on the performance differences, (iii) include negative-control baselines such as random embeddings and label-shuffled controls, and (iv) clarify the model-selection protocol to show that hyper-parameter choices were fixed prior to final evaluation. These additions will be placed in a dedicated subsection of Methods and referenced in the Results. A sketch of the kind of paired test we have in mind also appears below. revision: yes
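The following is a minimal, illustrative sketch of the per-variant Hamming-distance statistics requested in point 1. The reference and variant sequences are placeholders (not the actual AAV2 capsid region), and the sketch assumes variants already aligned to equal length; the real dataset also contains insertions and deletions, which would require an alignment step first.

```python
# Illustrative sketch of the mutational-diversity statistics requested in
# point 1: per-variant Hamming distance to a reference over an aligned,
# fixed-length region. Sequences below are placeholders, not AAV2 capsid data.
import numpy as np


def hamming(ref: str, var: str) -> int:
    """Number of positions at which an aligned variant differs from the reference."""
    assert len(ref) == len(var), "sequences must be aligned to equal length"
    return sum(a != b for a, b in zip(ref, var))


reference = "ACDEFGHIKLMNPQRSTVWYACDEFGHI"   # placeholder 28-residue region
variants = [
    "ACDEFGHIKLMNPQRSTVWYACDEFGHL",       # 1 substitution
    "ACDEFGHIKLANPQRSTVWYACDEFGHI",       # 1 substitution
    "ACDEFGAIKLMNPQRSTVWFACDEFGHI",       # 2 substitutions
]

distances = np.array([hamming(reference, v) for v in variants])
print(f"mean Hamming distance: {distances.mean():.2f}")
print(f"min={distances.min()}, median={np.median(distances):.0f}, max={distances.max()}")
```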
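And a hedged sketch of the paired significance testing proposed in point 2: a Wilcoxon signed-rank test over per-fold accuracies of two settings. The fold scores are hypothetical placeholders, not values from the paper.

```python
# Hedged sketch of the paired significance testing proposed in point 2:
# a Wilcoxon signed-rank test on per-fold accuracies of two settings.
# The fold scores are hypothetical placeholders, not values from the paper.
import numpy as np
from scipy.stats import wilcoxon

# Hypothetical 5-fold cross-validation accuracies over the same folds.
acc_pre_finetune = np.array([0.78, 0.80, 0.77, 0.79, 0.81])
acc_post_finetune = np.array([0.81, 0.85, 0.83, 0.86, 0.90])

stat, p_value = wilcoxon(acc_post_finetune, acc_pre_finetune, alternative="greater")
print(f"Wilcoxon signed-rank: statistic={stat:.2f}, p={p_value:.4f}")

# A label-shuffled negative control would rerun the same training pipeline with
# permuted viability labels; performance should then collapse to chance level.
```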

Circularity Check

0 steps flagged

No circularity: empirical evaluation of pre-trained embeddings on external AAV dataset

Full rationale

The paper performs a systematic empirical comparison of ProtBERT and ESM2 embedding variants on the AAV capsid viability dataset. All performance claims (pre- vs post-fine-tuning, sequence- vs amino-acid-level representations, variation thresholds) are derived directly from supervised/unsupervised task results on held-out data. No derivations, equations, or first-principles results are presented that reduce to fitted parameters or self-referential definitions. Self-citations, if present, support background on embeddings or datasets but are not load-bearing for the central empirical findings. The study is self-contained against external benchmarks and does not invoke uniqueness theorems or ansatzes from prior author work.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard domain assumptions about the transferability of general protein language model embeddings to specific bioengineering tasks and the representativeness of the AAV case study for localized mutation scenarios. No free parameters are explicitly introduced or fitted beyond standard model training, and no new entities are postulated.

axioms (2)
  • domain assumption Pre-trained protein language models capture transferable sequence information that can be adapted via fine-tuning to localized mutation tasks
    Invoked throughout the evaluation of ProtBERT and ESM2 variants before and after fine-tuning.
  • domain assumption The AAV capsid dataset exemplifies typical bioengineering constraints of sparse or highly localized sequence variation
    Used as the prototypical case study to draw general conclusions about embedding limits.

pith-pipeline@v0.9.0 · 5555 in / 1396 out tokens · 54951 ms · 2026-05-15T21:47:45.909871+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Foundation/BranchSelection.lean · branch_selection · unclear

    Relation between the paper passage and the cited Recognition theorem is unclear. Paper passage:

    "optimal performance is only achieved when embeddings are fine-tuned with task-specific labels, with sequence-level representations providing the best performance. Moreover, the extent of sequence variation required to produce notable shifts in sequence representations exceeds what is typically explored in bioengineering studies."

  • IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · LogicNat recovery · unclear

    Relation between the paper passage and the cited Recognition theorem is unclear. Paper passage:

    "We systematically evaluate multiple ProtBERT and ESM2 embedding variants as sequence representations, using the adeno-associated virus capsid as a case study"

What do these tags mean?
  • matches: the paper's claim is directly supported by a theorem in the formal canon.
  • supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: the paper appears to rely on the theorem as machinery.
  • contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

59 extracted references · 59 canonical work pages · 2 internal anchors

  1. [1]

    A critical aspect of this approach is selecting an appropriate format to represent the protein sequence as input for the ML model

    Introduction Machine learning (ML)-based protein design has become a powerful strategy in modern protein engineering 1. A critical aspect of this approach is selecting an appropriate format to represent the protein sequence as input for the ML model. The optimal representation depends on factors such as the specific task to be learnt and dataset charact...

  2. [2]

    Results This study investigates how pre-trained embeddings perform as representations of protein sequences in ML tasks relevant to protein bioengineering, particularly when datasets contain only small and localized sequence changes, a common scenario in this field. We examine different embedding variants that can be generated directly by ProtBERT and ES...

  3. [3]

    However, pre-trained embeddings are inherently general-purpose representations that capture the broad grammar of protein language

    Discussion Embeddings generated with pLMs are currently a leading choice as ML-friendly formats to represent protein sequences due to their ability to capture rich, high-dimensional information about sequence context and short- to long-range interactions between amino acids 11. However, pre-trained embeddings are inherently general-purpose representation...

  4. [4]

    representations

    Methods 4.1. Dataset, data preprocessing, and mutation landscape analysis The data used in this work is part of a dataset published by Bryant et al. (2021) 5, which reports a comprehensive study for machine-guided AAV2 capsid diversification. The set comprises 296,968 variants of the AAV2 capsid protein, both viable (153,691) and non-viable (143,278), a...

  5. [5]

    Kouba, P. et al. Machine Learning-Guided Protein Engineering. ACS Catal. 13, 13863–13895 (2023)

  6. [6]

    Yue, Z.-X. et al. A systematic review on the state-of-the-art strategies for protein representation. Computers in Biology and Medicine 152, 106440 (2023)

  7. [7]

    Harding-Larsen, D. et al. Protein representations: Encoding biological information for machine learning in biocatalysis. Biotechnology Advances 77, 108459 (2024)

  8. [8]

    J., Kelsic, E

    Ogden, P. J., Kelsic, E. D., Sinai, S. & Church, G. M. Comprehensive AAV capsid fitness landscape reveals a viral gene and enables machine-guided design. Science 366, 1139–1143 (2019)

  9. [9]

    Bryant, D. H. et al. Deep diversification of an AAV capsid protein by machine learning. Nat Biotechnol 39, 691–696 (2021)

  10. [10]

    Marques, A. D. et al. Applying machine learning to predict viral assembly for adeno-associated virus capsid libraries. Molecular Therapy - Methods & Clinical Development 20, 276–286 (2021)

  11. [11]

    Han, Z. et al. Computer-Aided Directed Evolution Generates Novel AAV Variants with High Transduction Efficiency. Viruses 15, 848 (2023)

  12. [12]

    & Damborsky, J

    Mazurenko, S., Prokop, Z. & Damborsky, J. Machine Learning in Enzyme Engineering. ACS Catal. 10, 1210–1223 (2020)

  13. [13]

    & Eslami, H

    Rezaee, K. & Eslami, H. Bridging machine learning and peptide design for cancer treatment: a comprehensive review. Artif Intell Rev 58, 156 (2025)

  14. [14]

    M., Baber, J

    Tian, P., Louis, J. M., Baber, J. L., Aniana, A. & Best, R. B. Co-Evolutionary Fitness Landscapes for Sequence Design. Angew Chem Int Ed 57, 5674–5678 (2018)

  15. [16]

    & Zou, Q

    Cui, F., Zhang, Z. & Zou, Q. Sequence representation approaches for sequence-based protein prediction tasks that use deep learning. Briefings in Functional Genomics 20, 61–73 (2021)

  16. [18]

    Zhao, Z., Alzubaidi, L., Zhang, J., Duan, Y. & Gu, Y. A comparison review of transfer learning and self-supervised learning: Definitions, applications, advantages and limitations. Expert Systems with Applications 242, 122807 (2024)

  17. [19]

    & Rost, B

    Schmirler, R., Heinzinger, M. & Rost, B. Fine-tuning protein language models boosts predictions across diverse tasks. Nat Commun 15 (2024)

  18. [21]

    Niu, Z., Zhong, G. & Yu, H. A review on the attention mechanism of deep learning. Neurocomputing 452, 48–62 (2021)

  19. [22]

    Rao, R. et al. Evaluating Protein Transfer Learning with TAPE. Adv Neural Inf Process Syst 32, 9689–9701 (2019)

  20. [23]

    Capel, H. et al. ProteinGLUE multi-task benchmark suite for self-supervised protein modeling. Sci Rep 12, 16047 (2022)

  21. [24]

    Kulmanov, M. et al. Protein function prediction as approximate semantic entailment. Nat Mach Intell 6, 220–228 (2024)

  22. [25]

    Vu, T. T. D. & Jung, J. Gene Ontology based protein functional annotation using pretrained embeddings. in 2022 IEEE International Conference on Bioinformatics and Biomedicine (BIBM) 3893–3895 (IEEE, Las Vegas, NV, USA, 2022). doi:10.1109/BIBM55620.2022.9995108

  23. [26]

    Villegas-Morcillo, A., Gomez, A. M. & Sanchez, V. An analysis of protein language model embeddings for fold prediction. Briefings in Bioinformatics 23, bbac142 (2022)

  24. [27]

    Wu, F., Jing, X., Luo, X. & Xu, J. Improving protein structure prediction using templates and sequence embedding. Bioinformatics 39, btac723 (2023)

  25. [28]

    Gao, Q., Zhang, C., Li, M. & Yu, T. Protein–Protein Interaction Prediction Model Based on ProtBert-BiGRU-Attention. Journal of Computational Biology 31, 797–814 (2024)

  26. [29]

    & Lim, C

    Sargsyan, K. & Lim, C. Using protein language models for protein interaction hot spot prediction with limited data. BMC Bioinformatics 25, 115 (2024)

  27. [30]

    Rives, A. et al. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proc. Natl. Acad. Sci. U.S.A. 118, e2016239118 (2021)

  28. [31]

    Lin, Z. et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science 379, 1123–1130 (2023)

  29. [32]

    Hayes, T. et al. Simulating 500 million years of evolution with a language model. Science 387, 850–858 (2025)

  30. [33]

    Elnaggar, A. et al. ProtTrans: Toward Understanding the Language of Life Through Self-Supervised Learning. IEEE Trans. Pattern Anal. Mach. Intell. 44, 7112–7127 (2022)

  31. [34]

    BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

    Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Preprint at https://doi.org/10.48550/arXiv.1810.04805 (2019)

  32. [36]

    Villegas-Morcillo, A. et al. Unsupervised protein embeddings outperform hand-crafted sequence and structure features at predicting molecular function. Bioinformatics 37, 162–170 (2021)

  33. [37]

    ElAbd, H. et al. Amino acid encoding for deep learning applications. BMC Bioinformatics 21 (2020)

  34. [38]

    L., Mandwie, M., Alexander, I

    Ginn, S. L., Mandwie, M., Alexander, I. E., Edelstein, M. & Abedi, M. R. Gene therapy clinical trials worldwide to 2023—an update. The Journal of Gene Medicine 26, e3721 (2024)

  35. [39]

    & Grimm, D

    Becker, J., Fakhiri, J. & Grimm, D. Fantastic AAV Gene Therapy Vectors and How to Find Them—Random Diversification, Rational Design and Machine Learning. Pathogens 11, 756 (2022)

  36. [40]

    Vu Hong, A. et al. An engineered AAV targeting integrin alpha V beta 6 presents improved myotropism across species. Nat Commun 15, 7965 (2024)

  37. [41]

    King, S. H. et al. Generative design of novel bacteriophages with genome language models. Preprint at https://doi.org/10.1101/2025.09.12.675911 (2025)

  38. [42]

    Wu, J. et al. Prediction of Adeno-Associated Virus Fitness with a Protein Language-Based Machine Learning Model. Human Gene Therapy 36, 823–829 (2025)

  39. [43]

    Eid, F.-E. et al. Systematic multi-trait AAV capsid engineering for efficient gene delivery. Nat Commun 15, 6602 (2024)

  40. [44]

    Wu, P. et al. Mutational Analysis of the Adeno-Associated Virus Type 2 (AAV2) Capsid Gene and Construction of AAV2 Vectors with Altered Tropism. J Virol 74, 8635–8647 (2000)

  41. [45]

    & Geoffrey, H

    van der Maaten, L. & Hinton, G. Visualizing data using t-SNE. Journal of Machine Learning Research 9, 2579–2605 (2008)

  42. [46]

    A Survey of Clustering Data Mining Techniques

    Berkhin, P. A Survey of Clustering Data Mining Techniques. in Grouping Multidimensional Data 25–71 (Springer-Verlag, Berlin/Heidelberg). doi:10.1007/3-540-28349-8_2

  43. [47]

    Starr, T. N. et al. Shifting mutational constraints in the SARS-CoV-2 receptor-binding domain during viral evolution. Science 377, 420–424 (2022)

  44. [48]

    Cheng, P. et al. Zero-shot prediction of mutation effects with multimodal deep representation learning guides protein engineering. Cell Res 34, 630–647 (2024)

  45. [49]

    & Singh, M

    Hoang, M. & Singh, M. Locality-aware pooling enhances protein language model performance across varied applications. Bioinformatics 41, i217–i226 (2025)

  46. [50]

    & Linial, M

    Ofer, D., Brandes, N. & Linial, M. The language of proteins: NLP, machine learning & protein sequences. Computational and Structural Biotechnology Journal 19, 1750–1758 (2021)

  47. [51]

    K., Wu, Z

    Yang, K. K., Wu, Z. & Arnold, F. H. Machine-learning-guided directed evolution for protein engineering. Nat Methods 16, 687–694 (2019)

  48. [52]

    Su, J. et al. SaProt: Protein language modeling with structure-aware vocabulary. bioRxiv 2023.10.01.560349 (2023)

  49. [53]

    Li, M. et al. ProSST: Protein language modeling with quantized structure and disentangled attention. Advances in Neural Information Processing Systems 37, 35700–35726 (2024)

  50. [54]

    & Mofrad, M

    Dickson, A. & Mofrad, M. R. K. Fine-tuning protein embeddings for functional similarity evaluation. Bioinformatics 40, btae445 (2024)

  51. [55]

    Kang, H. et al. Fine-tuning of BERT Model to Accurately Predict Drug–Target Interactions. Pharmaceutics 14, 1710 (2022)

  52. [56]

    & Luo, X

    Li, Y., Zou, Q., Dai, Q., Stalin, A. & Luo, X. Identifying the DNA methylation preference of transcription factors using ProtBERT and SVM. PLoS Comput Biol 21, e1012513 (2025)

  53. [57]

    & Singh, R

    NaderiAlizadeh, N. & Singh, R. Aggregating residue-level protein language model embeddings with optimal transport. Bioinformatics Advances 5, vbaf060 (2024)

  54. [58]

    & Selim, B

    Anowar, F., Sadaoui, S. & Selim, B. Conceptual and empirical comparison of dimensionality reduction algorithms (PCA, KPCA, LDA, MDS, SVD, LLE, ISOMAP, LE, ICA, t-SNE). Computer Science Review 40, 100378 (2021)

  55. [59]

    & Söding, J

    Steinegger, M. & Söding, J. Clustering huge protein sequence sets in linear time. Nat Commun 9, 2542 (2018)

  56. [60]

    Transformers: State-of-the-Art Natural Language Processing

    The Hugging Face Community. Transformers: State-of-the-Art Natural Language Processing. (2019)

  57. [61]

    Pedregosa, F. et al. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)

  58. [62]

    Similarity of Neural Network Representations Revisited

    Kornblith, S., Norouzi, M., Lee, H. & Hinton, G. Similarity of Neural Network Representations Revisited. Preprint at https://doi.org/10.48550/arXiv.1905.00414 (2019)

  59. [63]

    Schönemann, P. H. A Generalized Solution of the Orthogonal Procrustes Problem. Psychometrika 31, 1–10 (1966). Funding This work was supported by FCT - Fundação para a Ciência e Tecnologia, I.P. under the LASIGE Research Unit, ref. UID/00408/2025, DOI https://doi.org/10.54499/UID/00408/2025, and partially supported by project 41, HfPT: Health from Portu...