Recognition: 2 Lean theorem links
Exploring the limits of pre-trained embeddings in machine-guided protein design: a case study on predicting AAV vector viability
Pith reviewed 2026-05-15 21:47 UTC · model grok-4.3
The pith
Pre-trained protein embeddings require task-specific fine-tuning to reach optimal performance in predicting AAV capsid viability; once fine-tuned, sequence-level representations outperform amino-acid-level ones.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Optimal performance in predicting AAV vector viability from pre-trained embeddings occurs exclusively after fine-tuning with task-specific labels, at which stage sequence-level representations deliver the strongest results. Prior to fine-tuning, amino-acid-level embeddings prove superior for supervised prediction while sequence-level embeddings suit unsupervised tasks better. The degree of sequence variation required to induce meaningful changes in the representations surpasses the localized mutations typically tested in bioengineering, indicating that fine-tuning is essential when datasets feature sparse or regionally concentrated changes.
What carries the argument
Comparison of ProtBERT and ESM2 embedding variants as sequence representations for AAV capsid viability, evaluated before and after task-specific fine-tuning while distinguishing amino-acid-level from sequence-level outputs.
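The distinction between the two embedding granularities is easiest to see in code. A minimal sketch, using random vectors as stand-ins for real ProtBERT or ESM2 outputs, and mean pooling as one common (but not the only) aggregation strategy:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a pLM output: one vector per residue.
# A real model (ProtBERT, ESM2) would compute these from the sequence.
seq_len, dim = 5, 8
per_residue = rng.normal(size=(seq_len, dim))   # amino-acid-level embedding

# Sequence-level embedding via mean pooling over residue positions.
sequence_level = per_residue.mean(axis=0)

print(per_residue.shape)     # (5, 8): one vector per amino acid
print(sequence_level.shape)  # (8,): one vector per sequence
```

On this reading, the pre-fine-tuning supervised models consume something like `per_residue`, while the unsupervised analyses start from a pooled vector like `sequence_level`.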
If this is right
- Fine-tuning with task labels becomes necessary to extract usable signals from embeddings when mutations are sparse or localized.
- Sequence-level representations gain an advantage over amino-acid-level ones once fine-tuning occurs.
- Pre-trained embeddings alone cannot reliably capture functional effects from the small sequence changes typical in bioengineering.
- Bioengineering experiments may need to introduce larger sequence variations to observe representation shifts without fine-tuning.
Where Pith is reading between the lines
- Similar fine-tuning steps could improve results in other protein design tasks that rely on localized mutations rather than broad sequence changes.
- Hybrid approaches combining frozen pre-trained layers with light task-specific adaptation layers might balance performance and data efficiency.
- New benchmarks focused on mutation sensitivity could help quantify how much sequence change is needed before pre-trained representations shift meaningfully.
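The hybrid idea in the second bullet can be made concrete with a minimal sketch: features stay frozen (random vectors stand in for pre-trained embeddings here) and only a small logistic head is fit to task labels. The data, dimensions, and training loop are illustrative assumptions, not the paper's setup:

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in for frozen pre-trained embeddings: fixed feature vectors.
n, dim = 200, 16
X = rng.normal(size=(n, dim))
w_true = rng.normal(size=dim)
y = (X @ w_true > 0).astype(float)   # synthetic viable / non-viable labels

# Light adaptation layer: a logistic head trained by gradient descent,
# leaving the "backbone" features X untouched.
w, b, lr = np.zeros(dim), 0.0, 0.1
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    w -= lr * (X.T @ (p - y)) / n
    b -= lr * (p - y).mean()

acc = (((X @ w + b) > 0) == (y > 0.5)).mean()
print(f"training accuracy of the light head: {acc:.2f}")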
Load-bearing premise
The AAV capsid viability dataset and the supervised or unsupervised tasks selected represent general protein bioengineering problems that involve sparse or localized mutations.
What would settle it
Demonstration on a new dataset with similarly localized mutations that pre-trained embeddings achieve equal or superior predictive accuracy without any task-specific fine-tuning would falsify the central claim.
Figures
Original abstract
Effective representations of protein sequences are widely recognized as a cornerstone of machine learning-based protein design. Yet, protein bioengineering poses unique challenges for sequence representation, as experimental datasets typically feature few mutations, which are either sparsely distributed across the entire sequence or densely concentrated within localized regions. This limits the ability of sequence-level representations to extract functionally meaningful signals. In addition, comprehensive comparative studies remain scarce, despite their crucial role in clarifying which representations best encode relevant information and ultimately support superior predictive performance. In this study, we systematically evaluate multiple ProtBERT and ESM2 embedding variants as sequence representations, using the adeno-associated virus capsid as a case study and prototypical example of bioengineering, where functional optimization is targeted through highly localized sequence variation within an otherwise large protein. Our results reveal that, prior to fine-tuning, amino acid-level embeddings outperform sequence-level representations in supervised predictive tasks, whereas the latter tend to be more effective in unsupervised settings. However, optimal performance is only achieved when embeddings are fine-tuned with task-specific labels, with sequence-level representations providing the best performance. Moreover, our findings indicate that the extent of sequence variation required to produce notable shifts in sequence representations exceeds what is typically explored in bioengineering studies, showing the need for fine-tuning in datasets characterized by sparse or highly localized mutations.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper evaluates multiple variants of ProtBERT and ESM2 pre-trained embeddings as representations for predicting AAV capsid viability, using this as a case study for protein bioengineering with sparse or localized mutations. Key findings are that amino-acid-level embeddings outperform sequence-level ones in supervised tasks before fine-tuning (while the reverse holds for unsupervised settings), but optimal performance requires task-specific fine-tuning, after which sequence-level representations are best; additionally, the degree of sequence variation needed to induce notable shifts in representations exceeds levels typical in bioengineering studies.
Significance. If the empirical patterns hold under broader validation, the work usefully documents the practical limits of frozen pre-trained embeddings for downstream tasks with limited mutational diversity, and supplies concrete evidence that fine-tuning is often required. This could help steer the field away from over-reliance on off-the-shelf representations in protein-design pipelines.
major comments (2)
- [Abstract / Results] Abstract and Results: The central claim that 'the extent of sequence variation required to produce notable shifts in sequence representations exceeds what is typically explored in bioengineering studies' is load-bearing for the recommendation to fine-tune, yet the manuscript provides no quantitative comparison (e.g., mean or distribution of Hamming distances or number of mutated positions) between the AAV dataset and representative literature values for other localized-mutation engineering campaigns.
- [Methods] Methods / Experimental Setup: The abstract states clear comparative results, but the absence of reported statistical tests, cross-validation details, negative controls, or explicit checks against post-hoc model selection makes it impossible to determine whether the reported performance gaps (pre- vs. post-fine-tuning, amino-acid vs. sequence level) are robust or could be driven by dataset-specific artifacts.
minor comments (1)
- [Methods] Notation for embedding variants (ProtBERT vs. ESM2 sizes, pooling strategies) should be defined once in a table or methods subsection rather than repeated inline.
Simulated Author's Rebuttal
We thank the referee for their thoughtful and constructive comments, which have identified important opportunities to strengthen the manuscript. We address each major comment below and outline the revisions we will make.
Point-by-point responses
- Referee: [Abstract / Results] Abstract and Results: The central claim that 'the extent of sequence variation required to produce notable shifts in sequence representations exceeds what is typically explored in bioengineering studies' is load-bearing for the recommendation to fine-tune, yet the manuscript provides no quantitative comparison (e.g., mean or distribution of Hamming distances or number of mutated positions) between the AAV dataset and representative literature values for other localized-mutation engineering campaigns.
Authors: We agree that a direct quantitative comparison would make the claim more robust and actionable. In the revised manuscript we will add a new paragraph (and accompanying table) in the Results section that reports the mean and distribution of Hamming distances and the number of mutated positions in the AAV viability dataset. We will then compare these statistics to representative values drawn from the literature on localized-mutation campaigns (e.g., enzyme active-site engineering and antibody affinity maturation). This addition will provide concrete support for the statement that the mutational diversity in our case study is typical of the bioengineering settings we discuss. revision: yes
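The proposed table boils down to a few lines of code. A toy sketch with an invented wild-type and variant set (the real computation would run over the Bryant et al. AAV capsid sequences):

```python
from collections import Counter

# Illustrative toy data only; the proposed table would report these
# statistics for the actual AAV viability dataset.
wild_type = "MKTAYIA"
variants = ["MKTAYIA", "MKSAYIA", "MKSAYLA", "AKSAYIV"]

def hamming(a, b):
    """Number of mismatched positions between two equal-length sequences."""
    assert len(a) == len(b)
    return sum(x != y for x, y in zip(a, b))

dists = [hamming(wild_type, v) for v in variants]
mean_dist = sum(dists) / len(dists)

print("distances:", dists)                  # [0, 1, 2, 3]
print("mean Hamming distance:", mean_dist)  # 1.5
print("distribution:", dict(Counter(dists)))
```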
- Referee: [Methods] Methods / Experimental Setup: The abstract states clear comparative results, but the absence of reported statistical tests, cross-validation details, negative controls, or explicit checks against post-hoc model selection makes it impossible to determine whether the reported performance gaps (pre- vs. post-fine-tuning, amino-acid vs. sequence level) are robust or could be driven by dataset-specific artifacts.
Authors: We acknowledge that the current Methods section lacks several elements needed to fully demonstrate statistical robustness. In the revision we will (i) expand the description of the cross-validation procedure (including the number of folds and how train/validation/test splits were constructed), (ii) report the results of paired statistical tests (e.g., Wilcoxon signed-rank or t-tests with appropriate multiple-testing correction) on the performance differences, (iii) include negative-control baselines such as random embeddings and label-shuffled controls, and (iv) clarify the model-selection protocol to show that hyper-parameter choices were fixed prior to final evaluation. These additions will be placed in a dedicated subsection of Methods and referenced in the Results. revision: yes
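For point (ii), one distribution-free option alongside the Wilcoxon or t-tests the authors mention is a paired sign-flip permutation test. The per-fold scores below are invented for illustration only:

```python
import random

random.seed(0)

# Hypothetical paired scores: the same CV folds evaluated with two
# representations (values are illustrative, not taken from the paper).
seq_level = [0.91, 0.89, 0.93, 0.90, 0.92, 0.88, 0.94, 0.90]
aa_level  = [0.87, 0.86, 0.90, 0.88, 0.89, 0.85, 0.91, 0.87]
diffs = [a - b for a, b in zip(seq_level, aa_level)]

# Under H0 (no real difference), each fold's difference is equally
# likely to be positive or negative, so we randomly flip signs.
observed = sum(diffs) / len(diffs)
n_perm, extreme = 10000, 0
for _ in range(n_perm):
    perm = sum(d * random.choice((1, -1)) for d in diffs) / len(diffs)
    if abs(perm) >= abs(observed):
        extreme += 1
p_value = extreme / n_perm

print(f"mean paired difference: {observed:.3f}, p ~ {p_value:.4f}")
```

With eight folds all favoring one representation, the test rejects at conventional thresholds; label-shuffled controls (point iii) would be run through the same machinery.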
Circularity Check
No circularity: empirical evaluation of pre-trained embeddings on external AAV dataset
Full rationale
The paper performs a systematic empirical comparison of ProtBERT and ESM2 embedding variants on the AAV capsid viability dataset. All performance claims (pre- vs post-fine-tuning, sequence- vs amino-acid-level representations, variation thresholds) are derived directly from supervised/unsupervised task results on held-out data. No derivations, equations, or first-principles results are presented that reduce to fitted parameters or self-referential definitions. Self-citations, if present, support background on embeddings or datasets but are not load-bearing for the central empirical findings. The study is self-contained against external benchmarks and does not invoke uniqueness theorems or ansatzes from prior author work.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Pre-trained protein language models capture transferable sequence information that can be adapted via fine-tuning to localized mutation tasks
- domain assumption The AAV capsid dataset exemplifies typical bioengineering constraints of sparse or highly localized sequence variation
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/BranchSelection.lean · branch_selection · unclear?
unclear: Relation between the paper passage and the cited Recognition theorem.
"optimal performance is only achieved when embeddings are fine-tuned with task-specific labels, with sequence-level representations providing the best performance. Moreover, the extent of sequence variation required to produce notable shifts in sequence representations exceeds what is typically explored in bioengineering studies."
-
IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · LogicNat recovery · unclear?
unclear: Relation between the paper passage and the cited Recognition theorem.
"We systematically evaluate multiple ProtBERT and ESM2 embedding variants as sequence representations, using the adeno-associated virus capsid as a case study"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Introduction Machine learning (ML)-based protein design has become a powerful strategy in modern protein engineering [1]. A critical aspect of this approach is selecting an appropriate format to represent the protein sequence as input for the ML model. The optimal representation depends on factors such as the specific task to be learnt and dataset charact...
work page 2023
-
[2]
Results This study investigates how pre-trained embeddings perform as representations of protein sequences in ML tasks relevant to protein bioengineering, particularly when datasets contain only small and localized sequence changes, a common scenario in this field. We examine different embedding variants that can be generated directly by ProtBERT and ES...
work page 2021
-
[3]
Discussion Embeddings generated with pLMs are currently a leading choice as ML-friendly formats to represent protein sequences due to their ability to capture rich, high-dimensional information about sequence context and short- to long-range interactions between amino acids [11]. However, pre-trained embeddings are inherently general-purpose representation...
-
[4]
Methods 4.1. Dataset, data preprocessing, and mutation landscape analysis The data used in this work is part of a dataset published by Bryant et al. (2021) [5], which reports a comprehensive study for machine-guided AAV2 capsid diversification. The set comprises 296,968 variants of the AAV2 capsid protein, both viable (153,691) and non-viable (143,278), a...
work page 2021
-
[5]
Kouba, P. et al. Machine Learning-Guided Protein Engineering. ACS Catal. 13, 13863–13895 (2023)
work page 2023
-
[6]
Yue, Z.-X. et al. A systematic review on the state-of-the-art strategies for protein representation. Computers in Biology and Medicine 152, 106440 (2023)
work page 2023
-
[7]
Harding-Larsen, D. et al. Protein representations: Encoding biological information for machine learning in biocatalysis. Biotechnology Advances 77, 108459 (2024)
work page 2024
-
[8]
Ogden, P. J., Kelsic, E. D., Sinai, S. & Church, G. M. Comprehensive AAV capsid fitness landscape reveals a viral gene and enables machine-guided design. Science 366, 1139–1143 (2019)
work page 2019
-
[9]
Bryant, D. H. et al. Deep diversification of an AAV capsid protein by machine learning. Nat Biotechnol 39, 691–696 (2021)
work page 2021
-
[10]
Marques, A. D. et al. Applying machine learning to predict viral assembly for adeno-associated virus capsid libraries. Molecular Therapy - Methods & Clinical Development 20, 276–286 (2021)
work page 2021
-
[11]
Han, Z. et al. Computer-Aided Directed Evolution Generates Novel AAV Variants with High Transduction Efficiency. Viruses 15, 848 (2023)
work page 2023
-
[12]
Mazurenko, S., Prokop, Z. & Damborsky, J. Machine Learning in Enzyme Engineering. ACS Catal. 10, 1210–1223 (2020)
work page 2020
-
[13]
Rezaee, K. & Eslami, H. Bridging machine learning and peptide design for cancer treatment: a comprehensive review. Artif Intell Rev 58, 156 (2025)
work page 2025
-
[14]
Tian, P., Louis, J. M., Baber, J. L., Aniana, A. & Best, R. B. Co-Evolutionary Fitness Landscapes for Sequence Design. Angew Chem Int Ed 57, 5674–5678 (2018)
work page 2018
- [16]
-
[18]
Zhao, Z., Alzubaidi, L., Zhang, J., Duan, Y. & Gu, Y. A comparison review of transfer learning and self-supervised learning: Definitions, applications, advantages and limitations. Expert Systems with Applications 242, 122807 (2024)
work page 2024
- [19]
-
[21]
Niu, Z., Zhong, G. & Yu, H. A review on the attention mechanism of deep learning. Neurocomputing 452, 48–62 (2021)
work page 2021
-
[22]
Rao, R. et al. Evaluating Protein Transfer Learning with TAPE. Adv Neural Inf Process Syst 32, 9689–9701 (2019)
work page 2019
-
[23]
Capel, H. et al. ProteinGLUE multi-task benchmark suite for self-supervised protein modeling. Sci Rep 12, 16047 (2022)
work page 2022
-
[24]
Kulmanov, M. et al. Protein function prediction as approximate semantic entailment. Nat Mach Intell 6, 220–228 (2024)
work page 2024
-
[25]
Vu, T. T. D. & Jung, J. Gene Ontology based protein functional annotation using pretrained embeddings. in 2022 IEEE International Conference on Bioinformatics and Biomedicine (BIBM) 3893–3895 (IEEE, Las Vegas, NV, USA, 2022). doi:10.1109/BIBM55620.2022.9995108
-
[26]
Villegas-Morcillo, A., Gomez, A. M. & Sanchez, V. An analysis of protein language model embeddings for fold prediction. Briefings in Bioinformatics 23, bbac142 (2022)
work page 2022
-
[27]
Wu, F., Jing, X., Luo, X. & Xu, J. Improving protein structure prediction using templates and sequence embedding. Bioinformatics 39, btac723 (2023)
work page 2023
-
[28]
Gao, Q., Zhang, C., Li, M. & Yu, T. Protein–Protein Interaction Prediction Model Based on ProtBert-BiGRU-Attention. Journal of Computational Biology 31, 797–814 (2024)
work page 2024
- [29]
-
[30]
Rives, A. et al. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proc. Natl. Acad. Sci. U.S.A. 118, e2016239118 (2021)
work page 2021
-
[31]
Lin, Z. et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science 379, 1123–1130 (2023)
work page 2023
-
[32]
Hayes, T. et al. Simulating 500 million years of evolution with a language model. Science 387, 850–858 (2025)
work page 2025
-
[33]
Elnaggar, A. et al. ProtTrans: Toward Understanding the Language of Life Through Self-Supervised Learning. IEEE Trans. Pattern Anal. Mach. Intell. 44, 7112–7127 (2022)
work page 2022
-
[34]
Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Preprint at https://doi.org/10.48550/arXiv.1810.04805 (2019)
work page arXiv doi:10.48550/arXiv.1810.04805 2019
-
[36]
Villegas-Morcillo, A. et al. Unsupervised protein embeddings outperform hand-crafted sequence and structure features at predicting molecular function. Bioinformatics 37, 162–170 (2021)
work page 2021
-
[37]
ElAbd, H. et al. Amino acid encoding for deep learning applications. BMC Bioinformatics 21, (2020)
work page 2020
-
[38]
Ginn, S. L., Mandwie, M., Alexander, I. E., Edelstein, M. & Abedi, M. R. Gene therapy clinical trials worldwide to 2023—an update. The Journal of Gene Medicine 26, e3721 (2024)
work page 2023
-
[39]
Becker, J., Fakhiri, J. & Grimm, D. Fantastic AAV Gene Therapy Vectors and How to Find Them—Random Diversification, Rational Design and Machine Learning. Pathogens 11, 756 (2022)
work page 2022
-
[40]
Vu Hong, A. et al. An engineered AAV targeting integrin alpha V beta 6 presents improved myotropism across species. Nat Commun 15, 7965 (2024)
work page 2024
-
[41]
King, S. H. et al. Generative design of novel bacteriophages with genome language models. Preprint at https://doi.org/10.1101/2025.09.12.675911 (2025)
-
[42]
Wu, J. et al. Prediction of Adeno-Associated Virus Fitness with a Protein Language-Based Machine Learning Model. Human Gene Therapy 36, 823–829 (2025)
work page 2025
-
[43]
Eid, F.-E. et al. Systematic multi-trait AAV capsid engineering for efficient gene delivery. Nat Commun 15, 6602 (2024)
work page 2024
-
[44]
Wu, P. et al. Mutational Analysis of the Adeno-Associated Virus Type 2 (AAV2) Capsid Gene and Construction of AAV2 Vectors with Altered Tropism. J Virol 74, 8635–8647 (2000)
work page 2000
-
[45]
van der Maaten, L. & Hinton, G. Visualizing data using t-SNE. Journal of Machine Learning Research 9, 2579–2605 (2008)
work page 2008
-
[46]
Berkhin, P. A Survey of Clustering Data Mining Techniques. in Grouping Multidimensional Data 25–71 (Springer-Verlag, Berlin/Heidelberg). doi:10.1007/3-540-28349-8_2
-
[47]
Starr, T. N. et al. Shifting mutational constraints in the SARS-CoV-2 receptor-binding domain during viral evolution. Science 377, 420–424 (2022)
work page 2022
-
[48]
Cheng, P. et al. Zero-shot prediction of mutation effects with multimodal deep representation learning guides protein engineering. Cell Res 34, 630–647 (2024)
work page 2024
-
[49]
Hoang, M. & Singh, M. Locality-aware pooling enhances protein language model performance across varied applications. Bioinformatics 41, i217–i226 (2025)
work page 2025
-
[50]
Ofer, D., Brandes, N. & Linial, M. The language of proteins: NLP, machine learning & protein sequences. Computational and Structural Biotechnology Journal 19, 1750–1758 (2021)
work page 2021
- [51]
-
[52]
Su, J. et al. SaProt: Protein language modeling with structure-aware vocabulary. bioRxiv 2023.10.01.560349 (2023)
work page 2023
-
[53]
Li, M. et al. ProSST: Protein language modeling with quantized structure and disentangled attention. Advances in Neural Information Processing Systems 37, 35700–35726 (2024)
work page 2024
-
[54]
Dickson, A. & Mofrad, M. R. K. Fine-tuning protein embeddings for functional similarity evaluation. Bioinformatics 40, btae445 (2024)
work page 2024
-
[55]
Kang, H. et al. Fine-tuning of BERT Model to Accurately Predict Drug–Target Interactions. Pharmaceutics 14, 1710 (2022)
work page 2022
- [56]
-
[57]
NaderiAlizadeh, N. & Singh, R. Aggregating residue-level protein language model embeddings with optimal transport. Bioinformatics Advances 5, vbaf060 (2024)
work page 2024
-
[58]
Anowar, F., Sadaoui, S. & Selim, B. Conceptual and empirical comparison of dimensionality reduction algorithms (PCA, KPCA, LDA, MDS, SVD, LLE, ISOMAP, LE, ICA, t-SNE). Computer Science Review 40, 100378 (2021)
work page 2021
-
[59]
Steinegger, M. & Söding, J. Clustering huge protein sequence sets in linear time. Nat Commun 9, 2542 (2018)
work page 2018
-
[60]
The Hugging Face Community. Transformers: State-of-the-Art Natural Language Processing. (2019)
work page 2019
-
[61]
Pedregosa, F. et al. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)
work page 2011
-
[62]
Kornblith, S., Norouzi, M., Lee, H. & Hinton, G. Similarity of Neural Network Representations Revisited. Preprint at https://doi.org/10.48550/arXiv.1905.00414 (2019)
work page arXiv doi:10.48550/arXiv.1905.00414 2019
-
[63]
Schönemann, P. H. A Generalized Solution of the Orthogonal Procrustes Problem. Psychometrika 31, 1–10 (1966)
Funding: This work was supported by FCT - Fundação para a Ciência e Tecnologia, I.P. under the LASIGE Research Unit, ref. UID/00408/2025, DOI https://doi.org/10.54499/UID/00408/2025, and partially supported by project 41, HfPT: Health from Portu...