Recognition: no theorem link
ProtSent: Protein Sentence Transformers
Pith reviewed 2026-05-11 01:02 UTC · model grok-4.3
The pith
Contrastive fine-tuning on protein pairs turns standard protein language models into general embedding models that better capture functional and structural similarities.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Applying contrastive fine-tuning with MultipleNegativesRankingLoss to protein language models on five protein-pair datasets restructures the embedding space so that proteins sharing function or structure lie closer together. When the fine-tuned models are frozen and evaluated with a k-nearest-neighbor probe across 23 downstream tasks, the 150M-parameter version improves 15 tasks, including a 105 percent gain on remote homology detection, a 17 percent gain on variant effect prediction, and a 19.9 percent increase in Recall@1 on SCOPe-40 structural retrieval. The 35M-parameter version improves 16 tasks, with a 40.5 percent gain on remote homology and a 15.5 percent increase in Recall@1 on the same benchmark.
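A minimal sketch of such a probe on precomputed frozen embeddings, using scikit-learn; the cosine metric and k value are illustrative assumptions, not the paper's stated protocol:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

def knn_probe(train_emb, y_train, test_emb, y_test, k=5):
    """Fit kNN on frozen, mean-pooled protein embeddings. Nothing in the
    encoder is trained, so the score reflects neighborhood quality alone."""
    clf = KNeighborsClassifier(n_neighbors=k, metric="cosine")
    clf.fit(np.asarray(train_emb), np.asarray(y_train))
    return accuracy_score(y_test, clf.predict(np.asarray(test_emb)))
```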
What carries the argument
Contrastive fine-tuning via MultipleNegativesRankingLoss on five curated protein-pair datasets, which supply positive and negative pairs that pull similar protein vectors together and push dissimilar ones apart.
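A minimal PyTorch sketch of this objective, assuming the in-batch-negatives formulation used by sentence-transformers' MultipleNegativesRankingLoss (the similarity scale is an illustrative default):

```python
import torch
import torch.nn.functional as F

def multiple_negatives_ranking_loss(anchors, positives, scale=20.0):
    """For each anchor i, positives[i] is the target and every other
    positive in the batch serves as an in-batch negative."""
    a = F.normalize(anchors, dim=-1)                   # (B, D) anchor embeddings
    p = F.normalize(positives, dim=-1)                 # (B, D) positive embeddings
    scores = a @ p.T * scale                           # (B, B) scaled cosine similarities
    labels = torch.arange(a.size(0), device=a.device)  # matching pair lies on the diagonal
    return F.cross_entropy(scores, labels)             # pulls pairs together, pushes others apart
```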
If this is right
- The restructured embeddings improve protein similarity search and clustering without task-specific retraining.
- Remote homology detection and structural retrieval become substantially more accurate under nearest-neighbor lookup.
- Variant effect prediction benefits from the same general embeddings, reducing the need for separate supervised models.
- Smaller 35M-parameter models achieve comparable broad gains, lowering computational cost for deployment.
Where Pith is reading between the lines
- The same contrastive recipe could be applied to other biological sequence models such as those for DNA or RNA to produce improved general embeddings.
- Curating additional high-quality pair datasets may further enlarge the set of tasks that benefit from this form of training.
- These general embeddings could serve as a stronger starting point for subsequent task-specific fine-tuning heads.
Load-bearing premise
The five chosen protein-pair datasets supply unbiased, generalizable signals of functional and structural similarity that transfer to the 23 evaluation tasks when quality is measured only by k-nearest-neighbor probe performance.
What would settle it
Replacing the five curated pair datasets with randomly sampled protein pairs during fine-tuning would settle it: if the gains on the 23 tasks persist under random pairing, the claim that the specific supervision drives the improvement is falsified; if the gains largely disappear, it is supported.
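A minimal sketch of that control, assuming a uniform sampling scheme (the paper does not specify one):

```python
import random

def random_pair_control(sequences, n_pairs, seed=0):
    """Replace the curated positives with uniformly random 'positive' pairs.
    If fine-tuning on these still reproduces the gains, the curated
    supervision is not what drives the improvement."""
    rng = random.Random(seed)
    return [tuple(rng.sample(sequences, 2)) for _ in range(n_pairs)]
```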
Original abstract
Protein language models (pLMs) produce per-residue representations that capture evolutionary and structural information, yet their mean-pooled sequence embeddings are not explicitly trained to reflect functional, evolutionary or structural similarity between proteins. We present Protein Sentence Transformers (ProtSent), a contrastive fine-tuning framework for adapting pLMs into general-purpose embedding models. ProtSent trains with MultipleNegativesRankingLoss across five protein-pair datasets: Pfam families, structurally derived hard negatives, AlphaFold DB structural pairs, StringDB protein–protein interactions, and Deep Mutational Scanning data. We evaluate on 23 downstream tasks using frozen embeddings with a k-nearest-neighbor probe to measure embedding neighborhood quality. On ESM-2 150M, ProtSent improves 15 of 23 tasks, with gains of +105% on remote homology detection, +17% on variant effect prediction, and +19.9% Recall@1 on SCOPe-40 structural retrieval. The 35M variant improves 16 of 23 tasks with +40.5% on remote homology and +15.5% Recall@1 on SCOPe-40. Contrastive fine-tuning restructures the embedding space to better capture protein function and structure, without any task-specific supervision. We release the models, public data, training recipe, and code.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. ProtSent introduces a contrastive fine-tuning framework that adapts protein language models (e.g., ESM-2 35M and 150M) into general-purpose sequence embedding models. It trains with MultipleNegativesRankingLoss on five protein-pair datasets (Pfam families, structurally derived hard negatives, AlphaFold DB structural pairs, StringDB interactions, and Deep Mutational Scanning data) and evaluates the frozen embeddings on 23 downstream tasks using a k-nearest-neighbor probe. The paper reports that ProtSent improves performance on 15 of 23 tasks for the 150M model (including +105% on remote homology detection, +17% on variant effect prediction, and +19.9% Recall@1 on SCOPe-40) and 16 of 23 tasks for the 35M model (including +40.5% on remote homology and +15.5% Recall@1 on SCOPe-40), claiming these gains arise from restructured embeddings without task-specific supervision. Models, data, and code are released.
Significance. If the reported gains reflect genuine improvements in embedding neighborhood structure rather than data leakage, this work would offer a practical, general-purpose method for enhancing pLM sequence representations for functional and structural similarity tasks. The explicit release of models, public data, training recipe, and code is a clear strength that supports reproducibility and community use.
Major comments (2)
- [Abstract / Evaluation] Abstract and Evaluation section: The central claim that contrastive training on the five pair datasets produces general-purpose embeddings (with no task-specific supervision) that transfer to the 23 tasks is load-bearing for the results, yet no sequence decontamination, family-level hold-out, or UniProt ID overlap analysis is described between the training sets (Pfam, AFDB, DMS) and the evaluation tasks (remote homology, SCOPe-40 retrieval, variant effect prediction). Pfam families align directly with homology detection, AFDB pairs with structural retrieval, and DMS with variant prediction; without explicit checks, the large gains (+105% homology, +17% variant effect) could arise from kNN retrieving training signals rather than restructured embeddings.
- [Methods / Experiments] Methods and Experiments: The choice of kNN as the sole probe for measuring embedding quality is central to all 23-task claims, but the manuscript provides no comparison to other probes (e.g., linear classifiers or MLP heads) that would isolate whether the neighborhood structure itself improved versus probe-specific effects. This is especially relevant given the free parameters in dataset selection and MultipleNegativesRankingLoss hyperparameters.
Minor comments (2)
- [Abstract] The abstract states '23 downstream tasks' without enumerating them or providing a reference table; including an explicit list or pointer to the benchmark sources would improve clarity and allow readers to assess coverage.
- Notation for protein-pair datasets (e.g., 'Protein--protein interactions') uses inconsistent dashes; standardize to 'protein-protein' throughout.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and the opportunity to strengthen the manuscript. The two major comments identify important gaps in our description of data hygiene and evaluation robustness. We address each point below and will incorporate the requested analyses and comparisons in the revised version.
Point-by-point responses
-
Referee: [Abstract / Evaluation] Abstract and Evaluation section: The central claim that contrastive training on the five pair datasets produces general-purpose embeddings (with no task-specific supervision) that transfer to the 23 tasks is load-bearing for the results, yet no sequence decontamination, family-level hold-out, or UniProt ID overlap analysis is described between the training sets (Pfam, AFDB, DMS) and the evaluation tasks (remote homology, SCOPe-40 retrieval, variant effect prediction). Pfam families align directly with homology detection, AFDB pairs with structural retrieval, and DMS with variant prediction; without explicit checks, the large gains (+105% homology, +17% variant effect) could arise from kNN retrieving training signals rather than restructured embeddings.
Authors: We agree that explicit decontamination checks are essential to support the claim of no task-specific supervision. In preparing the training sets we already applied family-level and sequence-level filters to avoid direct overlap with the evaluation benchmarks (e.g., excluding Pfam families present in the remote-homology test splits, removing sequences appearing in SCOPe-40 from the AFDB structural pairs, and holding out DMS variants used in the variant-effect tasks). These steps were performed but not described in sufficient detail. We will add a dedicated “Data Overlap Analysis” subsection that reports the overlap statistics, the filtering criteria, and the resulting performance numbers after removing any residual overlaps. This will confirm that the reported gains arise from restructured embeddings rather than leakage. revision: yes
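A first-pass sketch of such a check at the accession level (illustrative only; sequence-level decontamination with a tool such as MMseqs2 would still be required):

```python
def accession_overlap(train_ids, eval_ids):
    """Count shared UniProt accessions between a training pair set and an
    evaluation benchmark; a nonzero fraction flags potential leakage."""
    train, evals = set(train_ids), set(eval_ids)
    shared = train & evals
    return len(shared), len(shared) / max(len(evals), 1)
```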
-
Referee: [Methods / Experiments] Methods and Experiments: The choice of kNN as the sole probe for measuring embedding quality is central to all 23-task claims, but the manuscript provides no comparison to other probes (e.g., linear classifiers or MLP heads) that would isolate whether the neighborhood structure itself improved versus probe-specific effects. This is especially relevant given the free parameters in dataset selection and MultipleNegativesRankingLoss hyperparameters.
Authors: We chose the frozen kNN probe precisely because it directly measures the quality of the learned embedding neighborhoods without introducing any additional trainable parameters that could mask or inflate the effect of contrastive fine-tuning. Nevertheless, we acknowledge that a comparison against linear and MLP probes would strengthen the claim that the improvements are intrinsic to the embedding space. In the revised manuscript we will report results using both a linear classifier and a small MLP head trained on the same frozen embeddings for all 23 tasks. These additional experiments will show that the relative gains remain consistent across probes, supporting that the contrastive objective improved neighborhood structure rather than merely benefiting kNN. revision: yes
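For comparison, a linear probe over the same frozen embeddings might look like this sketch (solver settings are illustrative assumptions):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def linear_probe(train_emb, y_train, test_emb, y_test):
    """Train only a linear head on frozen embeddings. If model rankings
    agree with the kNN probe, the gain is a property of the embedding
    space rather than an artifact of one probe."""
    clf = LogisticRegression(max_iter=1000)
    clf.fit(train_emb, y_train)
    return accuracy_score(y_test, clf.predict(test_emb))
```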
Circularity Check
No circularity: empirical contrastive training on pair datasets evaluated via kNN on separate downstream benchmarks
Full rationale
The paper's central result is an empirical demonstration that contrastive fine-tuning on five protein-pair datasets (Pfam, AFDB, StringDB, DMS, structural negatives) produces embeddings that improve kNN performance on 23 held-out tasks. No derivation chain reduces to self-definition, fitted parameters renamed as predictions, or load-bearing self-citations. The abstract and method explicitly separate training supervision from evaluation tasks, with no equations or uniqueness theorems invoked that collapse the claim into its inputs by construction. Dataset overlap concerns affect generalizability but do not create definitional circularity.
Axiom & Free-Parameter Ledger
Free parameters (2)
- Selection of five protein-pair datasets
- MultipleNegativesRankingLoss hyperparameters
Axioms (1)
- Domain assumption: Contrastive loss on the chosen pairs will reorganize embeddings to reflect functional and structural similarity.
Reference graph
Works this paper leans on
-
[1]
FLIP: Benchmark tasks in fitness landscape inference for proteins
Christian Dallago, Jody Mou, Kadina E. Johnston, Bruce J. Wittmann, et al. FLIP: Benchmark tasks in fitness landscape inference for proteins. bioRxiv,
2021
-
[2]
Efficient Natural Language Response Suggestion for Smart Reply
Matthew Henderson, Rami Al-Rfou, Brian Strope, Yun-Hsuan Sung, László Lukács, Ruiqi Guo, Sanjiv Kumar, Balint Miklos, and Ray Kurzweil. Efficient natural language response suggestion for smart reply. arXiv preprint arXiv:1705.00652,
2017
-
[3]
DeepSol: a deep learning framework for sequence-based protein solubility prediction
Sameer Khurana, Reda Rawi, et al. DeepSol: a deep learning framework for sequence-based protein solubility prediction. Bioinformatics,
2018
-
[4]
Pfam: The protein families database in 2021
Jaina Mistry, Sara Chuguransky, Lowri Williams, Matloob Qureshi, Gustavo A. Salazar, Erik L. L. Sonnhammer, Silvio C. E. Tosatto, Lisanna Paladin, Shriya Raj, Lorna J. Richardson, et al. Pfam: The protein families database in 2021. Nucleic Acids Research, 49(D1),
2021
-
[5]
Evaluating protein transfer learning with TAPE
Roshan Rao, Nicholas Bhattacharya, et al. Evaluating protein transfer learning with TAPE. Advances in Neural Information Processing Systems, 32,
2019
-
[6]
Optimizing protein language models with Sentence Transformers
Istvan Redl, Rajesh Lunkad, Carlo Genis-Chalamanch, Sandro Bottaro, Hugo Penedones, and Olivier Michielin. Optimizing protein language models with Sentence Transformers. In NeurIPS 2023 Workshop on Machine Learning for Structural Biology,
2023
-
[7]
Sentence-BERT: Sentence embeddings using Siamese BERT-networks
Nils Reimers and Iryna Gurevych. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 3982–3992,
2019
-
[8]
Making monolingual sentence embeddings multilingual using knowledge distillation
Nils Reimers and Iryna Gurevych. Making monolingual sentence embeddings multilingual using knowledge distillation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4512–4525,
2020
-
[9]
MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets
Martin Steinegger and Johannes Söding. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nature Biotechnology, 35(11):1026–1028, 2017. doi: 10.1038/nbt.3988.
Jianlin Su. CoSENT: A more efficient sentence vector scheme than Sentence-BERT. Blog post.
-
[10]
The STRING database in 2023
Damian Szklarczyk, Rebecca Kirsch, Mikaela Koutrouli, Katerina Nastou, Farrokh Mehryary, Radja Hachilif, Annika L. Gable, Tao Fang, Nadezhda T. Doncheva, Sampo Pyysalo, et al. The STRING database in 2023: protein–protein association networks and functional enrichment analyses for any sequenced genome of interest. Nucleic Acids Research, 51(D1):D638–D646,
2023
-
[11]
SignalP 6.0 predicts all five types of signal peptides using protein language models
Felix Teufel, José Juan Almagro Armenteros, Alexander Rosenberg Johansen, Magnús Halldór Gíslason, Silas Irby Piber, Konstantinos D. Tsirigos, Ole Winther, Søren Brunak, Gunnar von Heijne, and Henrik Nielsen. SignalP 6.0 predicts all five types of signal peptides using protein language models. Nature Biotechnology,
2022
-
[12]
Transformers: State-of-the-art natural language processing
Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, et al. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45,
2020
-
[13]
Additional training details (appendix excerpt)
Training was conducted on NVIDIA RTX 6000 Ada GPUs (48 GB VRAM) on an HPC cluster. The 35M model trains in approximately 3–4 hours; the 150M model trains in approximately 1.3 days. Training uses the SentenceTransformers library [Reimers and Gurevych, 2019] built on top of HuggingFace Transformers [Wolf et al., 2020].
-
[14]
Table 6: Training hyperparameters for both model scales (ESM-2 35M / ESM-2 150M).
- Per-device batch size: 64 / 16
- Gradient accumulation: 16 / 64
- Effective batch size: 1024 / 1024
- Learning rate: 3×10⁻⁴ / 2×10⁻⁴
- Warmup steps: 500 / 1000
- LR scheduler: cosine with min LR
- Max sequence length: 512
- Optimizer: AdamW (fused)
- Dropout: 0.1
- Max training pairs: 70M
- Epochs: 1
- Mu…
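A minimal sketch of how this recipe could be assembled with the SentenceTransformers library, assuming mean pooling over the public ESM-2 35M checkpoint and the Table 6 values; this is an illustration, not the authors' released code:

```python
from sentence_transformers import SentenceTransformer, InputExample, losses, models
from torch.utils.data import DataLoader

# Wrap a frozen-initialized ESM-2 encoder with mean pooling; the checkpoint
# name and pooling choice are assumptions, not confirmed by the paper.
word = models.Transformer("facebook/esm2_t12_35M_UR50D", max_seq_length=512)
pool = models.Pooling(word.get_word_embedding_dimension(), pooling_mode="mean")
model = SentenceTransformer(modules=[word, pool])

# Toy (anchor, positive) pair; the real recipe draws from five pair datasets.
train_examples = [InputExample(texts=["MKTAYIAKQR", "MKTAYIAKQL"])]
loader = DataLoader(train_examples, shuffle=True, batch_size=64)
loss = losses.MultipleNegativesRankingLoss(model)

model.fit(
    train_objectives=[(loader, loss)],
    epochs=1,
    warmup_steps=500,
    optimizer_params={"lr": 3e-4},  # Table 6 value for the 35M model
)
```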