scpFormer: A Foundation Model for Unified Representation and Integration of the Single-Cell Proteomics
Pith reviewed 2026-05-10 00:17 UTC · model grok-4.3
The pith
scpFormer unifies single-cell proteomic representations across variable antibody panels using continuous sequence-anchored tokenization.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
scpFormer is a transformer-based foundation model pre-trained on over 390 million cells. It replaces standard index-based tokenization with a continuous, sequence-anchored approach that combines Evolutionary Scale Modeling (ESM) with value-aware expression embeddings to dynamically map variable panels into a shared semantic space without artificial discretization. This generates global cell representations that perform competitively in large-scale batch integration and unsupervised clustering. Its open-vocabulary architecture facilitates in silico panel expansion to assist the reconstruction of biological manifolds in sparse clinical datasets, and the learned protein co-expression logic transfers to bulk-omics tasks such as cancer drug response prediction.
What carries the argument
Continuous sequence-anchored tokenization, combined with ESM-value embeddings, which maps variable antibody panels into a shared semantic space.
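To make the mechanism concrete, here is a minimal sketch of what a continuous, sequence-anchored token could look like. It is not the authors' implementation. The function names, the toy embedding width `DIM`, the hash-seeded stand-in for an ESM embedding, and the sinusoidal value encoding are all assumptions made for illustration; the source only states that ESM representations are combined with value-aware expression embeddings without discretization.

```python
import hashlib
import math
import random

DIM = 8  # toy embedding width (assumption); real ESM embeddings are much wider


def protein_embedding(sequence, dim=DIM):
    """Hypothetical stand-in for an ESM embedding: a deterministic
    pseudo-random vector seeded by the amino-acid sequence string."""
    seed = int.from_bytes(hashlib.sha256(sequence.encode()).digest()[:8], "big")
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


def value_embedding(value, dim=DIM):
    """Continuous (non-binned) encoding of an expression value using
    sinusoidal features. This encoding is an assumed choice."""
    features = []
    for i in range(dim // 2):
        freq = 1.0 / (10.0 ** (2 * i / dim))
        features.append(math.sin(value * freq))
        features.append(math.cos(value * freq))
    return features


def tokenize_cell(panel, expression):
    """Build one token per measured marker by summing the sequence
    embedding and the value embedding. `panel` maps marker name to
    sequence; `expression` maps marker name to a continuous value."""
    tokens = []
    for marker, sequence in panel.items():
        p = protein_embedding(sequence)
        v = value_embedding(expression[marker])
        tokens.append([a + b for a, b in zip(p, v)])
    return tokens


# Two panels that share only one marker; sequences are placeholders.
panel_a = {"CD3": "ACDEFG", "CD4": "GHIKLM"}
panel_b = {"CD4": "GHIKLM", "CD19": "NPQRST"}
tokens_a = tokenize_cell(panel_a, {"CD3": 1.2, "CD4": 0.5})
tokens_b = tokenize_cell(panel_b, {"CD4": 0.5, "CD19": 2.0})
print(len(tokens_a), len(tokens_b), len(tokens_a[0]))
```

Because a token depends only on the protein sequence and the measured value, not on a position in a fixed vocabulary, the shared marker yields the same token in both panels and an unseen marker needs no new index.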
Load-bearing premise
Pre-training on 390 million cells with continuous sequence-anchored tokenization and ESM-value embeddings produces representations that generalize across arbitrary antibody panels and transfer meaningfully to bulk-omics without panel-specific retraining or adjustments.
What would settle it
Applying scpFormer to integration of single-cell proteomics datasets that use entirely novel antibody panels and finding that its batch correction or clustering metrics fall below those of standard panel-alignment methods.
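One way to score such a comparison is the adjusted Rand index (ARI) between predicted clusters and reference annotations, computed once for clusters derived from scpFormer embeddings and once for a panel-alignment baseline. The sketch below implements ARI with the standard library only; the label lists are toy values, not results from the paper.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two labelings of the same cells."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("labelings must have the same length")
    total_pairs = comb(len(labels_true), 2)
    if total_pairs == 0:  # fewer than two cells: nothing to disagree on
        return 1.0
    # Pairs of cells that are grouped together in both labelings.
    together_both = sum(
        comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values()
    )
    together_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    together_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = together_true * together_pred / total_pairs
    max_index = 0.5 * (together_true + together_pred)
    if max_index == expected:  # degenerate case, e.g. one cluster in both
        return 1.0
    return (together_both - expected) / (max_index - expected)


# Toy labels: the same partition under different cluster names scores 1.0.
reference = [0, 0, 1, 1]
print(adjusted_rand_index(reference, [1, 1, 0, 0]))
```

The claim would be in trouble if, on panels absent from pre-training, the baseline's score consistently exceeded scpFormer's across datasets and random seeds.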
Original abstract
The integration of single-cell proteomic data is often hindered by the fragmented nature of targeted antibody panels. To address this limitation, we introduce scpFormer, a transformer-based foundation model designed for single-cell proteomics. Pre-trained on over 390 million cells, scpFormer replaces standard index-based tokenization with a continuous, sequence-anchored approach. By combining Evolutionary Scale Modeling (ESM) with value-aware expression embeddings, it dynamically maps variable panels into a shared semantic space without artificial discretization. We demonstrate that scpFormer generates global cell representations that perform competitively in large-scale batch integration and unsupervised clustering. Moreover, its open-vocabulary architecture facilitates in silico panel expansion, assisting in the reconstruction of biological manifolds in sparse clinical datasets. Finally, this learned protein co-expression logic is transferable to bulk-omics tasks, supporting applications like cancer drug response prediction. scpFormer provides a versatile, panel-agnostic framework to facilitate scalable biomarker discovery and precision oncology.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces scpFormer, a transformer-based foundation model for single-cell proteomics pre-trained on over 390 million cells. It replaces index-based tokenization with a continuous sequence-anchored approach that combines ESM with value-aware expression embeddings to map variable antibody panels into a shared semantic space. The paper claims that the resulting global cell representations perform competitively in large-scale batch integration and unsupervised clustering, that the open-vocabulary design enables in silico panel expansion to reconstruct biological manifolds in sparse clinical datasets, and that the learned protein co-expression logic transfers to bulk-omics tasks such as cancer drug response prediction.
Significance. If the empirical claims are substantiated with rigorous benchmarks, scpFormer would represent a meaningful step toward panel-agnostic integration of single-cell proteomic data and cross-modality transfer, which could accelerate biomarker discovery and precision oncology applications. The scale of pre-training (390 million cells) and the attempt at continuous rather than discretized embeddings are positive technical features that distinguish it from prior index-based approaches.
major comments (3)
- [Abstract] The assertions that scpFormer 'perform competitively in large-scale batch integration and unsupervised clustering' and that 'this learned protein co-expression logic is transferable to bulk-omics tasks' are presented without any quantitative metrics, baselines, error bars, ablation results, or dataset identifiers. This absence prevents verification of the central performance and transferability claims.
- [Model Architecture] The description of the continuous sequence-anchored tokenization and ESM-value embeddings does not specify the handling of non-overlapping markers across panels or the normalization/alignment procedure required to map single-cell targeted proteomics values to bulk measurements. Without an explicit mechanism, the panel-agnostic and zero-shot transfer claims rest on an unverified assumption.
- [Results, transfer experiments] The claim that the model supports cancer drug response prediction via transfer to bulk-omics lacks reported performance numbers (e.g., AUC, Pearson correlation), comparison baselines (e.g., direct bulk-trained models or other single-cell transfer methods), and statistical tests. This is load-bearing for the transferability assertion.
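For readers unfamiliar with the two metrics requested in the last comment, the following sketch computes ROC AUC (for binary responder labels) and Pearson correlation (for continuous response values) using only the standard library. The function names and the toy inputs are illustrative and are not taken from the manuscript.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def roc_auc(labels, scores):
    """ROC AUC as the probability that a random positive is scored
    above a random negative; ties count as one half."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy inputs only; these are not values from the paper.
print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

Reporting these per fold, with the same folds used for each baseline, is what would let a reader attach error bars and a paired statistical test to the comparison.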
minor comments (1)
- [Abstract] The phrasing 'replaces standard index-based tokenization with a continuous, sequence-anchored approach' is repeated in slightly different wording later; standardize the terminology for clarity.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed comments, which have helped us identify areas where the manuscript can be strengthened for clarity and rigor. We address each major comment point by point below, indicating the revisions planned for the next version of the manuscript.
- Referee: [Abstract] The assertions that scpFormer 'perform competitively in large-scale batch integration and unsupervised clustering' and that 'this learned protein co-expression logic is transferable to bulk-omics tasks' are presented without any quantitative metrics, baselines, error bars, ablation results, or dataset identifiers. This absence prevents verification of the central performance and transferability claims.
  Authors: We agree that the abstract presents the claims at a high level without quantitative support, which limits immediate verifiability. The Results section of the full manuscript contains the supporting metrics (including integration scores, clustering ARI/NMI values, baselines, error bars, ablations, and dataset identifiers) as well as transfer performance details. Due to abstract length constraints, we prioritized conveying the overall contributions. In the revision, we will update the abstract to include a small number of key quantitative highlights (e.g., competitive clustering metrics and transfer AUC) drawn directly from the existing results, while maintaining readability.
  Revision: partial
- Referee: [Model Architecture] The description of the continuous sequence-anchored tokenization and ESM-value embeddings does not specify the handling of non-overlapping markers across panels or the normalization/alignment procedure required to map single-cell targeted proteomics values to bulk measurements. Without an explicit mechanism, the panel-agnostic and zero-shot transfer claims rest on an unverified assumption.
  Authors: We thank the referee for noting the need for greater explicitness. The continuous embedding strategy relies on ESM protein representations to place all markers (overlapping or not) into a shared semantic space, allowing the transformer to process variable panels without index collisions. For normalization and bulk alignment, per-panel standardization is applied before feeding values into the value-aware embedding layer, with a subsequent linear projection to match bulk scale distributions. We have expanded the Model Architecture section with a new subsection that formally describes these steps, including pseudocode for non-overlapping marker handling and the exact normalization/alignment pipeline used in the transfer experiments.
  Revision: yes
- Referee: [Results, transfer experiments] The claim that the model supports cancer drug response prediction via transfer to bulk-omics lacks reported performance numbers (e.g., AUC, Pearson correlation), comparison baselines (e.g., direct bulk-trained models or other single-cell transfer methods), and statistical tests. This is load-bearing for the transferability assertion.
  Authors: We acknowledge that the transfer subsection would benefit from more prominent and complete reporting of the quantitative results. The manuscript already contains AUC and correlation values for the drug-response task along with baseline comparisons, but these were not sufficiently highlighted or accompanied by error bars and statistical tests. In the revision, we will add a dedicated table summarizing AUC, Pearson r, direct bulk-trained baselines, other single-cell transfer methods, standard deviations across folds, and p-values from appropriate statistical tests, thereby making the evidence for transferability fully verifiable.
  Revision: yes
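The normalization step described in the second response (per-panel standardization followed by a linear mapping toward the bulk scale) might look roughly like the sketch below. This is an illustration under stated assumptions, not the authors' pipeline: the function names are hypothetical, and the "linear projection" is replaced here by a simple affine moment-matching map, whereas the manuscript may use a learned layer.

```python
from statistics import fmean, pstdev


def standardize_panel(cells):
    """Z-score each marker across the cells of a single panel.
    `cells` is a list of dicts mapping marker name to raw value;
    every cell in a panel is assumed to carry the same markers."""
    markers = list(cells[0].keys())
    stats = {}
    for m in markers:
        values = [c[m] for c in cells]
        sd = pstdev(values)
        # Guard against constant markers to avoid division by zero.
        stats[m] = (fmean(values), sd if sd > 0 else 1.0)
    return [
        {m: (c[m] - stats[m][0]) / stats[m][1] for m in markers} for c in cells
    ]


def match_bulk_scale(z_values, bulk_mean, bulk_std):
    """Affine map from z-scores to a target mean and standard deviation
    (assumed stand-in for the paper's linear projection)."""
    return [bulk_mean + bulk_std * z for z in z_values]


# Toy panel with two cells; values are illustrative only.
panel_cells = [{"CD4": 1.0, "CD8": 10.0}, {"CD4": 3.0, "CD8": 10.0}]
z = standardize_panel(panel_cells)
print(match_bulk_scale([c["CD4"] for c in z], bulk_mean=5.0, bulk_std=2.0))
```

Standardizing each panel separately means markers that never co-occur are still placed on a comparable scale before embedding, which is the property the referee asks the authors to spell out.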
Circularity Check
No circularity: empirical pre-training claims do not reduce to self-defined inputs
The paper describes a transformer-based foundation model pre-trained on 390 million cells using continuous sequence-anchored tokenization and ESM-value embeddings. All performance claims (batch integration, unsupervised clustering, in silico panel expansion, and transfer to bulk-omics) are presented as outcomes of this empirical pre-training and subsequent evaluations on downstream tasks. No equations, derivations, or first-principles results are introduced that would equate a claimed prediction to a fitted parameter or self-referential definition by construction. The architecture is described as panel-agnostic by design, but this is an empirical assertion supported by the pre-training corpus rather than a tautological reduction. No self-citation load-bearing steps, uniqueness theorems, or ansatz smuggling appear in the provided text. The derivation chain is therefore self-contained and non-circular.
Axiom & Free-Parameter Ledger
free parameters (1)
- Transformer model size and training hyperparameters
axioms (1)
- Domain assumption: Transformer architectures can extract meaningful co-expression logic from large unlabeled single-cell proteomic data.
[44]
21 6 Figures 22 Fig
Wang, F.et al.Spdb: a comprehensive resource and knowledgebase for proteomic data at the single-cell resolution.Nucleic acids research52, D562–D571 (2024). 21 6 Figures 22 Fig. 1:Overview of the scpFormer framework for single-cell proteomics. A, Large-scale data curation and pre-training corpus construction. Single-cell proteomics datasets were aggregated...
2024