arxiv: 2606.12838 · v1 · pith:MVTAGSH3new · submitted 2026-06-11 · 🧬 q-bio.QM · cs.AI · cs.LG · q-bio.GN

OCOO-T: A Simple and Scalable Virtual Cell Model for Transcriptional Perturbation Response Prediction

Pith reviewed 2026-06-27 05:20 UTC · model grok-4.3

classification 🧬 q-bio.QM · cs.AI · cs.LG · q-bio.GN
keywords virtual cell · transcriptional perturbation · flow matching · transformer · single-cell omics · perturbation response · drug discovery · gene expression prediction

The pith

A vanilla Transformer with flow-matching and adaptive normalization predicts single-cell transcriptional responses to perturbations at state-of-the-art accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that a minimalist flow-matching model built on a standard Transformer stack can predict how cells change their gene expression after genetic, chemical, or cytokine perturbations. It does so by treating the response as a continuous denoising process and injecting perturbation type, dosage, and cell identity only through adaptive layer normalization plus in-context tokens. This approach avoids the auxiliary encoders, hierarchical VAEs, or gene-interaction graphs used in prior work. A sympathetic reader would care because simpler architectures could make large-scale virtual-cell simulations practical for drug discovery and regulatory-network inference. Evaluations on Tahoe100M, Replogle, and PBMC data show the model matches or exceeds existing methods while scaling to long expression profiles via patching.

Core claim

OCOO-T formulates transcriptional perturbation response prediction as a continuous-time flow-matching denoising task performed by a vanilla Transformer that operates directly on continuous gene-expression vectors; perturbation embeddings, dosage, and cell specificity are supplied solely through adaptive layer normalization and in-context tokens, enabling state-of-the-art accuracy across diverse perturbations and cell types on Tahoe100M, Replogle, and PBMC benchmarks together with linear scaling to long profiles through patching and depatching.
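
To make the flow-matching formulation concrete, here is a minimal sketch of how a single training target could be built. It assumes the common straight-line interpolation path between a Gaussian noise sample and a perturbed expression vector, with a velocity-prediction loss. This summary does not state the path, the noise distribution, or the loss used in the paper, so `interpolate`, `velocity_target`, `fm_loss`, and the toy vectors are illustrative assumptions, not the authors' implementation.

```python
import random

def interpolate(x0, x1, t):
    """Point on the straight path from noise x0 (t=0) to data x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]

def velocity_target(x0, x1):
    """Constant velocity of the straight path: d/dt x_t = x1 - x0."""
    return [b - a for a, b in zip(x0, x1)]

def fm_loss(pred_velocity, x0, x1):
    """Mean squared error between a predicted velocity and the target."""
    target = velocity_target(x0, x1)
    return sum((p - v) ** 2 for p, v in zip(pred_velocity, target)) / len(target)

rng = random.Random(0)
x1 = [0.0, 2.5, 1.0, 0.3]               # toy perturbed expression vector (assumed)
x0 = [rng.gauss(0.0, 1.0) for _ in x1]  # Gaussian noise sample (assumed prior)
t = rng.random()                        # time drawn uniformly from [0, 1)
x_t = interpolate(x0, x1, t)            # noisy input a denoiser would receive

# A perfect predictor outputs exactly the target velocity, giving zero loss.
loss = fm_loss(velocity_target(x0, x1), x0, x1)
```

In a real model the predicted velocity would come from the Transformer evaluated at `x_t`, `t`, and the conditioning signals; the sketch only shows how the input and target are paired.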

What carries the argument

Vanilla Transformer stack performing flow-matching denoising on continuous gene-expression profiles, conditioned by adaptive layer normalization and in-context tokens.
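
The conditioning mechanism named here can be illustrated with a small sketch of adaptive layer normalization, in which a conditioning vector is projected to a per-feature scale and shift that modulate the normalized activations. The `(1 + scale) * norm(x) + shift` form, the use of mean/variance normalization, and every name below (`ada_layer_norm`, `w_scale`, `cond`, and so on) are assumptions made for illustration; the paper's exact parameterization is not given in this summary.

```python
import math

def layer_norm(x, eps=1e-6):
    """Normalize a feature vector to zero mean and unit variance (no learned affine)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def linear(weights, bias, c):
    """Dense layer: one output per weight row."""
    return [sum(w * ci for w, ci in zip(row, c)) + b for row, b in zip(weights, bias)]

def ada_layer_norm(x, cond, w_scale, b_scale, w_shift, b_shift):
    """Modulate normalized features with a scale and shift predicted from cond."""
    scale = linear(w_scale, b_scale, cond)
    shift = linear(w_shift, b_shift, cond)
    return [(1.0 + s) * h + b for h, s, b in zip(layer_norm(x), scale, shift)]

x = [0.5, -1.0, 2.0, 0.0]  # one token's hidden features (toy values)
cond = [1.0, 0.0]          # hypothetical conditioning vector (e.g. perturbation, dose)
zeros_w = [[0.0, 0.0] for _ in x]
zeros_b = [0.0 for _ in x]

# With zero projections the block reduces to plain layer normalization.
out = ada_layer_norm(x, cond, zeros_w, zeros_b, zeros_w, zeros_b)
```

Starting from zero projections means the conditioning initially has no effect and its influence is learned during training, which is one reason this style of modulation is popular in conditional denoisers.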

If this is right

  • The model scales linearly to full-length transcriptional profiles by patching and depatching cellular contexts.
  • Performance remains competitive across genetic, chemical, and cytokine perturbations as well as multiple cell types.
  • Architectural complexity can be reduced while preserving or improving accuracy on existing single-cell perturbation benchmarks.
  • In-silico cellular simulation becomes feasible at larger scale because the design avoids dedicated encoder-decoder modules.
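
The patching and depatching mentioned in the first bullet can be sketched as follows. This is a minimal illustration assuming contiguous, non-overlapping patches along the gene dimension; the function names are hypothetical, and the learned projection from each patch to a token embedding is omitted.

```python
def patchify(expr, patch_size):
    """Split a length-G expression vector into G / patch_size contiguous patches."""
    if len(expr) % patch_size != 0:
        raise ValueError("gene count must be divisible by patch_size (pad first)")
    return [expr[i:i + patch_size] for i in range(0, len(expr), patch_size)]

def depatchify(patches):
    """Concatenate patches back into a flat expression vector."""
    return [value for patch in patches for value in patch]

expr = [float(g) for g in range(64)]   # toy profile with 64 genes
tokens = patchify(expr, patch_size=8)  # 8 tokens of 8 genes each
restored = depatchify(tokens)          # round trip recovers the profile
```

With patch size p the token count falls from G to G/p, so the self-attention cost, which is quadratic in token count, drops by roughly a factor of p squared.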

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the minimalist conditioning proves sufficient, explicit gene-interaction graphs may be unnecessary for many perturbation-prediction tasks.
  • The same patching strategy could be tested on other high-dimensional single-cell modalities such as chromatin accessibility or protein abundance.
  • Training cost and iteration speed for virtual-cell models would drop if the vanilla-Transformer baseline continues to match specialized architectures.

Load-bearing premise

That perturbation type, dosage, and cell identity supplied only through adaptive layer normalization and in-context tokens are sufficient to capture relevant biological response dynamics without gene-interaction priors or hierarchical encoders.

What would settle it

A new benchmark dataset containing strong, previously unseen gene-regulatory interactions where any method that explicitly encodes those interactions significantly outperforms OCOO-T on held-out perturbations.

Figures

Figures reproduced from arXiv: 2606.12838 by Danning Jiang, Lipeng Lai, Yalong Zhao, Zheming An.

Figure 1
Figure 1: Overview of OCOO-T. A continuous expression profile is denoised by Transformer blocks conditioned on perturbation identity and cellular context.
  • SwiGLU FFN. OCOO-T replaces the standard ReLU feed-forward network with a SwiGLU feed-forward layer. The gated activation SiLU(W1x) ⊙ W2x improves the expressivity of the Transformer block while preserving a simple and scalable architecture.
  • RMSNorm. OCOO-T u…
Figure 2
Figure 2: Control cells (xc) are injected along with the input for perturbation prediction.
  • Genetic perturbations in the Replogle-Nadig benchmark share the same embedding space as gene tokens; in our experiments, these gene embeddings are initialized from ESM2 representations.
  • Cytokine stimulations in the PBMC benchmark are represented using ESM2 protein embeddings. For protein complexes, the final representatio…
Figure 3
Figure 3: Patching enables the modeling of long-panel genes.
Directly applying self-attention to full transcriptomic profiles is computationally expensive, because the cost of self-attention grows quadratically with sequence length, and the sequence length itself scales with the number of modeled genes. To make OCOO-T applicable to long gene panels, we adopt a simple patching and depatching strategy along the gene d…
Figure 4
Figure 4: Visualization of multi-dimensional performance across benchmarks.
We compare OCOO-T with the following systems:
  • PerturbDiff is a conditional diffusion model that generates the distribution of perturbed single-cell transcriptomes by denoising from a reference control cell population [9]. Two variants are presented: PerturbDiff (Scratch) is trained end-to-end using the perturbation data, while PerturbDiff …
Figure 5
Figure 5: Comparison of the benchmark results between v-prediction and x-prediction under different patch sizes. vpred/xpred, v/x-prediction; p8/16/32, patch size 8/16/32.
4.4 Cellular Context Conditioning: Covariate Embeddings vs. Mean Control-Cell Profiles
Cellular context is a central conditioning signal for perturbation response prediction, because the same perturbation can induce substantially different transcr…
Figure 6
Figure 6: Comparisons of different cellular context injection methods on the Replogle-Nadig benchmark. S1: cell-line embeddings; S2–S7: mean control-cell profiles with set sizes 1, 4, 8, 16, 32, and 64, respectively.
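
The "mean control-cell profile" settings compared in Figure 6 can be pictured with a short sketch: draw a set of control cells of a given size and average their expression gene by gene. Uniform sampling without replacement, the toy data, and the name `mean_control_profile` are assumptions for illustration only; the paper's sampling procedure is not described in this summary.

```python
import random

def mean_control_profile(control_cells, set_size, rng):
    """Average the expression of `set_size` control cells sampled without replacement."""
    sampled = rng.sample(control_cells, set_size)
    n_genes = len(sampled[0])
    return [sum(cell[g] for cell in sampled) / set_size for g in range(n_genes)]

rng = random.Random(0)
# Toy control population: 64 cells x 5 genes (values are made up).
controls = [[rng.random() for _ in range(5)] for _ in range(64)]

# One context profile per set size listed in the Figure 6 caption.
profiles = {k: mean_control_profile(controls, k, rng) for k in (1, 4, 8, 16, 32, 64)}
```

A set size of 1 passes a single control cell through unchanged, while larger sets give a smoother estimate of the unperturbed state at the cost of averaging away cell-to-cell variation.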
Original abstract

Predicting single-cell transcriptional responses to genetic, chemical and cytokine perturbations is a fundamental challenge in computational biology and AI Virtual Cell (AIVC) modeling, with direct implications for drug discovery and the elucidation of gene regulatory networks. Existing approaches often rely on auxiliary cell-state encoders, hierarchical variational autoencoders, dedicated Transformer encoder-decoder modules, or gene-interaction priors to compress high-dimensional expression profiles into latent representations. While effective, these designs increase architectural complexity and may limit scalability and generalizability. This paper introduces OCOO-T, a minimalist flow-matching-based AIVC model for transcriptional perturbation response prediction. OCOO-T utilizes a vanilla Transformer stack that operates directly on continuous gene expression profiles and formulates perturbation response prediction as a continuous-time denoising process. Perturbation embeddings, dosage information, and cell-line/cell-type specificity are integrated through adaptive layer normalization and in-context tokens. Comprehensive evaluations on Tahoe100M, Replogle, and PBMC benchmarks demonstrate that OCOO-T achieves state-of-the-art performance across diverse perturbations and cell types while effectively scaling to long transcriptional profiles through patching and depatching of cellular contexts. By leveraging the simplicity of Transformer-based denoising for single-cell omics, OCOO-T provides an effective and scalable framework for in-silico cellular simulation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper introduces OCOO-T, a minimalist flow-matching-based virtual cell model that uses a vanilla Transformer operating directly on continuous gene expression profiles to predict single-cell transcriptional responses to genetic, chemical, and cytokine perturbations. Perturbation embeddings, dosage, and cell specificity are incorporated via adaptive layer normalization and in-context tokens, with patching/depatching for scalability to long profiles. The central claim is that this simple architecture achieves state-of-the-art performance on the Tahoe100M, Replogle, and PBMC benchmarks across diverse perturbations and cell types.

Significance. If the performance claims hold with proper validation, this would be significant for AIVC modeling by showing that standard flow-matching and Transformer components can suffice without auxiliary encoders, hierarchical VAEs, or gene-interaction priors, potentially improving scalability and reproducibility. The emphasis on a parameter-light design using established techniques is a strength for the field.

major comments (1)
  1. [Abstract] The assertion of state-of-the-art performance on the Tahoe100M, Replogle, and PBMC benchmarks comes with no quantitative metrics, baseline details, error analysis, or statistical comparisons. Because this assertion is load-bearing for the central empirical claim, the omission prevents verification of the reported improvements.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their review and constructive comment. We address the concern about the abstract below and will incorporate the suggested changes in the revised manuscript.

Point-by-point responses
  1. Referee: [Abstract] The assertion of state-of-the-art performance on the Tahoe100M, Replogle, and PBMC benchmarks comes with no quantitative metrics, baseline details, error analysis, or statistical comparisons. Because this assertion is load-bearing for the central empirical claim, the omission prevents verification of the reported improvements.

    Authors: We agree that the abstract would be strengthened by including key quantitative metrics to support the SOTA claim. In the revised version, we will add concise performance highlights (e.g., primary metrics and baseline comparisons on each benchmark) drawn directly from the results tables, while preserving the abstract's length and readability. This addresses the verification concern without altering the manuscript's core claims.

    Revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

Full rationale

The paper introduces OCOO-T as a minimalist flow-matching Transformer model for perturbation response prediction, relying on standard components (vanilla Transformer, adaptive layer norm, in-context tokens, patching) and reports empirical SOTA results on external benchmarks (Tahoe100M, Replogle, PBMC). No equations, derivations, or predictions are presented that reduce by construction to fitted parameters or self-defined quantities. No load-bearing self-citations, uniqueness theorems, or ansatzes imported from prior author work are invoked; the central claims rest on benchmark performance rather than internal definitional closure. This is a standard empirical modeling paper with no circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract supplies no explicit free parameters, axioms, or invented entities; the model appears to rest on standard assumptions of flow-matching and Transformer architectures without additional ad-hoc constructs.

pith-pipeline@v0.9.1-grok · 5779 in / 994 out tokens · 15605 ms · 2026-06-27T05:20:07.440192+00:00 · methodology

Reference graph

Works this paper leans on

30 extracted references · 21 canonical work pages · 1 internal anchor

  1. [1]

    Haotian Cui et al. scGPT: toward building a foundation model for single-cell multi-omics using generative AI. Nature Methods, 21(8):1470–1480, 2024. doi:10.1038/s41592-024-02201-0

  2. [2]

    Anish K. Adduri et al. Predicting cellular responses to perturbation across diverse contexts with State. bioRxiv, 2025. doi:10.1101/2025.06.26.661135

  3. [3]

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising Diffusion Probabilistic Models. In Advances in Neural Information Processing Systems (NeurIPS), 2020

  4. [4]

    Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow Matching for Generative Modeling. In International Conference on Learning Representations (ICLR), 2023

  5. [5]

    Sheng He et al. Squidiff: predicting cellular development and responses to perturbations using a diffusion model. Nature Methods, 2025. doi:10.1038/s41592-025-02877-y

  6. [6]

    Dominik Klein et al. CellFlow Enables Generative Single-Cell Phenotype Modeling with Flow Matching. bioRxiv, 2025. doi:10.1101/2025.04.11.648220

  7. [7]

    Zhaokang Liang, Shuyang Zhuang, Xiaoran Jiao, Weian Mao, Hao Chen, and Chunhua Shen. scPPDM: A Diffusion Model for Single-Cell Drug-Response Prediction. arXiv preprint arXiv:2510.11726, 2025

  8. [8]

    Chenglei Yu, Chuanrui Wang, Bangyan Liao, and Tailin Wu. scDFM: Distributional Flow Matching Model for Robust Single-Cell Perturbation Prediction. arXiv preprint arXiv:2602.07103, 2026

  9. [9]

    Xinyu Yuan, Xixian Liu, Ya Shi Zhang, Zuobai Zhang, Hongyu Guo, and Jian Tang. PerturbDiff: Functional Diffusion for Single-Cell Perturbation Modeling. arXiv preprint arXiv:2602.19685, 2026

  10. [10]

    Sashank J. Reddi, Aaditya Ramdas, Barnabás Póczos, Aarti Singh, and Larry Wasserman. On the Decreasing Power of Kernel and Distance Based Nonparametric Hypothesis Tests in High Dimensions. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence (AAAI), 2015

  11. [11]

    Parse Biosciences. 10 Million Human PBMCs in a Single Experiment. Dataset resource, 2023

  12. [12]

    Jesse Zhang et al. Tahoe-100M: A Giga-Scale Single-Cell Perturbation Atlas for Context-Dependent Gene Function and Cellular Modeling. bioRxiv, 2025. doi:10.1101/2025.02.20.639398

  13. [13]

    Ajay Nadig, Joseph M. Replogle, Alexander N. Pogson, et al. Transcriptome-wide analysis of differential expression in perturbation atlases. Nature Genetics, 2025. doi:10.1038/s41588-025-02169-3

  14. [14]

    Mohammad Lotfollahi, F. Alexander Wolf, and Fabian J. Theis. scGen predicts single-cell perturbation responses. Nature Methods, 16(8):715–721, 2019. doi:10.1038/s41592-019-0494-8

  15. [15]

    Mohammad Lotfollahi et al. Predicting cellular responses to complex perturbations in high-throughput screens. Molecular Systems Biology, 19(6):e11517, 2023. doi:10.15252/msb.202211517

  16. [16]

    Leon Hetzel, Simon Böhm, Niki Kilbertus, Stephan Günnemann, Mohammad Lotfollahi, and Fabian J. Theis. Predicting cellular responses to novel drug perturbations at a single-cell resolution. In Advances in Neural Information Processing Systems (NeurIPS), 2022

  17. [17]

    Yusuf H. Roohani, Kexin Huang, and Jure Leskovec. Predicting transcriptional outcomes of novel multigene perturbations with GEARS. Nature Biotechnology, 42:927–935, 2024. doi:10.1038/s41587-023-01905-6

  18. [18]

    Minsheng Hao et al. Large-scale foundation model on single-cell transcriptomics. Nature Methods, 21(8):1481–1491, 2024. doi:10.1038/s41592-024-02305-7

  19. [19]

    Chloe Wang, Mehran Karimzadeh, Neal G. Ravindra, Lexi R. Bounds, et al. X-Cell: Scaling Causal Perturbation Prediction Across Diverse Cellular Contexts via Diffusion Language Models. bioRxiv. doi:10.64898/2026.03.18.712807

  20. [20]

    Jonathan M. Mudge et al. GENCODE 2025: reference gene annotation for human and mouse. Nucleic Acids Research, 53(D1):D966–D975, 2025

  21. [21]

    Ding Bai et al. scLong: a billion-parameter foundation model for capturing long-range gene context in single-cell transcriptomics. Nature Communications, 17:2380, 2026. doi:10.1038/s41467-026-69102-y

  22. [22]

    Tianhong Li and Kaiming He. Back to Basics: Let Denoising Generative Models Denoise. arXiv preprint arXiv:2511.13720, 2025

  23. [23]

    Yusuf H. Roohani, Tony J. Hua, Po-Yuan Tung, Lexi R. Bounds, et al. Virtual Cell Challenge: Toward a Turing Test for the Virtual Cell. Cell, 188(13):3370–3374, 2025. doi:10.1016/j.cell.2025.06.008

  24. [24]

    Luke A. Gilbert et al. CRISPR-mediated modular RNA-guided regulation of transcription in eukaryotes. Cell, 154(2):442–451, 2013. doi:10.1016/j.cell.2013.06.044

  25. [25]

    Thomas M. Norman et al. Exploring genetic interaction manifolds constructed from rich single-cell phenotypes. Science, 365(6455):786–793, 2019. doi:10.1126/science.aax4438

  26. [26]

    Arc Institute. cell-eval: Comprehensive suite for evaluating perturbation prediction models. GitHub repository, 2026

  27. [27]

    Constantin Ahlmann-Eltze, Wolfgang Huber, and Simon Anders. Deep-learning-based gene perturbation effect prediction does not yet outperform simple linear baselines. Nature Methods, 22(8):1657–1661. doi:10.1038/s41592-025-02772-6

  28. [28]

    Eli Kernfeld, Yanyu Yang, Joshua S. Weinstock, Alexander Battle, and Patrick Cahan. A comparison of computational methods for expression forecasting. Genome Biology, 26:388, 2025. doi:10.1186/s13059-025-03840-y

Appendix A Training Details

Backbone of the denoiser model is a 12-layer Transformer with hidden size 768 and 12 attention heads (head dimensi...