PerturbCellRL: Verifier-Guided Reinforcement Learning for Single-Cell Perturbation Prediction
Pith reviewed 2026-06-29 05:09 UTC · model grok-4.3
The pith
PerturbCellRL post-trains a flow-matching generator with four cell-level verifiers as RL rewards to improve biological consistency of individual perturbation predictions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
PerturbCellRL frames trustworthy single-cell prediction as verifier-guided generative alignment, where a pretrained flow-matching generator is post-trained via RL so that individual generated cells satisfy cell-level verifiers for Pearson similarity, RMSE proximity, differential-expression Spearman rank, and pathway activity.
What carries the argument
Reinforcement learning post-training that treats four cell-level verifiers (Pearson top-k similarity, RMSE top-k proximity, DE Spearman, Pathway activity) as reward functions to align a pretrained flow-matching generator.
If this is right
- Improves over the pretrained flow-matching generator on reward-aligned evaluation metrics.
- Improves on a held-out evaluation metric.
- Remains competitive with state-of-the-art methods on population-level metrics.
- Moves single-cell perturbation modeling from distribution matching toward explicit per-cell biological consistency checks.
Where Pith is reading between the lines
- The same verifier-reward structure could be applied to other generative architectures beyond flow matching.
- Pathway-activity verifiers may transfer to new perturbation classes once the relevant biology is catalogued.
- Adding or replacing verifiers could target additional single-cell features such as cell-type specificity or temporal dynamics.
Load-bearing premise
The four verifiers accurately capture biological consistency at the single-cell level without introducing systematic bias or overlooking key response features.
What would settle it
Wet-lab experiments that measure actual transcriptional responses of cells to the same perturbations and compare them directly against the scores assigned by the four verifiers on PerturbCellRL outputs.
Figures
read the original abstract
Single-cell perturbation models can reduce costly wet-lab screening by predicting how cells respond transcriptionally to interventions. While recent generative models improve population-level prediction, individual generated cells are not explicitly checked for biological consistency. We introduce PerturbCellRL, a reinforcement learning (RL) framework that post-trains a pretrained single-cell transcriptomic generator using a suite of cell-level verifiers as rewards. These verifiers define four rewards: Pearson top-k similarity, RMSE top-k proximity, DE Spearman, and Pathway activity. The Pathway activity verifier rewards cells whose pathway responses match known perturbation biology. We evaluate PerturbCellRL on multiple genetic and chemical perturbation benchmarks. Across these benchmarks, PerturbCellRL improves over the pretrained flow-matching generator on reward-aligned evaluation metrics and a held-out evaluation metric. Moreover, PerturbCellRL remains competitive with state-of-the-art methods on population-level metrics. Together, these results frame trustworthy single-cell prediction as verifier-guided generative alignment, moving beyond matching expression distributions toward predictions whose single-cell perturbation effects are explicitly checked for biological consistency.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces PerturbCellRL, a reinforcement learning framework for post-training a pretrained flow-matching generator on single-cell transcriptomic perturbation data. Rewards are defined by four independent cell-level verifiers (Pearson top-k similarity, RMSE top-k proximity, DE Spearman correlation, and Pathway activity matching known perturbation biology). The central claim is that this verifier-guided alignment yields improvements over the base generator on both reward-aligned metrics and a held-out evaluation metric, while remaining competitive with state-of-the-art methods on population-level statistics across genetic and chemical perturbation benchmarks.
Significance. If the verifiers prove faithful, the work offers a concrete route to enforce single-cell biological consistency in generative perturbation models rather than relying solely on distributional matching. The use of external, independent verifiers is a methodological strength that avoids obvious circularity between reward and evaluation.
major comments (2)
- [Abstract (verifier definitions and evaluation claims)] The load-bearing claim that the four verifiers accurately capture biological consistency at the single-cell level without systematic bias is not accompanied by any ablation, sensitivity analysis, or comparison against alternative biological readouts in the provided abstract; this directly affects whether the reported gains on reward-aligned and held-out metrics can be attributed to improved biological fidelity.
- [Abstract (results paragraph)] No quantitative results, statistical tests, or data-split details are supplied to support the statements that PerturbCellRL 'improves over the pretrained flow-matching generator' and 'remains competitive with state-of-the-art methods'; without these, the magnitude and robustness of the claimed gains cannot be assessed.
minor comments (1)
- The abstract would benefit from naming the specific benchmarks and the identity of the held-out evaluation metric.
Simulated Author's Rebuttal
We thank the referee for their comments on the abstract of our manuscript. We address each major comment below.
read point-by-point responses
-
Referee: [Abstract (verifier definitions and evaluation claims)] The load-bearing claim that the four verifiers accurately capture biological consistency at the single-cell level without systematic bias is not accompanied by any ablation, sensitivity analysis, or comparison against alternative biological readouts in the provided abstract; this directly affects whether the reported gains on reward-aligned and held-out metrics can be attributed to improved biological fidelity.
Authors: We agree that the abstract itself does not include ablations, sensitivity analyses, or comparisons to alternative readouts. The full manuscript presents these validations in Sections 4.3 (verifier design and biological grounding) and 5.2 (sensitivity and alternative readout comparisons), where we show the verifiers align with known perturbation biology without evident circularity. Due to abstract length constraints, such details are summarized rather than expanded. We will revise the abstract to include a short clause referencing the validation performed in the main text and supplement. revision: partial
-
Referee: [Abstract (results paragraph)] No quantitative results, statistical tests, or data-split details are supplied to support the statements that PerturbCellRL 'improves over the pretrained flow-matching generator' and 'remains competitive with state-of-the-art methods'; without these, the magnitude and robustness of the claimed gains cannot be assessed.
Authors: The abstract omits specific numbers, tests, and split details to preserve readability and emphasize the methodological framing. All quantitative results, including effect sizes, statistical tests, and data-split protocols, appear in Tables 1–3, Figure 2, and Section 3 of the main text. We will revise the abstract to incorporate one or two key quantitative statements (e.g., average improvement on held-out metric) while respecting length limits. revision: yes
Circularity Check
No significant circularity; derivation is self-contained
full rationale
The paper describes a standard RL post-training loop that maximizes fixed external verifiers (Pearson top-k, RMSE top-k, DE Spearman, Pathway activity) on a pretrained generator. Reported gains on reward-aligned metrics are expected by construction of RL, but the central claims also include improvement on a held-out metric and competitiveness on independent population-level metrics. No equations, fitted parameters renamed as predictions, self-citations, or ansatzes are present in the provided text that reduce the result to its inputs. The verifiers are defined on external biological criteria and are not shown to be constructed from the same data or loop they evaluate.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Predicting cellular responses to perturbation across diverse contexts with state.BioRxiv, pages 2025–06, 2025
Abhinav K Adduri, Dhruv Gautam, Beatrice Bevilacqua, Alishba Imran, Rohan Shah, Mohsen Naghipourfar, Noam Teyssier, Rajesh Ilango, Sanjay Nagaraj, Mingze Dong, et al. Predicting cellular responses to perturbation across diverse contexts with state.BioRxiv, pages 2025–06, 2025. 3, 7
2025
-
[2]
Modelling cellular perturbations with the sparse additive mechanism shift variational autoencoder
Michael Bereket and Theofanis Karaletsos. Modelling cellular perturbations with the sparse additive mechanism shift variational autoencoder. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors, Advances in Neural Information Processing Systems, volume 36, pages 1–12, 2023. 3
2023
-
[3]
Training Diffusion Models with Reinforcement Learning
Kevin Black, Michael Janner, Yilun Du, Ilya Kostrikov, and Sergey Levine. Training diffusion models with reinforcement learning.arXiv preprint arXiv:2305.13301, 2023. 3
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[4]
How to build the virtual cell with artificial intelligence: Priorities and opportunities.Cell, 2024
Charlotte Bunne, Yusuf Roohani, Yanay Rosen, Ankit Gupta, Xikun Zhang, Marcel Roed, Theo Alexandrov, Mohammed AlQuraishi, Patricia Brennan, Daniel B Burkhardt, et al. How to build the virtual cell with artificial intelligence: Priorities and opportunities.Cell, 2024. 1
2024
-
[5]
Learning single-cell perturbation responses using neural optimal transport.Nature methods, 20(11):1759–1768, 2023
Charlotte Bunne, Stefan G Stark, Gabriele Gut, Jacobo Sarabia Del Castillo, Mitch Levesque, Kjong-Van Lehmann, Lucas Pelkmans, Andreas Krause, and Gunnar R ¨atsch. Learning single-cell perturbation responses using neural optimal transport.Nature methods, 20(11):1759–1768, 2023. 1
2023
-
[6]
Training Verifiers to Solve Math Word Problems
Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168, 2021. 3, 7
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[7]
Dpok: Reinforcement learning for fine-tuning text-to-image diffusion models.Advances in Neural Information Processing Systems, 36:79858–79885, 2023
Ying Fan, Olivia Watkins, Yuqing Du, Hao Liu, Moonkyung Ryu, Craig Boutilier, Pieter Abbeel, Mohammad Ghavamzadeh, Kangwook Lee, and Kimin Lee. Dpok: Reinforcement learning for fine-tuning text-to-image diffusion models.Advances in Neural Information Processing Systems, 36:79858–79885, 2023. 3
2023
-
[8]
Building the next generation of virtual cells to understand cellular biology.Biophysical Journal, 2023
Graham T Johnson, Eran Agmon, Matthew Akamatsu, Emma Lundberg, Blair Lyons, Wei Ouyang, Omar A Quintero-Carmona, Megan Riel-Mehan, Susanne Rafelski, and Rick Horwitz. Building the next generation of virtual cells to understand cellular biology.Biophysical Journal, 2023. 1
2023
-
[9]
Cellflow enables generative single-cell phenotype modeling with flow matching.bioRxiv, pages 2025–04, 2025
Dominik Klein, Jonas Simon Fleck, Daniil Bobrovskiy, Lea Zimmermann, S¨oren Becker, Alessandro Palma, Le- ander Dony, Alejandro Tejada-Lapuerta, Guillaume Huguet, Hsiu-Chuan Lin, et al. Cellflow enables generative single-cell phenotype modeling with flow matching.bioRxiv, pages 2025–04, 2025. 1, 3, 7
2025
-
[10]
MixGRPO: Unlocking Flow-based GRPO Efficiency with Mixed ODE-SDE
Junzhe Li, Yutao Cui, Tao Huang, Yinping Ma, Chun Fan, Miles Yang, and Zhao Zhong. Mixgrpo: Unlocking flow-based grpo efficiency with mixed ode-sde.arXiv preprint arXiv:2507.21802, 2025. 3
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[11]
Let’s verify step by step
Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. InThe twelfth international conference on learning representations, 2023. 3, 7
2023
-
[12]
Flow matching for generative modeling
Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le. Flow matching for generative modeling. InThe Eleventh International Conference on Learning Representations, 2023. 3
2023
-
[13]
Yaron Lipman, Marton Havasi, Peter Holderrieth, Neta Shaul, Matt Le, Brian Karrer, Ricky TQ Chen, David Lopez-Paz, Heli Ben-Hamu, and Itai Gat. Flow matching guide and code.arXiv preprint arXiv:2412.06264,
work page internal anchor Pith review Pith/arXiv arXiv
-
[14]
3 10 PerturbCellRLA PREPRINT
-
[15]
Flow-GRPO: Training Flow Matching Models via Online RL
Jie Liu, Gongye Liu, Jiajun Liang, Yangguang Li, Jiaheng Liu, Xintao Wang, Pengfei Wan, Di Zhang, and Wanli Ouyang. Flow-grpo: Training flow matching models via online rl.arXiv preprint arXiv:2505.05470, 2025. 2, 3
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[16]
Flow straight and fast: Learning to generate and transfer data with rectified flow
Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. InICLR, 2023. 3
2023
-
[17]
Deep generative modeling for single-cell transcriptomics.Nature methods, 15(12):1053–1058, 2018
Romain Lopez, Jeffrey Regier, Michael B Cole, Michael I Jordan, and Nir Yosef. Deep generative modeling for single-cell transcriptomics.Nature methods, 15(12):1053–1058, 2018. 3
2018
-
[18]
Predicting cellular responses to complex perturbations in high-throughput screens.Molecular systems biology, 19(6):MSB202211517, 2023
Mohammad Lotfollahi, Anna Klimovskaia Susmelj, Carlo De Donno, Leon Hetzel, Yuge Ji, Ignacio L Ibarra, Sanjay R Srivatsan, Mohsen Naghipourfar, Riza M Daza, Beth Martin, et al. Predicting cellular responses to complex perturbations in high-throughput screens.Molecular systems biology, 19(6):MSB202211517, 2023. 3, 7
2023
-
[19]
Inference-Time Scaling for Diffusion Models beyond Scaling Denoising Steps
Nanye Ma, Shangyuan Tong, Haolin Jia, Hexiang Hu, Yu-Chuan Su, Mingda Zhang, Xuan Yang, Yandong Li, Tommi Jaakkola, Xuhui Jia, et al. Inference-time scaling for diffusion models beyond scaling denoising steps. arXiv preprint arXiv:2501.09732, 2025. 3
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[20]
Combi-seq for multiplexed transcriptome-based profiling of drug combinations using deterministic barcoding in single-cell droplets.Nature communications, 13(1):4450,
Lukas Mathur, B Szalai, NH Du, Ramesh Utharala, Martine Ballinger, JJM Landry, M Ryckelynck, Vladimir Benes, Julio Saez-Rodriguez, and Christoph A Merten. Combi-seq for multiplexed transcriptome-based profiling of drug combinations using deterministic barcoding in single-cell droplets.Nature communications, 13(1):4450,
-
[21]
Exploring genetic interaction manifolds constructed from rich single-cell phenotypes
Thomas M Norman, Max A Horlbeck, Joseph M Replogle, Alex Y Ge, Albert Xu, Marco Jost, Luke A Gilbert, and Jonathan S Weissman. Exploring genetic interaction manifolds constructed from rich single-cell phenotypes. Science, 365(6455):786–793, 2019. 1, 3, 7
2019
-
[22]
scperturb: harmonized single-cell perturbation data.Nature Methods, 21(3):531–540, 2024
Stefan Peidli, Tessa D Green, Ciyue Shen, Torsten Gross, Joseph Min, Samuele Garda, Bo Yuan, Linus J Schu- macher, Jake P Taylor-King, Debora S Marks, et al. scperturb: harmonized single-cell perturbation data.Nature Methods, 21(3):531–540, 2024. 3, 14
2024
-
[23]
Dr.vae: im- proving drug response prediction via modeling of drug perturbation effects.Bioinformatics, 35(19):3743–3751, 03 2019
Ladislav Ramp ´aˇsek, Daniel Hidru, Petr Smirnov, Benjamin Haibe-Kains, and Anna Goldenberg. Dr.vae: im- proving drug response prediction via modeling of drug perturbation effects.Bioinformatics, 35(19):3743–3751, 03 2019. 3
2019
-
[24]
Mapping information-rich genotype-phenotype landscapes with genome-scale perturb-seq.Cell, 185(14):2559–2575, 2022
Joseph M Replogle, Reuben A Saunders, Angela N Pogson, Jeffrey A Hussmann, Alexander Lenail, Alina Guna, Lauren Mascibroda, Eric J Wagner, Karen Adelman, Gila Lithwick-Yanai, et al. Mapping information-rich genotype-phenotype landscapes with genome-scale perturb-seq.Cell, 185(14):2559–2575, 2022. 1
2022
-
[25]
Predicting transcriptional outcomes of novel multigene per- turbations with gears.Nature Biotechnology, 42(6):927–935, 2024
Yusuf Roohani, Kexin Huang, and Jure Leskovec. Predicting transcriptional outcomes of novel multigene per- turbations with gears.Nature Biotechnology, 42(6):927–935, 2024. 3, 7
2024
-
[26]
Virtual cell challenge: Toward a turing test for the virtual cell.Cell, 188(13):3370–3374, 2025
Yusuf H Roohani, Tony J Hua, Po-Yuan Tung, Lexi R Bounds, Feiqiao B Yu, Alexander Dobin, Noam Teyssier, Abhinav Adduri, Alden Woodrow, Brian S Plosky, et al. Virtual cell challenge: Toward a turing test for the virtual cell.Cell, 188(13):3370–3374, 2025. 3
2025
-
[27]
Perturbation-response genes reveal signaling footprints in cancer gene expression.Nature communications, 9(1):20, 2018
Michael Schubert, Bertram Klinger, Martina Kl ¨unemann, Anja Sieber, Florian Uhlitz, Sascha Sauer, Mathew J Garnett, Nils Bl ¨uthgen, and Julio Saez-Rodriguez. Perturbation-response genes reveal signaling footprints in cancer gene expression.Nature communications, 9(1):20, 2018. 2, 5
2018
-
[28]
Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters
Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling llm test-time compute optimally can be more effective than scaling model parameters.arXiv preprint arXiv:2408.03314, 2024. 2, 3, 7
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[29]
Systema: a framework for evaluating genetic perturbation response prediction beyond system- atic variation.Nature Biotechnology, pages 1–10, 2025
Ramon Vi ˜nas Torn´e, Maciej Wiatrak, Zoe Piran, Shuyang Fan, Liangze Jiang, Sarah A Teichmann, Mor Nitzan, and Maria Brbi´c. Systema: a framework for evaluating genetic perturbation response prediction beyond system- atic variation.Nature Biotechnology, pages 1–10, 2025. 2
2025
-
[30]
CellFluxRL: Biologically-Constrained Virtual Cell Modeling via Reinforcement Learning
Dongxia Wu, Shiye Su, Yuhui Zhang, Elaine Sui, Emma Lundberg, Emily B Fox, and Serena Yeung- Levy. Cellfluxrl: Biologically-constrained virtual cell modeling via reinforcement learning.arXiv preprint arXiv:2603.21743, 2026. 3
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[31]
DanceGRPO: Unleashing GRPO on Visual Generation
Zeyue Xue, Jie Wu, Yu Gao, Fangyuan Kong, Lingting Zhu, Mengzhao Chen, Zhiheng Liu, Wei Liu, Qiushan Guo, Weilin Huang, et al. Dancegrpo: Unleashing grpo on visual generation.arXiv preprint arXiv:2505.07818,
work page internal anchor Pith review Pith/arXiv arXiv
-
[32]
Chenglei Yu, Chuanrui Wang, Bangyan Liao, and Tailin Wu. scdfm: Distributional flow matching model for robust single-cell perturbation prediction.arXiv preprint arXiv:2602.07103, 2026. 1, 3, 7, 8
-
[33]
Cellflux: Simulating cellular morphology changes via flow matching
Yuhui Zhang, Yuchang Su, Chenyu Wang, Tianhong Li, Zoe Wefers, Jeffrey Nirschl, James Burgess, Daisy Ding, Alejandro Lozano, Emma Lundberg, et al. Cellflux: Simulating cellular morphology changes via flow matching. arXiv preprint arXiv:2502.09775, 2025. 3 11 PerturbCellRLA PREPRINT
-
[34]
DiffusionNFT: Online Diffusion Reinforcement with Forward Process
Kaiwen Zheng, Huayu Chen, Haotian Ye, Haoxiang Wang, Qinsheng Zhang, Kai Jiang, Hang Su, Stefano Ermon, Jun Zhu, and Ming-Yu Liu. Diffusionnft: Online diffusion reinforcement with forward process.arXiv preprint arXiv:2509.16117, 2025. 2, 3, 6 12 PerturbCellRLA PREPRINT Algorithm 1PerturbCellRL: Verifier-Guided RL for scDFM Require:Pretrained scDFM velocit...
work page internal anchor Pith review Pith/arXiv arXiv 2025
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.