Knowledge Graphs and Reasoning LLMs for Finding Simple Yet Effective Transcriptomic Perturbation Predictors
Pith reviewed 2026-06-27 18:29 UTC · model grok-4.3
The pith
A simple K-nearest neighbour model from a biological knowledge graph outperforms most methods on out-of-distribution transcriptomic perturbation prediction, and RL-trained reasoning LLMs reach equivalent performance to state-of-the-art meth
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
A K-nearest neighbour approach that draws neighbours from a biological knowledge graph achieves highly competitive performance on out-of-distribution transcriptomic perturbation prediction. When a reasoning LLM is trained via reinforcement learning to make changes to the selected neighbourhood, the resulting model reaches performance equivalent to current state-of-the-art methods on the cell lines studied by Replogle et al. (2022). The reinforcement-learning stage additionally improves the LLM's accuracy on differential expression prediction despite receiving no direct supervision on that task.
What carries the argument
K-nearest neighbour lookup over a biological knowledge graph, refined by a reasoning LLM whose neighbourhood selections are optimised through reinforcement learning for predictive accuracy.
If this is right
- The K-nearest neighbour model beats almost all methods on out-of-distribution perturbation prediction.
- RL-trained reasoning LLMs achieve performance equivalent to state-of-the-art methods on the tested cell lines.
- The same RL training improves downstream differential expression prediction without direct supervision on it.
- Knowledge graphs function as model priors that aid extrapolation beyond the training set of perturbations.
- LLMs can be refined into generalizable tools for predicting complex biological responses.
Where Pith is reading between the lines
- Knowledge graphs could be combined with other model architectures to improve generalization in additional biological prediction settings.
- The neighbourhood-selection approach might extend to chemical or CRISPR perturbations beyond single-gene knockouts.
- Reinforcement learning on predictive accuracy could be tested on other scientific domains that require extrapolation from limited labeled data.
Load-bearing premise
Biological knowledge graphs supply a notion of similarity between perturbations that supports reliable extrapolation to unseen perturbations.
What would settle it
A new collection of out-of-distribution perturbations on which the plain K-nearest neighbour model falls below the performance of the majority of other methods would falsify the central performance claim.
Figures
read the original abstract
Predicting the effect of an unseen gene knockout perturbation on transcriptomic gene expression remains a highly challenging problem for virtual cell models. Recent progress has been made by leveraging biological knowledge graphs to provide a notion of similar perturbation, allowing for improved extrapolation beyond the set of training perturbations. In this work, we demonstrate that the simplest model to leverage these assumptions - a K-nearest neighbour from the knowledge graph - achieves highly competitive performance on this task, and that this can be improved further using LLMs optimised via reinforcement learning (RL) for predictive performance. Specifically, we find that the K-nearest neighbour approach beats almost all methods on out-of-distribution perturbation prediction, and when a reasoning LLM is trained via RL to make changes to the neighbourhood, it obtains equivalent performance to current state of the art methods on the cell lines from Replogle et al. (2022). We also demonstrate that the RL training improves the LLM's performance on the downstream task of differential expression prediction, despite not being trained on this directly. Overall, these findings demonstrate the efficacy of knowledge graphs as model priors, and show early signs that RL can refine LLMs into generalizable tools for predicting complex biological responses.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that a K-nearest-neighbor lookup over a biological knowledge graph already yields strong out-of-distribution performance for predicting transcriptomic effects of unseen gene-knockout perturbations, and that reinforcement learning can be used to train a reasoning LLM to refine neighborhood selection, reaching parity with current SOTA on the Replogle et al. (2022) cell lines while also improving a downstream differential-expression task without direct supervision on that metric.
Significance. If the empirical results hold under rigorous validation, the work would establish a simple, interpretable baseline that leverages existing biological knowledge graphs as a usable similarity prior, while showing that RL can adapt LLMs for biological prediction in a manner that transfers to related tasks. This would be a useful contribution to the virtual-cell modeling literature.
major comments (3)
- [Abstract and §4] Abstract and §4 (Experimental Setup): the claim that the K-nearest-neighbour approach 'beats almost all methods' on out-of-distribution perturbation prediction is presented without naming the full set of baselines, reporting effect sizes, or providing statistical significance tests; this information is load-bearing for the central empirical claim.
- [§5] §5 (RL Training) and Methods: the reward function and state representation used when the LLM modifies the KG-derived neighbourhood are not described, preventing assessment of whether the reported parity with SOTA is attributable to the KG prior, the RL objective, or both.
- [§4.2] §4.2 (Evaluation Protocol): details on how out-of-distribution splits are constructed (e.g., perturbation-level vs. cell-line-level hold-out) and whether any leakage exists between the KG and the test perturbations are missing; these choices directly affect the extrapolation claim.
minor comments (3)
- [Abstract] Abstract: the phrase 'current state of the art methods' should be accompanied by explicit citations to the Replogle et al. (2022) baselines being compared against.
- [Results figures/tables] Figure 2 or Table 3 (whichever reports the main results): axis labels and legend entries should explicitly state the metric (e.g., Pearson correlation on log-fold changes) and the exact number of test perturbations.
- [Throughout] Notation: define 'KG' and 'RL' on first use in the main text even if they appear in the abstract.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback, which highlights areas where additional details will strengthen the manuscript. We will revise the paper to provide the requested information on baselines, RL components, and evaluation protocol. This addresses the major comments without altering the core claims.
read point-by-point responses
-
Referee: [Abstract and §4] Abstract and §4 (Experimental Setup): the claim that the K-nearest-neighbour approach 'beats almost all methods' on out-of-distribution perturbation prediction is presented without naming the full set of baselines, reporting effect sizes, or providing statistical significance tests; this information is load-bearing for the central empirical claim.
Authors: We agree that the full set of baselines, effect sizes, and statistical tests must be reported to support the central claim. In the revised manuscript, we will expand §4 with a table enumerating all baselines (including recent virtual cell models), report effect sizes such as Cohen's d on MSE differences across cell lines, and include p-values from Wilcoxon signed-rank tests. This will allow rigorous evaluation of the K-NN performance. revision: yes
-
Referee: [§5] §5 (RL Training) and Methods: the reward function and state representation used when the LLM modifies the KG-derived neighbourhood are not described, preventing assessment of whether the reported parity with SOTA is attributable to the KG prior, the RL objective, or both.
Authors: We acknowledge the omission and will add a dedicated Methods subsection. The state includes the current KG neighborhood, perturbation embedding, and gene annotations; the reward is negative MSE on a held-out validation set of transcriptomic profiles plus a diversity bonus. The action space allows add/remove/replace operations on neighbors. These details will clarify the RL contribution relative to the KG prior. revision: yes
-
Referee: [§4.2] §4.2 (Evaluation Protocol): details on how out-of-distribution splits are constructed (e.g., perturbation-level vs. cell-line-level hold-out) and whether any leakage exists between the KG and the test perturbations are missing; these choices directly affect the extrapolation claim.
Authors: We will revise §4.2 to specify perturbation-level hold-outs (unseen gene knockouts with no training overlap) and separate cell-line hold-outs. The KG is built from static public databases predating the datasets; we will add checks confirming no direct leakage of test perturbation effects and discuss potential indirect overlaps. A split diagram will be included. revision: yes
Circularity Check
No significant circularity
full rationale
The paper's central claims rest on empirical comparisons of a K-nearest-neighbour lookup over a biological knowledge graph against prior methods, plus an RL refinement step, evaluated on external datasets (Replogle et al. 2022 cell lines). No derivation chain, equation, or 'prediction' reduces to its own inputs by construction; the modelling choice (KG similarity prior) is explicitly the hypothesis being tested rather than an unexamined precondition, and results are reported as out-of-distribution performance metrics without self-referential fitting or self-citation load-bearing.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
URLhttps://doi.org/10.1038/s41592-025-02772-6
doi: 10.1038/s41592-025-02772-6. URLhttps://doi.org/10.1038/s41592-025-02772-6. B. Alberts, A. Johnson, J. Lewis, M. Raff, K. Roberts, and P. Walter.Molecular Biology of the Cell. Garland Science, New York, NY, 4th edition,
-
[2]
ISBN 9780815332181. I. Bendidi, S. T. Whitfield, K. Kenyon-Dean, H. B. Yedder, Y. El Mesbahi, E. Noutahi, and A. K. Denton. Benchmarking transcriptomics foundation models for perturbation analysis: one pca still rules them all. InNeurIPS 2024 Workshop on AI for New Drug Modalities. G. Brixi, M. G. Durrant, J. Ku, M. Poli, G. Brockman, D. Chang, G. A. Gonz...
2024
-
[3]
doi: 10.1038/s41592-023-01969-x. URL https://doi.org/10. 1038/s41592-023-01969-x. C. Bunne, Y. Roohani, Y. Rosen, A. Gupta, X. Zhang, M. Roed, T. Alexandrov, M. AlQuraishi, P. Brennan, D. Burkhardt, A. Califano, J. Cool, A. Dernburg, K. Ewing, E. Fox, M. Haury, A. Herr, E. Horvitz, P. Hsu, V. Jain, G. Johnson, T. Kalil, D. Kelley, S. Kelley, A. Kreshuk, T...
-
[4]
doi: 10.1016/j.cell.2024.11.015. Y. Chen and J. Zou. Genept: A simple but effective foundation model for genes and cells built from chatgpt.bioRxiv, Mar
-
[5]
ISSN 2692-8205 (Electronic); 2692-8205 (Linking). doi: 10.1101/2023.10. 16.562533. G. O. Consortium. The gene ontology (go) database and informatics resource.Nucleic acids research, 32(suppl 1):D258–D261,
-
[6]
URLhttps://doi.org/10.1038/s41592-024-02201-0
doi: 10.1038/s41592-024-02201-0. URLhttps://doi.org/10.1038/s41592-024-02201-0. A. Fallahpour, A. Magnuson, P. Gupta, S. Ma, J. Naimer, A. Shah, H. Duan, O. Ibrahim, H. Goodarzi, C. J. Maddison, and B. WANG. Bioreason: Incentivizing multimodal biological reasoning within a DNA-LLM model. InThe Thirty-ninth Annual Conference on Neural Information Processin...
-
[7]
URL https://arxiv.org/ abs/2404.16907. P. Intellect. Prime-rl,
-
[8]
URLhttps://github.com/PrimeIntellect-ai/prime-rl. A.-M. Istrate, F. Milletari, F. Castrotorres, J. M. Tomczak, M. Torkar, D. Li, and T. Karaletsos. rbio1- training scientific reasoning llms with biological world models as soft verifiers.bioRxiv, pages 2025– 08,
2025
-
[9]
Korbak, E
T. Korbak, E. Perez, and C. Buckley. Rl with kl penalties is better viewed as bayesian inference. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 1083–1091,
2022
-
[10]
Levine, S
D. Levine, S. A. Rizvi, S. L´ evy, N. Pallikkavaliyaveetil, D. Zhang, X. Chen, S. Ghadermarzi, R. Wu, Z. Zheng, I. Vrkic, et al. Cell2sentence: teaching large language models the language of biology. BioRxiv, pages 2023–09,
2023
-
[11]
Q. Liu, Q. Zhang, J. Du, S. Zhao, and J. Wang. Effects of distance metrics and scaling on the perturbation discrimination score.arXiv preprint arXiv:2511.16954,
-
[12]
doi: 10.1038/s41592-019-0494-8. M. Lotfollahi, M. Naghipourfar, M. D. Luecken, M. Khajavi, M. B”uttner, ˇZ. Avsec, A. Gayoso, N. Yosef, F. J. Theis, and R. Lopez. Predicting cellular responses to complex perturbations in high- throughput screens.Molecular Systems Biology, 19(5):e11517,
-
[13]
doi: 10.15252/msb.202211517. K. M¨ artens, M. B. Martell, C. A. Prada-Medina, and R. Donovan-Maiye. Langpert: Llm-driven contextual synthesis for unseen perturbation prediction. InICLR 2025 Workshop on Machine Learning for Genomics Explorations. K. Murphy. Reinforcement learning: an overview.arXiv preprint arXiv:2412.05265,
-
[14]
E. Noutahi, J. Hartford, P. Tossou, S. Whitfield, A. K. Denton, C. Wognum, K. Ulicna, M. Craig, J. Hsu, M. Cuccarese, et al. Virtual cells: Predict, explain, discover.arXiv preprint arXiv:2505.14613,
-
[15]
Proximal Policy Optimization Algorithms
Y. Roohani, K. Huang, and J. Leskovec. Predicting transcriptional outcomes of novel multigene pertur- bations with gears.Nature Biotechnology, 42(6):927–935, 2024a. doi: 10.1038/s41587-023-01905-6. URLhttps://doi.org/10.1038/s41587-023-01905-6. Y. Roohani, K. Huang, and J. Leskovec. Predicting transcriptional outcomes of novel multigene perturbations with...
work page internal anchor Pith review Pith/arXiv arXiv doi:10.1038/s41587-023-01905-6
-
[16]
Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y. Li, Y. Wu, et al. Deepseek- math: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300,
-
[17]
Szklarczyk, R
D. Szklarczyk, R. Kirsch, M. Koutrouli, K. Nastou, F. Mehryary, R. Hachilif, A. L. Gable, T. Fang, N. T. Doncheva, S. Pyysalo, et al. The string database in 2023: protein–protein association networks and functional enrichment analyses for any sequenced genome of interest.Nucleic acids research, 51 (D1):D638–D646,
2023
-
[18]
X. Tang, Z. Yu, J. Chen, Y. Cui, D. Shao, W. Wang, F. Wu, Y. Zhuang, W. Shi, Z. Huang, et al. Cellforge: agentic design of virtual cell models.arXiv preprint arXiv:2508.02276,
-
[19]
Spring 2012 Edition. F. Wenkel, W. Tu, C. Masschelein, H. Shirzad, C. Eastwood, S. T. Whitfield, I. Bendidi, C. Russell, L. Hodgson, Y. E. Mesbahi, et al. Txpert: Leveraging biochemical relationships for out-of-distribution transcriptomic perturbation prediction.arXiv preprint arXiv:2505.14919,
arXiv 2012
-
[20]
M. Wu, R. Littman, J. Levine, L. Qiu, T. Biancalani, D. Richmond, and J.-C. Huetter. Contextualizing biological perturbation experiments through language. InThe Thirteenth International Conference on Learning Representations. M. Wu, R. Littman, J. Levine, L. Qiu, T. Biancalani, D. Richmond, and J.-C. Huetter. Contextualizing biological perturbation experi...
-
[21]
Y. Wu, E. Wershof, S. M. Schmon, M. Nassar, B. Osi´ nski, R. Eksi, Z. Yan, R. Stark, K. Zhang, and T. Graepel. Perturbench: Benchmarking machine learning models for cellular perturbation analysis. arXiv preprint arXiv:2408.10609,
-
[22]
C. Zheng, S. Liu, M. Li, X.-H. Chen, B. Yu, C. Gao, K. Dang, Y. Liu, R. Men, A. Yang, et al. Group sequence policy optimization.arXiv preprint arXiv:2507.18071,
-
[23]
K562 RPE1 HEPG2 Jurkat ModelPear
URL https://arxiv.org/abs/ 2409.07594. K562 RPE1 HEPG2 Jurkat ModelPear. ∆(↑) Retr. (↑) Pear. ∆(↑) Retr. (↑) Pear. ∆(↑) Retr. (↑) Pear. ∆(↑) Retr. (↑) LLM, RL + KG on 4 Cell lines 0.617 ±.014 0.840±.013 0.671±.014 0.643±.017 0.485±.016 0.655±.017 0.504±.014 0.734±.016 LLM, RL + KG on 3 Cell lines 0.608 ±.015 0.834±.013 0.666±.014 0.639±.017 0.479±.016 0.6...
-
[24]
All Training Genes
Scores are reported as Mean ±se for standard errors. All metrics are calculated relative to the predicted and true deltas. A Experiment Details A.1 Metric Definitions r(ˆδp, δp) = P ˆδp,i −µ( ˆδp) ¯δp,i −µ( ¯δp) r P ˆδp,i −µ( ˆδp) 2P ¯δp,i −µ( ¯δp) 2 ,(2) where µ(·) is the mean of the vector over its entries. We extend this to matrices by computing it in ...
2025
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.