Knowledge Graphs and Reasoning LLMs for Finding Simple Yet Effective Transcriptomic Perturbation Predictors

Jake Fawkes; Jason Hartford; Liam Hodgson

arxiv: 2606.08816 · v1 · pith:5XUFCDXQnew · submitted 2026-06-07 · 💻 cs.LG · cs.AI

Pith reviewed 2026-06-27 18:29 UTC · model grok-4.3

keywords knowledge graphs · transcriptomic perturbation prediction · K-nearest neighbour · reinforcement learning · large language models · gene knockouts · differential expression · virtual cell models

The pith

A simple K-nearest neighbour model from a biological knowledge graph outperforms most methods on out-of-distribution transcriptomic perturbation prediction, and RL-trained reasoning LLMs reach equivalent performance to state-of-the-art methods.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that biological knowledge graphs supply a useful notion of similarity between gene perturbations, so that a basic K-nearest neighbour lookup already produces strong predictions of gene-expression changes for knockouts never seen in training. This K-nearest neighbour baseline beats almost all competing approaches on out-of-distribution test cases. When a reasoning large language model is further trained with reinforcement learning to edit the chosen neighbourhood, its accuracy on the Replogle et al. (2022) cell lines matches current state-of-the-art models. The same reinforcement-learning step also raises the model's accuracy on the separate task of differential-expression prediction even though that task was never used as a training signal. The results indicate that knowledge graphs can function as effective priors and that reinforcement learning can turn language models into more reliable predictors of complex biological outcomes.

Core claim

A K-nearest neighbour approach that draws neighbours from a biological knowledge graph achieves highly competitive performance on out-of-distribution transcriptomic perturbation prediction. When a reasoning LLM is trained via reinforcement learning to make changes to the selected neighbourhood, the resulting model reaches performance equivalent to current state-of-the-art methods on the cell lines studied by Replogle et al. (2022). The reinforcement-learning stage additionally improves the LLM's accuracy on differential expression prediction despite receiving no direct supervision on that task.

What carries the argument

K-nearest neighbour lookup over a biological knowledge graph, refined by a reasoning LLM whose neighbourhood selections are optimised through reinforcement learning for predictive accuracy.
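
The lookup described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the knowledge graph is available as a mapping from each gene to an ordered list of neighbouring genes, and that each training knockout has an observed expression-change vector. The names `knn_predict`, `kg_neighbours`, `train_deltas`, the gene labels, and the fallback to the mean training response are all hypothetical.

```python
from statistics import fmean


def knn_predict(target, kg_neighbours, train_deltas, k=5):
    """Predict the expression change of an unseen knockout by averaging
    the observed changes of its knowledge-graph neighbours."""
    # Keep only neighbours that were perturbed in training, then take the first k.
    seen = [g for g in kg_neighbours.get(target, []) if g in train_deltas][:k]
    # Assumed fallback: with no usable neighbour, average every training response.
    donors = seen or list(train_deltas)
    vectors = [train_deltas[g] for g in donors]
    # Element-wise mean across the donor vectors.
    return [fmean(column) for column in zip(*vectors)]


# Toy data: three training knockouts, each with a 3-gene response vector.
train_deltas = {
    "GENE_A": [1.0, 0.0, -1.0],
    "GENE_B": [3.0, 0.0, -3.0],
    "GENE_C": [0.0, 2.0, 0.0],
}
# Hypothetical graph; GENE_Z has no training data and is skipped.
kg_neighbours = {"GENE_X": ["GENE_A", "GENE_B", "GENE_Z"], "GENE_Y": []}

pred_x = knn_predict("GENE_X", kg_neighbours, train_deltas, k=2)
pred_y = knn_predict("GENE_Y", kg_neighbours, train_deltas)
```

In this sketch the RL-trained LLM would act only on the neighbour list passed in, leaving the averaging step unchanged.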

If this is right

The K-nearest neighbour model beats almost all methods on out-of-distribution perturbation prediction.
RL-trained reasoning LLMs achieve performance equivalent to state-of-the-art methods on the tested cell lines.
The same RL training improves downstream differential expression prediction without direct supervision on it.
Knowledge graphs function as model priors that aid extrapolation beyond the training set of perturbations.
LLMs can be refined into generalizable tools for predicting complex biological responses.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Knowledge graphs could be combined with other model architectures to improve generalization in additional biological prediction settings.
The neighbourhood-selection approach might extend to chemical or CRISPR perturbations beyond single-gene knockouts.
Reinforcement learning on predictive accuracy could be tested on other scientific domains that require extrapolation from limited labeled data.

Load-bearing premise

Biological knowledge graphs supply a notion of similarity between perturbations that supports reliable extrapolation to unseen perturbations.
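
One way to probe this premise, sketched under assumptions: compare how strongly a gene's response correlates with its knowledge-graph neighbours against how strongly it correlates with all other perturbed genes. The function names, the toy vectors, and the choice of a mean difference as the summary are illustrative assumptions, not the paper's analysis.

```python
from math import sqrt
from statistics import fmean


def pearson(x, y):
    """Pearson correlation of two equal-length vectors (0.0 if either is constant)."""
    mx, my = fmean(x), fmean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den if den else 0.0


def neighbour_enrichment(target, deltas, kg_neighbours):
    """Mean correlation with KG neighbours minus mean correlation with all
    other genes. Positive values support the similarity premise."""
    others = [g for g in deltas if g != target]
    background = fmean(pearson(deltas[target], deltas[g]) for g in others)
    nbrs = [g for g in kg_neighbours.get(target, []) if g in deltas and g != target]
    if not nbrs:
        return None  # nothing to compare against
    local = fmean(pearson(deltas[target], deltas[g]) for g in nbrs)
    return local - background


# Toy response vectors (made up for illustration).
deltas = {
    "T": [1.0, 2.0, 3.0],
    "N": [2.0, 4.0, 6.0],
    "U": [3.0, 2.0, 1.0],
    "V": [1.0, 0.0, 1.0],
}
enrichment = neighbour_enrichment("T", deltas, {"T": ["N"]})
```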

What would settle it

A new collection of out-of-distribution perturbations on which the plain K-nearest neighbour model falls below the performance of the majority of other methods would falsify the central performance claim.
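
The falsification condition can be written as a small check. This sketch assumes a single higher-is-better score per method on the new perturbation set; the method names and numbers are placeholders, not results from the paper.

```python
def falsifies_claim(scores, model="kg_knn"):
    """True if `model` scores strictly below more than half of the other methods."""
    others = [s for name, s in scores.items() if name != model]
    worse_than = sum(scores[model] < s for s in others)
    return worse_than > len(others) / 2


# Placeholder numbers for illustration only.
toy_scores = {"kg_knn": 0.60, "method_a": 0.55, "method_b": 0.62, "method_c": 0.50}
claim_falsified = falsifies_claim(toy_scores)
```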

Figures

Figures reproduced from arXiv: 2606.08816 by Jake Fawkes, Jason Hartford, Liam Hodgson.

**Figure 1.** Method Visualization. We provide LLMs with access to biological knowledge graphs and prompt them to choose genes for perturbation performance. We then optimize them via reinforcement learning using predictive performance as a reward. This finds a simple predictive model which performs at the level of state-of-the-art models.

**Figure 2.** Hierarchical model of cell responses. Our model abstracts the relationship between observed correlations (left) and underlying protein complexes (center) (Replogle et al., 2022). Genes in the same complex correlate highly; submodules are reflected in correlation blocks. This is represented as a causal DAG (right). This is clearly a simplification of biology, but it makes a useful testable prediction that p…

**Figure 3.** Surprising Effectiveness of Related Genes. This figure compares the knowledge graph neighbour predictor on the cell lines from Replogle et al. (2022) against more complex perturbation prediction models in GEARS (Roohani et al., 2024b), scLAMBDA (Wang et al., 2024), TxPert (Wenkel et al., 2025) and an LLM-based approach in Langpert (Märtens et al.) using Claude Sonnet. We can see that the graph neighbour ou…

**Figure 4.** Graph neighbour predictor performance. (a) Our predictor is competitive with experimental reproducibility (Pearson ∆); shaded areas denote 60% and 90% performance gap quantiles, which account for the majority of the gap. (b) Model performance correlates strongly with perturbation magnitude, meaning more discernible perturbations are easier to predict. (c) Knowledge graph neighbourhoods capture high-correlati…

**Figure 5.** Differential expression prediction performance. The post-trained model consistently outperforms the base model in the tasks from Wu et al. (2025) with and without SUMMER-based prompting.

6.2 Transfer In Reasoning Across Cell Lines. In order to assess how the reasoning learnt by the LLM generalises across cell lines, we repeat the same task training on only 3 of the 4 cell lines. We choose to leave out K562…

**Figure 6.** Plots of the by gene Pearson delta for the experimental reproducibility vs the graph neighbour

**Figure 7.** Ridge plots for the HepG2 cell line comparing the distribution of Pearson correlations between target genes (y-axis) and potential predictor genes (x-axis). The left column (“All Training Genes”) displays the background correlation distribution against all genes in the training set; the right column (“KG Neighbours Only”) displays the distribution restricted to genes identified as neighbours in the Knowled…

**Figure 8.** Ridge plots for the RPE1 cell line comparing Pearson correlation distributions. As in

**Figure 9.** Ridge plots for the K562 cell line comparing the distribution of Pearson correlations. The target-predictor relationship is shown for the full training set (left) versus restricted Knowledge Graph neighbours (right), demonstrating the correlation density shift for this specific cell line.

**Figure 10.** Ridge plots for the Jurkat cell line comparing Pearson correlation distributions between target genes and potential predictor genes. The data illustrates the contrast between the total training gene background (left) and the targeted Knowledge Graph neighbour distribution (right).

**Figure 11.** Knowledge Graph neighbourhood size versus predictive performance. Genes are stratified

**Figure 12.** Predictive performance versus perturbation magnitude. The metrics display diverging trends

**Figure 13.** Predictive performance versus expression-control correlation. While the Pearson delta metric

Original abstract

Predicting the effect of an unseen gene knockout perturbation on transcriptomic gene expression remains a highly challenging problem for virtual cell models. Recent progress has been made by leveraging biological knowledge graphs to provide a notion of similar perturbation, allowing for improved extrapolation beyond the set of training perturbations. In this work, we demonstrate that the simplest model to leverage these assumptions - a K-nearest neighbour from the knowledge graph - achieves highly competitive performance on this task, and that this can be improved further using LLMs optimised via reinforcement learning (RL) for predictive performance. Specifically, we find that the K-nearest neighbour approach beats almost all methods on out-of-distribution perturbation prediction, and when a reasoning LLM is trained via RL to make changes to the neighbourhood, it obtains equivalent performance to current state of the art methods on the cell lines from Replogle et al. (2022). We also demonstrate that the RL training improves the LLM's performance on the downstream task of differential expression prediction, despite not being trained on this directly. Overall, these findings demonstrate the efficacy of knowledge graphs as model priors, and show early signs that RL can refine LLMs into generalizable tools for predicting complex biological responses.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

A basic KG KNN already competes on out-of-distribution transcriptomic perturbation prediction, and RL can tune an LLM to SOTA while helping a downstream task.

The key takeaway is that a plain K-nearest neighbor lookup on a biological knowledge graph already gives highly competitive results for predicting the transcriptomic effects of unseen gene knockouts, and that an RL-optimized reasoning LLM can adjust the neighborhood to reach parity with state-of-the-art methods on the Replogle et al. 2022 cell lines while also improving performance on differential expression prediction.

What the work does well is test the value of the knowledge graph as a similarity prior in a direct, empirical way. The finding that the simple KNN beats almost all other methods on out-of-distribution cases is a clear result, and showing that the RL step helps on a task it was not explicitly trained for adds an interesting angle. Using the Replogle dataset keeps the evaluation grounded in real experimental data rather than synthetic benchmarks.

The soft spots are mostly around missing details. The abstract does not spell out the exact baselines, the construction of the knowledge graph, the data splitting procedure, or any statistical testing, so it is hard to judge how much of the performance comes from the KG itself versus implementation choices. The RL training description is high-level, leaving open questions about the reward design and whether the neighborhood adjustments are stable across different runs or cell lines. These are addressable in a full version but currently limit how strongly the claims can be evaluated.

This paper is for people working on virtual cell modeling, gene perturbation prediction, and ways to combine structured biological knowledge with language models. A reader focused on practical, lower-complexity approaches to extrapolation in biology would get value from it. It deserves a serious referee because the central empirical claim is falsifiable and relevant to an active area, even if the current write-up needs more methodological transparency to stand on its own.

I would send this to peer review.

Referee Report

3 major / 3 minor

Summary. The paper claims that a K-nearest-neighbor lookup over a biological knowledge graph already yields strong out-of-distribution performance for predicting transcriptomic effects of unseen gene-knockout perturbations, and that reinforcement learning can be used to train a reasoning LLM to refine neighborhood selection, reaching parity with current SOTA on the Replogle et al. (2022) cell lines while also improving a downstream differential-expression task without direct supervision on that metric.

Significance. If the empirical results hold under rigorous validation, the work would establish a simple, interpretable baseline that leverages existing biological knowledge graphs as a usable similarity prior, while showing that RL can adapt LLMs for biological prediction in a manner that transfers to related tasks. This would be a useful contribution to the virtual-cell modeling literature.

major comments (3)

[Abstract and §4 (Experimental Setup)] The claim that the K-nearest-neighbour approach 'beats almost all methods' on out-of-distribution perturbation prediction is presented without naming the full set of baselines, reporting effect sizes, or providing statistical significance tests; this information is load-bearing for the central empirical claim.
[§5 (RL Training) and Methods] The reward function and state representation used when the LLM modifies the KG-derived neighbourhood are not described, preventing assessment of whether the reported parity with SOTA is attributable to the KG prior, the RL objective, or both.
[§4.2 (Evaluation Protocol)] Details on how out-of-distribution splits are constructed (e.g., perturbation-level vs. cell-line-level hold-out) and whether any leakage exists between the KG and the test perturbations are missing; these choices directly affect the extrapolation claim.

minor comments (3)

[Abstract] The phrase 'current state of the art methods' should be accompanied by explicit citations to the Replogle et al. (2022) baselines being compared against.
[Results figures/tables] Figure 2 or Table 3 (whichever reports the main results): axis labels and legend entries should explicitly state the metric (e.g., Pearson correlation on log-fold changes) and the exact number of test perturbations.
[Throughout] Notation: define 'KG' and 'RL' on first use in the main text even if they appear in the abstract.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback, which highlights areas where additional details will strengthen the manuscript. We will revise the paper to provide the requested information on baselines, RL components, and evaluation protocol. This addresses the major comments without altering the core claims.

Referee: [Abstract and §4 (Experimental Setup)] The claim that the K-nearest-neighbour approach 'beats almost all methods' on out-of-distribution perturbation prediction is presented without naming the full set of baselines, reporting effect sizes, or providing statistical significance tests; this information is load-bearing for the central empirical claim.

Authors: We agree that the full set of baselines, effect sizes, and statistical tests must be reported to support the central claim. In the revised manuscript, we will expand §4 with a table enumerating all baselines (including recent virtual cell models), report effect sizes such as Cohen's d on MSE differences across cell lines, and include p-values from Wilcoxon signed-rank tests. This will allow rigorous evaluation of the K-NN performance. revision: yes
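
The two statistics named in this response can be sketched in plain Python. This is an illustration under stated assumptions: errors are paired per cell line (or per perturbation), Cohen's d is computed on the paired differences, and the Wilcoxon signed-rank p-value is obtained by exact enumeration, which only holds when there are no zero differences and no tied absolute differences. The function names and the numbers are placeholders, not results from the paper.

```python
from itertools import product
from statistics import fmean, stdev


def paired_cohens_d(errors_a, errors_b):
    """Cohen's d on paired differences: mean difference over its sample SD."""
    diffs = [a - b for a, b in zip(errors_a, errors_b)]
    return fmean(diffs) / stdev(diffs)


def wilcoxon_signed_rank_exact(errors_a, errors_b):
    """Exact two-sided Wilcoxon signed-rank test by enumerating sign patterns.
    Assumes no zero differences and no tied absolute differences."""
    diffs = [a - b for a, b in zip(errors_a, errors_b)]
    n = len(diffs)
    # Rank the differences by absolute size, smallest first.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    rank_of = {idx: rank for rank, idx in enumerate(order, start=1)}
    w_plus = sum(rank_of[i] for i, d in enumerate(diffs) if d > 0)
    total = n * (n + 1) // 2
    observed = min(w_plus, total - w_plus)
    # Under the null, each rank is positive or negative with probability 1/2.
    hits = 0
    for signs in product((0, 1), repeat=n):
        s = sum(rank for rank, keep in zip(range(1, n + 1), signs) if keep)
        hits += min(s, total - s) <= observed
    return w_plus, hits / 2 ** n


# Placeholder MSE values for two methods on six paired evaluations.
knn_mse = [0.10, 0.12, 0.09, 0.15, 0.11, 0.16]
baseline_mse = [0.14, 0.13, 0.15, 0.17, 0.16, 0.13]

d = paired_cohens_d(knn_mse, baseline_mse)
w_plus, p_value = wilcoxon_signed_rank_exact(knn_mse, baseline_mse)
```

The enumeration visits 2^n sign patterns, so it is only practical for small n; a library routine such as `scipy.stats.wilcoxon` would be the usual choice for larger samples.
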

Referee: [§5 (RL Training) and Methods] The reward function and state representation used when the LLM modifies the KG-derived neighbourhood are not described, preventing assessment of whether the reported parity with SOTA is attributable to the KG prior, the RL objective, or both.

Authors: We acknowledge the omission and will add a dedicated Methods subsection. The state includes the current KG neighborhood, perturbation embedding, and gene annotations; the reward is negative MSE on a held-out validation set of transcriptomic profiles plus a diversity bonus. The action space allows add/remove/replace operations on neighbors. These details will clarify the RL contribution relative to the KG prior. revision: yes
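
The add/remove/replace actions and the reward described in this response can be sketched as below. The response does not define the diversity bonus, so the version here (share of distinct annotation groups among the chosen neighbours, scaled by `bonus_weight`) is purely an assumption, as are all function names, gene labels, and values.

```python
from statistics import fmean


def apply_edit(neighbourhood, op, gene=None, new_gene=None):
    """Return a new neighbour list after an add / remove / replace edit.
    Edits that do not apply leave the list unchanged."""
    nbrs = list(neighbourhood)
    if op == "add" and gene not in nbrs:
        nbrs.append(gene)
    elif op == "remove" and gene in nbrs:
        nbrs.remove(gene)
    elif op == "replace" and gene in nbrs and new_gene not in nbrs:
        nbrs[nbrs.index(gene)] = new_gene
    return nbrs


def reward(neighbourhood, train_deltas, true_delta, groups, bonus_weight=0.1):
    """Negative MSE of the neighbour-average prediction plus a diversity bonus."""
    donors = [g for g in neighbourhood if g in train_deltas]
    if not donors:
        return float("-inf")  # assumed penalty for an unusable neighbourhood
    pred = [fmean(col) for col in zip(*(train_deltas[g] for g in donors))]
    mse = fmean((p - t) ** 2 for p, t in zip(pred, true_delta))
    # Assumed bonus: fraction of distinct annotation groups among the donors.
    diversity = len({groups.get(g) for g in donors}) / len(donors)
    return -mse + bonus_weight * diversity


# Toy validation example (made-up values).
train_deltas = {"A": [1.0, 0.0], "B": [3.0, 0.0], "C": [2.0, 2.0]}
groups = {"A": "complex_1", "B": "complex_1", "C": "complex_2"}
true_delta = [2.0, 0.0]

start = ["A", "C"]
edited = apply_edit(start, "replace", gene="C", new_gene="B")
r_start = reward(start, train_deltas, true_delta, groups)
r_edited = reward(edited, train_deltas, true_delta, groups)
```

In this toy case the edit trades some diversity for a lower error, and the reward rises overall.
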

Referee: [§4.2 (Evaluation Protocol)] Details on how out-of-distribution splits are constructed (e.g., perturbation-level vs. cell-line-level hold-out) and whether any leakage exists between the KG and the test perturbations are missing; these choices directly affect the extrapolation claim.

Authors: We will revise §4.2 to specify perturbation-level hold-outs (unseen gene knockouts with no training overlap) and separate cell-line hold-outs. The KG is built from static public databases predating the datasets; we will add checks confirming no direct leakage of test perturbation effects and discuss potential indirect overlaps. A split diagram will be included. revision: yes
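
A perturbation-level hold-out of the kind described here can be sketched as follows. The split fraction, the seed, and the function names are illustrative assumptions; the helper only confirms that predictions for a held-out gene draw on training genes, and does not address indirect overlap through the knowledge graph's own sources.

```python
import random


def perturbation_holdout(genes, test_fraction=0.2, seed=0):
    """Split perturbed genes into disjoint train and test sets."""
    rng = random.Random(seed)
    shuffled = sorted(genes)  # sort first so the split is reproducible
    rng.shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return set(shuffled[n_test:]), set(shuffled[:n_test])


def usable_neighbours(test_gene, kg_neighbours, train_genes):
    """KG neighbours of a held-out gene that have training data."""
    return [g for g in kg_neighbours.get(test_gene, []) if g in train_genes]


genes = [f"GENE_{i}" for i in range(10)]
train_genes, test_genes = perturbation_holdout(genes)

# Hypothetical graph entry: one neighbour in training, one outside the dataset.
held_out = sorted(test_genes)[0]
a_train_gene = sorted(train_genes)[0]
kg_neighbours = {held_out: [a_train_gene, "NOT_IN_DATA"]}
donors = usable_neighbours(held_out, kg_neighbours, train_genes)
```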

Circularity Check

0 steps flagged

No significant circularity

The paper's central claims rest on empirical comparisons of a K-nearest-neighbour lookup over a biological knowledge graph against prior methods, plus an RL refinement step, evaluated on external datasets (Replogle et al. 2022 cell lines). No derivation chain, equation, or 'prediction' reduces to its own inputs by construction; the modelling choice (KG similarity prior) is explicitly the hypothesis being tested rather than an unexamined precondition, and results are reported as out-of-distribution performance metrics without self-referential fitting or self-citation load-bearing.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no details on specific free parameters, axioms or invented entities.

pith-pipeline@v0.9.1-grok · 5740 in / 1224 out tokens · 26212 ms · 2026-06-27T18:29:52.739535+00:00 · methodology

Reference graph

Works this paper leans on

24 extracted references · 8 canonical work pages · 1 internal anchor

[1] doi: 10.1038/s41592-025-02772-6. URL https://doi.org/10.1038/s41592-025-02772-6. B. Alberts, A. Johnson, J. Lewis, M. Raff, K. Roberts, and P. Walter. Molecular Biology of the Cell. Garland Science, New York, NY, 4th edition,

[2] ISBN 9780815332181. I. Bendidi, S. T. Whitfield, K. Kenyon-Dean, H. B. Yedder, Y. El Mesbahi, E. Noutahi, and A. K. Denton. Benchmarking transcriptomics foundation models for perturbation analysis: one pca still rules them all. In NeurIPS 2024 Workshop on AI for New Drug Modalities. G. Brixi, M. G. Durrant, J. Ku, M. Poli, G. Brockman, D. Chang, G. A. Gonz...

[3] doi: 10.1038/s41592-023-01969-x. URL https://doi.org/10.1038/s41592-023-01969-x. C. Bunne, Y. Roohani, Y. Rosen, A. Gupta, X. Zhang, M. Roed, T. Alexandrov, M. AlQuraishi, P. Brennan, D. Burkhardt, A. Califano, J. Cool, A. Dernburg, K. Ewing, E. Fox, M. Haury, A. Herr, E. Horvitz, P. Hsu, V. Jain, G. Johnson, T. Kalil, D. Kelley, S. Kelley, A. Kreshuk, T...

[4] doi: 10.1016/j.cell.2024.11.015. Y. Chen and J. Zou. Genept: A simple but effective foundation model for genes and cells built from chatgpt. bioRxiv, Mar

[5] ISSN 2692-8205 (Electronic); 2692-8205 (Linking). doi: 10.1101/2023.10.16.562533. G. O. Consortium. The gene ontology (go) database and informatics resource. Nucleic acids research, 32(suppl 1):D258–D261,

[6] doi: 10.1038/s41592-024-02201-0. URL https://doi.org/10.1038/s41592-024-02201-0. A. Fallahpour, A. Magnuson, P. Gupta, S. Ma, J. Naimer, A. Shah, H. Duan, O. Ibrahim, H. Goodarzi, C. J. Maddison, and B. Wang. Bioreason: Incentivizing multimodal biological reasoning within a DNA-LLM model. In The Thirty-ninth Annual Conference on Neural Information Processin...

[7] URL https://arxiv.org/abs/2404.16907. P. Intellect. Prime-rl,

[8] URL https://github.com/PrimeIntellect-ai/prime-rl. A.-M. Istrate, F. Milletari, F. Castrotorres, J. M. Tomczak, M. Torkar, D. Li, and T. Karaletsos. rbio1 - training scientific reasoning llms with biological world models as soft verifiers. bioRxiv, pages 2025–08,

[9] T. Korbak, E. Perez, and C. Buckley. Rl with kl penalties is better viewed as bayesian inference. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 1083–1091,

[10] D. Levine, S. A. Rizvi, S. Lévy, N. Pallikkavaliyaveetil, D. Zhang, X. Chen, S. Ghadermarzi, R. Wu, Z. Zheng, I. Vrkic, et al. Cell2sentence: teaching large language models the language of biology. bioRxiv, pages 2023–09,

[11] Q. Liu, Q. Zhang, J. Du, S. Zhao, and J. Wang. Effects of distance metrics and scaling on the perturbation discrimination score. arXiv preprint arXiv:2511.16954,

[12] doi: 10.1038/s41592-019-0494-8. M. Lotfollahi, M. Naghipourfar, M. D. Luecken, M. Khajavi, M. Büttner, Ž. Avsec, A. Gayoso, N. Yosef, F. J. Theis, and R. Lopez. Predicting cellular responses to complex perturbations in high-throughput screens. Molecular Systems Biology, 19(5):e11517,

[13] doi: 10.15252/msb.202211517. K. Märtens, M. B. Martell, C. A. Prada-Medina, and R. Donovan-Maiye. Langpert: Llm-driven contextual synthesis for unseen perturbation prediction. In ICLR 2025 Workshop on Machine Learning for Genomics Explorations. K. Murphy. Reinforcement learning: an overview. arXiv preprint arXiv:2412.05265,

[14] E. Noutahi, J. Hartford, P. Tossou, S. Whitfield, A. K. Denton, C. Wognum, K. Ulicna, M. Craig, J. Hsu, M. Cuccarese, et al. Virtual cells: Predict, explain, discover. arXiv preprint arXiv:2505.14613,

[15] Y. Roohani, K. Huang, and J. Leskovec. Predicting transcriptional outcomes of novel multigene perturbations with gears. Nature Biotechnology, 42(6):927–935, 2024a. doi: 10.1038/s41587-023-01905-6. URL https://doi.org/10.1038/s41587-023-01905-6. Y. Roohani, K. Huang, and J. Leskovec. Predicting transcriptional outcomes of novel multigene perturbations with...

[16] Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y. Li, Y. Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300,

[17] D. Szklarczyk, R. Kirsch, M. Koutrouli, K. Nastou, F. Mehryary, R. Hachilif, A. L. Gable, T. Fang, N. T. Doncheva, S. Pyysalo, et al. The string database in 2023: protein–protein association networks and functional enrichment analyses for any sequenced genome of interest. Nucleic acids research, 51(D1):D638–D646,

[18] X. Tang, Z. Yu, J. Chen, Y. Cui, D. Shao, W. Wang, F. Wu, Y. Zhuang, W. Shi, Z. Huang, et al. Cellforge: agentic design of virtual cell models. arXiv preprint arXiv:2508.02276,

[19] Spring 2012 Edition. F. Wenkel, W. Tu, C. Masschelein, H. Shirzad, C. Eastwood, S. T. Whitfield, I. Bendidi, C. Russell, L. Hodgson, Y. E. Mesbahi, et al. Txpert: Leveraging biochemical relationships for out-of-distribution transcriptomic perturbation prediction. arXiv preprint arXiv:2505.14919,

[20] M. Wu, R. Littman, J. Levine, L. Qiu, T. Biancalani, D. Richmond, and J.-C. Huetter. Contextualizing biological perturbation experiments through language. In The Thirteenth International Conference on Learning Representations. M. Wu, R. Littman, J. Levine, L. Qiu, T. Biancalani, D. Richmond, and J.-C. Huetter. Contextualizing biological perturbation experi...

[21] Y. Wu, E. Wershof, S. M. Schmon, M. Nassar, B. Osiński, R. Eksi, Z. Yan, R. Stark, K. Zhang, and T. Graepel. Perturbench: Benchmarking machine learning models for cellular perturbation analysis. arXiv preprint arXiv:2408.10609,

[22] C. Zheng, S. Liu, M. Li, X.-H. Chen, B. Yu, C. Gao, K. Dang, Y. Liu, R. Men, A. Yang, et al. Group sequence policy optimization. arXiv preprint arXiv:2507.18071,
[23] URL https://arxiv.org/abs/2409.07594.

| Model | K562 Pear. ∆ (↑) | K562 Retr. (↑) | RPE1 Pear. ∆ (↑) | RPE1 Retr. (↑) | HEPG2 Pear. ∆ (↑) | HEPG2 Retr. (↑) | Jurkat Pear. ∆ (↑) | Jurkat Retr. (↑) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LLM, RL + KG on 4 Cell lines | 0.617 ± .014 | 0.840 ± .013 | 0.671 ± .014 | 0.643 ± .017 | 0.485 ± .016 | 0.655 ± .017 | 0.504 ± .014 | 0.734 ± .016 |
| LLM, RL + KG on 3 Cell lines | 0.608 ± .015 | 0.834 ± .013 | 0.666 ± .014 | 0.639 ± .017 | 0.479 ± .016 | 0.6... | ... | ... |
[24] Scores are reported as Mean ± se for standard errors. All metrics are calculated relative to the predicted and true deltas.

A Experiment Details

A.1 Metric Definitions

$$
r(\hat{\delta}_p, \bar{\delta}_p) = \frac{\sum_i \bigl(\hat{\delta}_{p,i} - \mu(\hat{\delta}_p)\bigr)\bigl(\bar{\delta}_{p,i} - \mu(\bar{\delta}_p)\bigr)}{\sqrt{\sum_i \bigl(\hat{\delta}_{p,i} - \mu(\hat{\delta}_p)\bigr)^2 \, \sum_i \bigl(\bar{\delta}_{p,i} - \mu(\bar{\delta}_p)\bigr)^2}}, \tag{2}
$$

where µ(·) is the mean of the vector over its entries. We extend this to matrices by computing it in ...
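
The metric in Eq. (2) is a Pearson correlation between predicted and true deltas, which can be sketched as below. The sketch assumes a delta is the perturbed expression minus the control expression for each gene; that definition, the function names, and the toy vectors are assumptions for illustration.

```python
from math import sqrt
from statistics import fmean


def delta(perturbed, control):
    """Assumed definition: per-gene change relative to control."""
    return [p - c for p, c in zip(perturbed, control)]


def pearson_delta(pred_delta, true_delta):
    """Pearson correlation between predicted and true expression changes."""
    mp, mt = fmean(pred_delta), fmean(true_delta)
    num = sum((p - mp) * (t - mt) for p, t in zip(pred_delta, true_delta))
    den = sqrt(
        sum((p - mp) ** 2 for p in pred_delta)
        * sum((t - mt) ** 2 for t in true_delta)
    )
    return num / den


# Toy expression vectors over four genes (made up).
control = [1.0, 1.0, 1.0, 1.0]
true_d = delta([2.0, 1.0, 0.0, 1.0], control)
pred_d = delta([3.0, 1.0, -1.0, 1.0], control)
score = pearson_delta(pred_d, true_d)
```

Because the correlation is scale-invariant, a prediction with the right direction but twice the magnitude still scores 1.0 here.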
