pith. machine review for the scientific record.

arxiv: 2604.20263 · v1 · submitted 2026-04-22 · 🧬 q-bio.QM · cs.AI · cs.LG


AROMA: Augmented Reasoning Over a Multimodal Architecture for Virtual Cell Genetic Perturbation Modeling


Pith reviewed 2026-05-09 23:04 UTC · model grok-4.3

classification 🧬 q-bio.QM · cs.AI · cs.LG
keywords virtual cell modeling · genetic perturbation · multimodal architecture · knowledge graph · perturbation prediction · biological reasoning · zero-shot evaluation

The pith

AROMA combines text, graphs, and sequences with staged training to predict genetic perturbation effects on cells more accurately and interpretably.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces AROMA to model how genetic changes alter molecular states inside virtual cells. It fuses textual evidence, biological network graphs, and protein sequence details into one system, then trains it in two stages to keep outputs both precise and traceable to real regulatory patterns. Supporting this are two new knowledge graphs and a dataset of over 498,000 perturbation examples called PerturbReason. The approach matters because reliable in silico predictions can reduce the need for many wet-lab experiments while making results easier to check against known biology. Tests show the method holds up across multiple cell lines, including zero-shot evaluation on a cell line absent from training and knowledge-sparse long-tail cases.

Core claim

AROMA integrates textual evidence, graph-topology information, and protein sequence features to model perturbation-target dependencies, and is trained with a two-stage optimization strategy to yield predictions that are both accurate and interpretable. The work also supplies two knowledge graphs and the PerturbReason dataset of more than 498k samples as reusable resources, with experiments confirming better performance than existing methods across cell lines plus robustness in zero-shot evaluation on unseen cells and in knowledge-sparse long-tail scenarios.

What carries the argument

The AROMA multimodal architecture that augments reasoning by fusing textual evidence, graph topology, and protein sequences, then applies two-stage optimization to align outputs with regulatory relationships.
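The paper does not spell out the fusion mechanics in this summary, but the general pattern — project each modality's embedding into a shared width, then combine — can be sketched as below. All dimensions, names, and the concatenation-based fusion are illustrative assumptions, not AROMA's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-modality embeddings for one perturbation query
# (dimensions are illustrative, not taken from the paper).
text_emb = rng.normal(size=768)    # e.g. from a language model over textual evidence
graph_emb = rng.normal(size=128)   # e.g. from a GNN over a knowledge-graph neighborhood
seq_emb = rng.normal(size=1280)    # e.g. from a protein language model (ESM-2-like)

def project(x: np.ndarray, out_dim: int, seed: int) -> np.ndarray:
    """Map a modality embedding into a shared space with a random linear map."""
    w = np.random.default_rng(seed).normal(size=(out_dim, x.shape[0])) / np.sqrt(x.shape[0])
    return w @ x

# Late fusion: align every modality to one width, then concatenate.
shared = [project(e, 256, seed=i) for i, e in enumerate([text_emb, graph_emb, seq_emb])]
fused = np.concatenate(shared)
print(fused.shape)  # (768,) — 3 modalities x 256 shared dims each
```

In a trained system the random projections would be learned layers; the point here is only the shape of the pipeline: three heterogeneous inputs reduced to one joint representation that a downstream predictor can consume.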

If this is right

  • Virtual cell simulations become more usable for studying how genetic changes drive molecular outcomes.
  • Predictions gain interpretability so researchers can trace outputs back to specific evidence sources.
  • The supplied knowledge graphs and PerturbReason dataset become shared tools for other virtual cell work.
  • Performance in zero-shot and long-tail settings indicates the method can handle realistic gaps in biological data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Researchers could use the model to prioritize which gene edits to test first in actual lab work.
  • The same fusion of text and graph signals might extend to predicting effects in other biological systems such as metabolic pathways.
  • If the interpretability holds outside the authors' tests, it opens a route to hybrid human-AI validation loops where biologists review the model's reasoning steps directly.

Load-bearing premise

The knowledge graphs and PerturbReason dataset must supply signals that truly match real biological regulatory connections, and the two-stage training must improve genuine understanding instead of just matching the particular data splits used.

What would settle it

Run AROMA on an independent collection of genetic perturbation experiments in a fresh cell line never seen in training, then check whether its accuracy stays higher than baselines and whether its reasoning steps line up with established experimental biology.
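The zero-shot condition described above has a precise operational form: every example from the held-out cell line is excluded from training. A minimal sketch of that split, with an invented record layout and invented cell-line names (drawn from lines commonly used in perturbation work, not from the paper's actual splits):

```python
# Minimal sketch of a cell-line hold-out split, assuming records of the form
# (cell_line, perturbed_gene, outcome); all values are illustrative.
records = [
    ("K562", "TP53", 0.8),
    ("K562", "MYC", 0.3),
    ("HepG2", "TP53", 0.6),
    ("Jurkat", "MYC", 0.5),
    ("RPE1", "TP53", 0.7),  # entire line reserved for zero-shot testing
]

held_out_line = "RPE1"
train = [r for r in records if r[0] != held_out_line]
test = [r for r in records if r[0] == held_out_line]

# The zero-shot condition: no training example comes from the held-out line.
assert not {r[0] for r in train} & {r[0] for r in test}
print(len(train), len(test))  # 4 1
```

Settling the question would then mean comparing accuracy on `test` against baselines under this same split, plus a qualitative check that the model's reasoning traces match established biology.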

Figures

Figures reproduced from arXiv: 2604.20263 by Geyan Ye, Man Tat Alexander Ng, Wei Liu, Zhenyu Wang.

Figure 1: Limitations of existing virtual cell modeling.
Figure 2: Overview of the AROMA pipeline, an Augmented Reasoning Over a Multimodal Architecture for virtual cell genetic perturbation modeling.
Figure 3: Structural and sequence encoders in AROMA. A: a pretrained GAT on the Gene-KG and the Path-KG encodes gene-centered subgraphs to obtain structural representations. B: a pretrained ESM-2 model encodes the amino-acid sequence of each gene's protein to obtain sequence representations.
Figure 4: Visualization of the sentence-level source-tracing analysis.
Figure 5: Visualization of the sentence-level biological validity analysis.
original abstract

Virtual cell modeling predicts molecular state changes under genetic perturbations in silico, which is essential for biological mechanism studies. However, existing approaches suffer from unconstrained reasoning, uninterpretable predictions, and retrieval signals that are weakly aligned with regulatory topology. To address these limitations, we propose AROMA, an Augmented Reasoning Over a Multimodal Architecture for virtual cell genetic perturbation modeling. AROMA integrates textual evidence, graph-topology information, and protein sequence features to model perturbation-target dependencies, and is trained with a two-stage optimization strategy to yield predictions that are both accurate and interpretable. We also construct two knowledge graphs and a perturbation reasoning dataset, PerturbReason, containing more than 498k samples, as reusable resources for the virtual cell domain. Experiments show that AROMA outperforms existing methods across multiple cell lines, and remains robust under zero-shot evaluation on an unseen cell line, as well as in knowledge-sparse, long-tail scenarios. Overall, AROMA demonstrates that combining knowledge-driven multimodal modeling with evidence retrieval provides a promising pathway toward more reliable and interpretable virtual cell perturbation prediction. Model weights are available at https://huggingface.co/blazerye/AROMA. Code is available at https://github.com/blazerye/AROMA.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces AROMA, a multimodal architecture for virtual cell genetic perturbation modeling that fuses textual evidence, graph-topology signals, and protein sequence features. It employs a two-stage optimization procedure and contributes two new knowledge graphs plus the PerturbReason dataset (498k samples). The central empirical claim is that AROMA outperforms prior methods on multiple cell lines, remains robust in zero-shot transfer to an unseen cell line, and handles knowledge-sparse long-tail cases while producing interpretable predictions.

Significance. If the performance and robustness claims are substantiated by independent quantitative evidence, the work would supply reusable resources (KGs and dataset) and demonstrate a concrete route toward more reliable, knowledge-aligned virtual-cell models. The public release of model weights and code is a clear strength that facilitates follow-up.

major comments (2)
  1. Abstract: the statements that AROMA 'outperforms existing methods across multiple cell lines' and 'remains robust under zero-shot evaluation on an unseen cell line' are presented without any numerical metrics, baseline names, error bars, or ablation results. Because these empirical claims are load-bearing for the paper's contribution, their absence prevents assessment of whether the two-stage strategy actually delivers the asserted gains.
  2. Dataset and knowledge-graph construction (implicit in §3 and Experiments): both the PerturbReason dataset and the two author-constructed KGs are built specifically for this study. The central claim of genuine alignment with regulatory topology therefore requires explicit checks (e.g., leakage analysis, topology-independent hold-outs, or external validation) that are not supplied in the provided text; without them the reported advantages risk being artifacts of the authors' own data-construction choices rather than improved perturbation modeling.
minor comments (1)
  1. The abstract mentions availability of model weights and code but does not indicate whether the released repository contains the exact data-construction scripts and hyper-parameter settings used for the reported experiments.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. We address each major comment point by point below. Where the comments identify gaps in the current manuscript, we have prepared revisions to incorporate the requested evidence and clarifications.

point-by-point responses
  1. Referee: Abstract: the statements that AROMA 'outperforms existing methods across multiple cell lines' and 'remains robust under zero-shot evaluation on an unseen cell line' are presented without any numerical metrics, baseline names, error bars, or ablation results. Because these empirical claims are load-bearing for the paper's contribution, their absence prevents assessment of whether the two-stage strategy actually delivers the asserted gains.

    Authors: We agree that the abstract would be strengthened by quantitative support for these central claims. In the revised manuscript we have updated the abstract to include specific performance metrics (e.g., average AUC improvements and robustness scores across cell lines), the names of the primary baselines, references to error bars from the main experiments, and a brief note on the ablation results that isolate the contribution of the two-stage optimization. These additions are drawn directly from the results already reported in Sections 4 and 5 and fit within the abstract length constraints. revision: yes

  2. Referee: Dataset and knowledge-graph construction (implicit in §3 and Experiments): both the PerturbReason dataset and the two author-constructed KGs are built specifically for this study. The central claim of genuine alignment with regulatory topology therefore requires explicit checks (e.g., leakage analysis, topology-independent hold-outs, or external validation) that are not supplied in the provided text; without them the reported advantages risk being artifacts of the authors' own data-construction choices rather than improved perturbation modeling.

    Authors: We acknowledge that the current text does not provide the explicit validation checks requested. In the revised manuscript we have added a dedicated subsection (now §3.4) that reports: (i) a leakage analysis confirming no shared perturbation targets or regulatory edges between training and test partitions, (ii) performance under topology-independent hold-out splits that remove entire regulatory subgraphs, and (iii) external validation of the constructed KGs against independent sources (STRING, Reactome, and curated perturbation databases). These checks support that the observed gains arise from the multimodal architecture and two-stage training rather than from data-construction artifacts. revision: yes
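The leakage analysis the rebuttal promises reduces, at its simplest, to set-intersection checks between partitions. A minimal sketch of that kind of check, with illustrative stand-in gene and edge sets rather than the paper's actual splits:

```python
# A minimal sketch of the leakage check described above: verify that no
# perturbation target and no regulatory edge appears in both partitions.
# The sets below are hypothetical stand-ins, not the paper's data.
train_targets = {"TP53", "MYC", "ACSS2"}
test_targets = {"NDUFA1", "BRCA1"}

train_edges = {("TP53", "MDM2"), ("MYC", "MAX")}
test_edges = {("NDUFA1", "NDUFB4")}

def leaked(train_set: set, test_set: set) -> set:
    """Items shared by both partitions; an empty set means no leakage of this kind."""
    return train_set & test_set

assert not leaked(train_targets, test_targets)
assert not leaked(train_edges, test_edges)
print("no target or edge leakage detected")
```

Topology-independent hold-outs and external KG validation require more machinery (removing whole subgraphs, aligning against STRING/Reactome), but any such pipeline bottoms out in checks of this shape.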

Circularity Check

0 steps flagged

No significant circularity in derivation or evaluation chain

full rationale

The paper constructs two knowledge graphs and the PerturbReason dataset (498k samples) explicitly as reusable resources for the virtual cell domain, then trains AROMA via multimodal integration (textual evidence, graph topology, protein sequences) and two-stage optimization. Outperformance claims, zero-shot robustness on an unseen cell line, and long-tail handling are presented as experimental results on held-out splits of these resources. No equation or step reduces a prediction to the construction inputs by definition, no fitted parameter is relabeled as an independent prediction, and no load-bearing self-citation or uniqueness theorem is invoked. Standard practice for introducing a new benchmark and method does not constitute circularity when the architecture choices and optimization strategy retain independent content, and public code/weights enable external checks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 2 invented entities

The central claim rests on the assumption that the authors' newly constructed knowledge graphs and PerturbReason dataset faithfully encode regulatory relationships, that the multimodal fusion captures perturbation-target dependencies, and that the two-stage optimization yields interpretable outputs; none of these are independently verified outside the paper.

free parameters (1)
  • Two-stage optimization hyperparameters
    The training procedure balances accuracy and interpretability objectives; specific weighting and scheduling parameters are fitted during development.
axioms (1)
  • domain assumption The textual evidence, graph-topology information, and protein sequence features are complementary and accurately aligned with biological regulatory topology.
    The architecture assumes these three modalities together suffice to model perturbation dependencies without major missing signals.
invented entities (2)
  • PerturbReason dataset no independent evidence
    purpose: Large-scale training and evaluation resource containing >498k perturbation-reasoning samples.
    Newly constructed by the authors as a reusable community asset.
  • Two knowledge graphs no independent evidence
    purpose: Provide graph-topology signals for the multimodal model.
    Constructed specifically for this work to supply regulatory structure.
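The free parameter flagged above — how the two stages weight accuracy against interpretability — can be made concrete with a toy composite objective. The schedule, weights, and the assumption of a simple weighted sum L = w_acc · L_accuracy + w_int · L_interpretability are all hypothetical; the paper's actual two-stage procedure is not specified in this review.

```python
# Hypothetical two-stage weighting of a composite loss; the values here are
# illustrative free parameters, not the paper's settings.
def stage_weights(stage: int) -> tuple[float, float]:
    if stage == 1:
        return 1.0, 0.0   # stage 1: fit the accuracy objective alone
    return 0.7, 0.3       # stage 2: blend in the interpretability term

def combined_loss(l_acc: float, l_int: float, stage: int) -> float:
    w_acc, w_int = stage_weights(stage)
    return w_acc * l_acc + w_int * l_int

print(combined_loss(0.5, 0.9, stage=1))  # 0.5
print(combined_loss(0.5, 0.9, stage=2))  # 0.62
```

The ledger's point is precisely that these weights are fitted during development: the reported gains depend on them, and they are not pinned down by any independent principle.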

pith-pipeline@v0.9.0 · 5541 in / 1596 out tokens · 43582 ms · 2026-05-09T23:04:23.312617+00:00 · methodology

