pith. machine review for the scientific record.

arxiv: 2605.10876 · v1 · submitted 2026-05-11 · 💻 cs.LG · cs.AI · q-bio.QM

Recognition: no theorem link

AssayBench: An Assay-Level Virtual Cell Benchmark for LLMs and Agents

Alexander Wu, Carl Edwards, David Richmond, Edward De Brouwer, Ehsan Hajiramezanali, Gabriele Scalia, Graham Heimberg, Jan-Christian Hütter, Jenna Collier, Meena Subramaniam, Sara Mostafavi, Xiner Li

Pith reviewed 2026-05-12 04:43 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · q-bio.QM
keywords AssayBench · CRISPR screens · phenotypic screening · gene rank prediction · virtual cell · LLM benchmark · adjusted nDCG · in silico screening

The pith

Zero-shot generalist LLMs outperform biology-specific models and baselines on a benchmark of 1,920 CRISPR screens for gene-rank prediction.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds AssayBench from 1,920 public CRISPR screens across five phenotype classes and turns each screen into a gene-ranking task scored by adjusted nDCG. It then tests whether current LLMs and agents can predict which genes, when perturbed, would produce a given cellular phenotype. A sympathetic reader cares because this setup directly targets the in silico phenotypic screen that would let a virtual cell model replace many wet-lab experiments. The results show that off-the-shelf generalist LLMs already beat both domain-specific LLMs and trainable baselines in the zero-shot setting, while fine-tuning, ensembling, and prompt tuning raise performance further, though all methods remain well below an estimated performance ceiling.
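To make the query-and-parse loop such a benchmark implies tangible, here is a minimal sketch. The prompt shape, function names, and parsing rules are illustrative assumptions, not the paper's actual harness:

```python
def build_ranking_prompt(assay_description, criterion, genes):
    # Hypothetical prompt shape; the paper's actual prompts are its own.
    return (
        f"Screen: {assay_description}\n"
        f"Criterion: {criterion}\n"
        "Rank ALL of the following genes from strongest to weakest expected "
        "effect, one symbol per line:\n" + "\n".join(sorted(genes))
    )

def parse_ranking(model_output, genes):
    """Parse a model's free-text answer into a ranking over the known gene
    universe; genes the model never mentions are appended in sorted order so
    every screen yields a complete ranking that a metric can score."""
    universe = set(genes)
    seen, ranking = set(), []
    for token in model_output.replace(",", "\n").split():
        symbol = token.strip().upper()
        if symbol in universe and symbol not in seen:
            seen.add(symbol)
            ranking.append(symbol)
    ranking.extend(sorted(universe - seen))
    return ranking
```

Completing unmentioned genes deterministically keeps rankings comparable across models that answer with different verbosity.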

Core claim

AssayBench formulates phenotypic screen prediction as gene-rank prediction on 1,920 CRISPR assays and introduces adjusted nDCG to compare results across heterogeneous readouts. Zero-shot generalist LLMs exceed biology-specific LLMs and trainable baselines on this task; optimization techniques such as fine-tuning, ensembling, and prompt optimization raise scores further, yet all evaluated methods stay far from the empirically estimated ceiling.

What carries the argument

AssayBench, which converts each CRISPR screen into a gene-ranking problem scored by adjusted nDCG to enable consistent comparison across diverse phenotypic assays.
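As a concrete illustration of the scoring idea, here is a small nDCG@k sketch with a random-ranking adjustment. The rescaling below (subtracting the expected DCG of a uniformly random ranking) is an assumption chosen so that random scores near 0 and perfect scores 1; the paper defines its own adjustment term for heterogeneous assays.

```python
import math

def dcg(rels):
    # Discounted cumulative gain: graded relevance discounted by log2 of position.
    return sum(r / math.log2(i + 2) for i, r in enumerate(rels))

def adjusted_ndcg_at_k(ranked_genes, relevance, k=100):
    """nDCG@k rescaled against the expected score of a random ranking.

    `relevance` maps each gene in the screen to its observed hit strength.
    The adjustment here is illustrative, not the paper's exact formula.
    """
    rels = [relevance.get(g, 0.0) for g in ranked_genes[:k]]
    ideal = sorted(relevance.values(), reverse=True)[:k]
    # Expected DCG of a random ranking: mean relevance at every discounted slot.
    mean_rel = sum(relevance.values()) / len(relevance)
    expected = mean_rel * sum(1 / math.log2(i + 2) for i in range(len(ideal)))
    actual, best = dcg(rels), dcg(ideal)
    return (actual - expected) / (best - expected)
```

Under this rescaling a perfect ranking scores exactly 1 and a ranking that buries the hits can go below 0, which makes scores from screens with very different hit-rate profiles roughly comparable.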

If this is right

  • LLMs become a practical starting point for in silico phenotypic screens that could guide drug-discovery workflows.
  • AssayBench supplies a concrete, public yardstick for measuring incremental progress toward virtual-cell models.
  • Optimization methods already known for language tasks transfer directly to this biological ranking setting.
  • Continued improvement on the benchmark would narrow the gap between current models and the performance ceiling observed in the data.
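The ensembling lever can be made concrete with reciprocal rank fusion (Cormack et al., 2009, which the paper cites), a simple rule for merging several models' gene rankings. This is a sketch of one standard fusion method; the paper tunes its ensemble functions with Bayesian optimization rather than fixed RRF.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked gene lists: each list contributes 1/(k + rank)
    per gene, and genes are re-sorted by their summed score. `k=60` is the
    constant suggested by Cormack et al.; it damps any single top rank."""
    scores = {}
    for ranking in rankings:
        for pos, gene in enumerate(ranking):
            scores[gene] = scores.get(gene, 0.0) + 1.0 / (k + pos + 1)
    return sorted(scores, key=scores.get, reverse=True)
```

Because RRF uses only rank positions, it needs no score calibration across models, which is convenient when fusing LLM outputs with heuristic baselines.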

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • If generalist models keep improving on AssayBench, they may reduce the number of preliminary wet-lab screens required before moving to targeted experiments.
  • The benchmark could be extended to other perturbation modalities, such as small-molecule screens, once suitable public data become available.
  • Success on gene ranking does not yet prove the model understands causal mechanisms, so follow-up work would need to test whether top-ranked genes actually produce the predicted phenotype in new contexts.

Load-bearing premise

That turning these 1,920 public CRISPR screens into gene-rank prediction tasks with adjusted nDCG gives a faithful stand-in for the full phenotypic-screening capability a working virtual cell would need.

What would settle it

A new, held-out CRISPR screen performed in an unseen cell type or condition where the genes ranked highest by the best AssayBench model show no measurable phenotypic effect when actually perturbed.

Figures

Figures reproduced from arXiv: 2605.10876 by Alexander Wu, Carl Edwards, David Richmond, Edward De Brouwer, Ehsan Hajiramezanali, Gabriele Scalia, Graham Heimberg, Jan-Christian Hütter, Jenna Collier, Meena Subramaniam, Sara Mostafavi, Xiner Li.

Figure 1
Figure 1: Overview of the ASSAYBENCH benchmark creation. (A) Starting from 1971 human CRISPR screens, we perform data quality filtering, replicate merging, and data augmentation to obtain 1920 high-quality screens. (B) Phenotype composition of the database and its four splits. A realistic but challenging temporal split was used. (C) Given a description of the screen and a gene-ranking criterion, a model must provide … view at source ↗
Figure 2
Figure 2: (Left) AnDCG@k for the main models, colored by model category. (Right) Comparison of GEMINI 3 PRO performance with a technical-replicate baseline (Section 5.2; N = 32 technical-replicate screens). 5.2 Top-performing models remain far from the performance ceiling. Biology is inherently stochastic and experiments introduce further technical variability, raising the question of what performance ceiling can reasonabl… view at source ↗
Figure 3
Figure 3 stratifies AnDCG@100 of different models by phenotype. Predictive performance was highest for the viability screens, likely because their hit genes are enriched for conserved cellular dependencies that recur across screens. This recurrence also potentially explains why the phenotype-based frequency baseline is particularly strong for this class of screens. Other phenotypes, such as host-pathogen response o… view at source ↗
Figure 4
Figure 4: (Left) Scaling trend analysis on the Qwen3.5 family. Larger models lead to higher … view at source ↗
Figure 5
Figure 5: Biological bias of different models across gene sets. ASSAYBENCH is, to our knowledge, the first large-scale benchmark for phenotypic screen prediction. It also provides a testbed for evaluating LLMs and agents as surrogates for virtual cells, supporting progress in this area. A key design choice is to cast each assay as a single ranking problem rather than issuing one query per gene: with an average of … view at source ↗
Figure 6
Figure 6: Validation vs. test performance of ensemble functions produced by Bayesian optimization. view at source ↗
Figure 7
Figure 7: Qwen ∼8B model performance over time. view at source ↗
read the original abstract

Recent advances in machine learning and large-scale biological data collections have revived the prospect of building a virtual cell, a computational model of cellular behavior that could accelerate biological discovery. One of the most compelling promises of this vision is the ability to perform in silico phenotypic screens, in which a model predicts the effects of cellular perturbations in unseen biological contexts. This task combines heterogeneous textual inputs with diverse phenotypic outputs, making it particularly well-suited to LLMs and agentic systems. Yet, no standard benchmark currently exists for this task, as existing efforts focus on narrower molecular readouts that are only indirectly aligned with the phenotypic endpoints driving many real-world drug discovery workflows. In this work, we present AssayBench, a benchmark for phenotypic screen prediction, built from 1,920 publicly available CRISPR screens spanning five broad classes of cellular phenotypes. We formulate the screen prediction task as a gene rank prediction for each screen and introduce the adjusted nDCG, a continuous metric for comparing performance across heterogeneous assays. Our extensive evaluation shows that existing methods remain far from empirically estimated performance ceilings and zero-shot generalist LLMs outperform biology-specific LLMs and trainable baselines. Optimization techniques such as fine-tuning, ensembling, and prompt optimization can further improve LLM performance on this task. Overall, AssayBench offers a practical testbed for measuring progress toward in silico phenotypic screening and, more broadly, virtual cell models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces AssayBench, a benchmark for phenotypic screen prediction in virtual cell models, constructed from 1,920 publicly available CRISPR screens spanning five phenotype classes. It formulates the task as per-assay gene ranking and introduces adjusted nDCG as the evaluation metric. Extensive experiments show that zero-shot generalist LLMs outperform biology-specific LLMs and trainable baselines on this task, with further gains possible via fine-tuning, ensembling, and prompt optimization, while all methods remain far from empirically estimated performance ceilings.

Significance. If the benchmark is shown to be a valid proxy, this work would provide a much-needed standardized testbed for in silico phenotypic screening, a key capability for virtual cell models in drug discovery. The use of public data, the introduction of adjusted nDCG for heterogeneous assays, and the clear performance ordering are strengths that could guide future LLM and agent development in biology. The empirical focus on phenotypic rather than molecular endpoints aligns well with real-world needs.

major comments (2)
  1. [Abstract] The central claim that AssayBench offers a 'practical testbed for measuring progress toward in silico phenotypic screening' and virtual cell models rests on the assumption that gene-rank prediction with adjusted nDCG on these 1,920 CRISPR screens is a faithful proxy for the broader task of predicting perturbation effects in unseen contexts with heterogeneous phenotypic outputs. No validation, correlation analysis, or downstream utility study is provided to show that performance on this ranking task predicts real phenotypic screen outcomes.
  2. [Evaluation] Evaluation and Methods sections: The reported outperformance of zero-shot generalist LLMs and the performance gaps to ceilings cannot be fully assessed without explicit details on screen selection criteria, data splits to avoid leakage across the 1,920 assays, and the exact computation of adjusted nDCG (including any hyperparameters or post-processing). This makes it impossible to rule out post-hoc tuning or verify robustness of the ordering.

minor comments (2)
  1. [Abstract] The abstract introduces 'adjusted nDCG' without a one-sentence definition or reference, which reduces accessibility for readers outside information retrieval.
  2. A summary table listing the five phenotype classes, number of screens per class, and example endpoints would improve clarity on the benchmark's diversity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed and constructive comments on our manuscript. We have carefully considered each point and made revisions to improve clarity and address concerns regarding the benchmark's validity and reproducibility.

read point-by-point responses
  1. Referee: [Abstract] The central claim that AssayBench offers a 'practical testbed for measuring progress toward in silico phenotypic screening' and virtual cell models rests on the assumption that gene-rank prediction with adjusted nDCG on these 1,920 CRISPR screens is a faithful proxy for the broader task of predicting perturbation effects in unseen contexts with heterogeneous phenotypic outputs. No validation, correlation analysis, or downstream utility study is provided to show that performance on this ranking task predicts real phenotypic screen outcomes.

    Authors: We agree that a direct empirical validation linking performance on AssayBench to downstream real-world outcomes would strengthen the case for it as a proxy. Our work establishes AssayBench as a benchmark derived directly from 1,920 real phenotypic CRISPR screens, where the task is to rank genes according to their observed effects on the phenotype. This formulation captures the essence of in silico phenotypic screening. We have partially revised the manuscript by qualifying the language in the abstract and adding a new subsection in the Discussion that explicitly discusses the assumptions underlying the proxy, acknowledges the lack of correlation studies, and outlines plans for future work to validate against additional experimental data. revision: partial

  2. Referee: [Evaluation] Evaluation and Methods sections: The reported outperformance of zero-shot generalist LLMs and the performance gaps to ceilings cannot be fully assessed without explicit details on screen selection criteria, data splits to avoid leakage across the 1,920 assays, and the exact computation of adjusted nDCG (including any hyperparameters or post-processing). This makes it impossible to rule out post-hoc tuning or verify robustness of the ordering.

    Authors: We thank the referee for highlighting the need for greater transparency. We have revised the Methods and Evaluation sections to provide explicit details: screen selection criteria include requirements for sufficient statistical power (e.g., number of guides, replicates) and coverage across the five phenotype classes from public sources; data splits are performed at the assay level using stratified random partitioning with no shared genes or biological contexts between splits to prevent leakage; the adjusted nDCG formula is now fully specified with the adjustment term for varying assay lengths and relevance scores based on differential expression statistics, along with all hyperparameters and post-processing steps. An appendix with implementation details and additional ablation studies has been added to verify the robustness. revision: yes

Circularity Check

0 steps flagged

No significant circularity: empirical benchmark on external public data with independent performance comparisons

full rationale

The paper defines AssayBench from 1,920 publicly available external CRISPR screens, formulates the task as gene-rank prediction, and introduces adjusted nDCG as a new metric for heterogeneous assays. Central claims consist of empirical comparisons showing zero-shot generalist LLMs outperforming biology-specific LLMs and trainable baselines on held-out screens, with further gains from fine-tuning, ensembling, and prompt optimization. No derivation reduces by construction to fitted internal parameters, self-citations, or tautological renaming; the evaluation is self-contained against external benchmarks and does not invoke uniqueness theorems or load-bearing prior results from the same authors to justify the core findings.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper relies on standard assumptions that public CRISPR screens are reliable phenotypic readouts and that gene-rank prediction is a suitable proxy for phenotypic screening; no new free parameters, axioms, or invented entities are introduced beyond ordinary benchmark construction.

pith-pipeline@v0.9.0 · 5594 in / 1203 out tokens · 28247 ms · 2026-05-12T04:43:33.080712+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

91 extracted references · 91 canonical work pages · 12 internal anchors

  1. [1]

    Theoretical maximum capacity as benchmark for empty vehicle redistribution in personal rapid transit

    Transportation Research Record: Journal of the Transportation Research Board, 2010

  2. [4]

    Genome-wide single-cell perturbation screens with viperturb-seq

    Alexandra Bradu, John D Blair, Isabella N Grabski, Isabella Mascio, Junsuk Lee, Cecilia McCormick, and Rahul Satija. Genome-wide single-cell perturbation screens with viperturb-seq. bioRxiv, pages 2026--02, 2026

  3. [5]

    How to build the virtual cell with artificial intelligence: Priorities and opportunities

    Charlotte Bunne, Yusuf Roohani, Yanay Rosen, Ankit Gupta, Xikun Zhang, Marcel Roed, Theo Alexandrov, Mohammed AlQuraishi, Patricia Brennan, Daniel B Burkhardt, et al. How to build the virtual cell with artificial intelligence: Priorities and opportunities. Cell, 187(25): 7045--7063, 2024

  4. [6]

    Rational design of synthetic proteins using a genome-scale crispr screen

    Wells H Burrell, Simon J Mueller, Zharko Daniloski, P Duffy Doyle Jr, Anne B Rovsing, Christopher James, Max Drabkin, Chien-Yu Chou, Hei Yu Annika So, Lyla Katgara, et al. Rational design of synthetic proteins using a genome-scale crispr screen. bioRxiv, pages 2026--02, 2026

  5. [7]

    Reciprocal rank fusion outperforms condorcet and individual rank learning methods

    Gordon V Cormack, Charles LA Clarke, and Stefan Buettcher. Reciprocal rank fusion outperforms condorcet and individual rank learning methods. In Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval, pages 758--759, 2009

  6. [8]

    scgpt: toward building a foundation model for single-cell multi-omics using generative ai

    Haotian Cui, Chloe Wang, Hassaan Maan, Kuan Pang, Fengning Luo, Nan Duan, and Bo Wang. scGPT: toward building a foundation model for single-cell multi-omics using generative AI. Nature Methods, 21(8): 1470--1480, 2024

  7. [9]

    Systematic discovery of crispr-boosted car t cell immunotherapies

    Paul Datlinger, Eugenia V Pankevich, Cosmas D Arnold, Nicole Pranckevicius, Jenny Lin, Daria Romanovskaia, Moritz Schaefer, Francesco Piras, Anne-Christine Orts, Amelie Nemc, et al. Systematic discovery of CRISPR-boosted CAR T cell immunotherapies. Nature, 646(8086): 963--972, 2025

  8. [10]

    A new era of intelligence with gemini 3, Nov 2025

    Google. A new era of intelligence with gemini 3, Nov 2025. URL https://blog.google/products-and-platforms/products/gemini/gemini-3/

  9. [11]

    Biomni: A general-purpose biomedical ai agent

    Kexin Huang, Serena Zhang, Hanchen Wang, Yuanhao Qu, Yingzhou Lu, Yusuf Roohani, Ryan Li, Lin Qiu, Gavin Li, Junze Zhang, et al. Biomni: A general-purpose biomedical ai agent. biorxiv, 2025

  10. [13]

    Virus-like particles enable targeted gene engineering and pooled crispr screening in primary human myeloid cells

    Hyuncheol Jung, Pascal Devant, Carter Ching, Mineto Ota, Jennifer Hamilton, Zachary Steinhart, Wayne Ngo, Luis Sandoval, Jae Hyung Jung, Da Xu, et al. Virus-like particles enable targeted gene engineering and pooled crispr screening in primary human myeloid cells. bioRxiv, pages 2025--12, 2025

  11. [15]

    Gene-embedding-based prediction and functional evaluation of perturbation expression responses with presage

    Russell Littman, Jacob Levine, Sepideh Maleki, Yongju Lee, Vladimir Ermakov, Lin Qiu, Alexander Wu, Kexin Huang, Romain Lopez, Gabriele Scalia, et al. Gene-embedding-based prediction and functional evaluation of perturbation expression responses with presage. bioRxiv, pages 2025--06, 2025

  12. [18]

    The biogrid database: A comprehensive biomedical resource of curated protein, genetic, and chemical interactions

    Rose Oughtred, Jennifer Rust, Christie Chang, Bobby-Joe Breitkreutz, Chris Stark, Andrew Willems, Lorrie Boucher, Genie Leung, Nadine Kolas, Frederick Zhang, et al. The BioGRID database: A comprehensive biomedical resource of curated protein, genetic, and chemical interactions. Protein Science, 30(1): 187--200, 2021

  13. [19]

    scperturb: harmonized single-cell perturbation data

    Stefan Peidli, Tessa D Green, Ciyue Shen, Torsten Gross, Joseph Min, Samuele Garda, Bo Yuan, Linus J Schumacher, Jake P Taylor-King, Debora S Marks, et al. scPerturb: harmonized single-cell perturbation data. Nature Methods, 21(3): 531--540, 2024

  14. [21]

    Disentanglement of single-cell data with biolord

    Zoe Piran, Niv Cohen, Yedid Hoshen, and Mor Nitzan. Disentanglement of single-cell data with biolord. Nature Biotechnology, 42(11): 1678--1683, 2024

  15. [22]

    Qwen3.5: Towards native multimodal agents, February 2026

    Qwen Team. Qwen3.5: Towards native multimodal agents, February 2026. URL https://qwen.ai/blog?id=qwen3.5

  16. [23]

    Gpqa: A graduate-level google-proof q&a benchmark

    David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R Bowman. Gpqa: A graduate-level google-proof q&a benchmark. In First conference on language modeling, 2024

  17. [24]

    Mapping information-rich genotype-phenotype landscapes with genome-scale perturb-seq

    Joseph M Replogle, Reuben A Saunders, Angela N Pogson, Jeffrey A Hussmann, Alexander Lenail, Alina Guna, Lauren Mascibroda, Eric J Wagner, Karen Adelman, Gila Lithwick-Yanai, et al. Mapping information-rich genotype-phenotype landscapes with genome-scale Perturb-seq. Cell, 185(14): 2559--2575, 2022

  18. [25]

    Scaling large language models for next-generation single-cell analysis

    Syed Asad Rizvi, Daniel Levine, Aakash Patel, Shiyang Zhang, Eric Wang, Curtis Jamison Perry, Ivan Vrkic, Nicole Mayerli Constante, Zirui Fu, Sizhuang He, et al. Scaling large language models for next-generation single-cell analysis. BioRxiv, pages 2025--04, 2026

  19. [26]

    Predicting transcriptional outcomes of novel multigene perturbations with gears

    Yusuf Roohani, Kexin Huang, and Jure Leskovec. Predicting transcriptional outcomes of novel multigene perturbations with GEARS. Nature Biotechnology, 42(6): 927--935, 2024

  20. [27]

    Virtual cell challenge: Toward a turing test for the virtual cell

    Yusuf H Roohani, Tony J Hua, Po-Yuan Tung, Lexi R Bounds, Feiqiao B Yu, Alexander Dobin, Noam Teyssier, Abhinav Adduri, Alden Woodrow, Brian S Plosky, et al. Virtual Cell Challenge: Toward a Turing test for the virtual cell. Cell, 188(13): 3370--3374, 2025

  21. [29]

    Openevolve: an open-source evolutionary coding agent, 2025

    Asankhaya Sharma. Openevolve: an open-source evolutionary coding agent, 2025. URL https://github.com/algorithmicsuperintelligence/openevolve

  22. [31]

    Virtual CRISPR: Can LLMs Predict CRISPR Screen Results?

    Steven Song, Abdalla Abdrabou, Asmita Dabholkar, Kastan Day, Pavan Dharmoju, Jason Perera, Volodymyr Kindratenko, and Aly Khan. Virtual crispr: Can llms predict crispr screen results? In Proceedings of the 24th Workshop on Biomedical Language Processing, pages 354--364, 2025

  23. [32]

    Rxrx1: A dataset for evaluating experimental batch correction methods

    Maciej Sypetkowski, Morteza Rezanejad, Saber Saberian, Oren Kraus, John Urbanik, James Taylor, Ben Mabey, Mason Victors, Jason Yosinski, Alborz Rezazadeh Sereshkeh, et al. Rxrx1: A dataset for evaluating experimental batch correction methods. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4285--4294, 2023

  24. [33]

    Phenotypic drug discovery: recent successes, lessons learned and new directions

    Fabien Vincent, Arsenio Nueda, Jonathan Lee, Monica Schenone, Marco Prunotto, and Mark Mercola. Phenotypic drug discovery: recent successes, lessons learned and new directions. Nature Reviews Drug Discovery, 21(12): 899--914, 2022

  25. [34]

    Genetic screens in human cells using the crispr-cas9 system

    Tim Wang, Jenny J Wei, David M Sabatini, and Eric S Lander. Genetic screens in human cells using the CRISPR-Cas9 system. Science, 343(6166): 80--84, 2014

  26. [35]

    A theoretical analysis of ndcg type ranking measures

    Yining Wang, Liwei Wang, Yuanzhi Li, Di He, and Tie-Yan Liu. A theoretical analysis of ndcg type ranking measures. In Conference on learning theory, pages 25--54. PMLR, 2013

  27. [38]

    scbasecount: an ai agent-curated, uniformly processed, and continually expanding single cell data repository

    Nicholas D Youngblut, Christopher Carpenter, Jaanak Prashar, Chiara Ricci-Tam, Rajesh Ilango, Noam Teyssier, Silvana Konermann, Patrick D Hsu, Alexander Dobin, David P Burke, et al. scbasecount: an ai agent-curated, uniformly processed, and continually expanding single cell data repository. bioRxiv, pages 2025--02, 2025

  28. [39]

    Deep sets

    Manzil Zaheer, Satwik Kottur, Siamak Ravanbakhsh, Barnabas Poczos, Russ R Salakhutdinov, and Alexander J Smola. Deep sets. Advances in neural information processing systems, 30, 2017

  29. [41]

    Genome-scale perturb-seq in primary human cd4+ t cells maps context-specific regulators of t cell programs and human immune traits

    Ronghui Zhu, Emma Dann, Jun Yan, Justine Reyes Retana, Ryunosuke Goto, Reese C Guitche, Lillian K Petersen, Mineto Ota, Jonathan K Pritchard, and Alexander Marson. Genome-scale perturb-seq in primary human cd4+ t cells maps context-specific regulators of t cell programs and human immune traits. bioRxiv, pages 2025--12, 2025

  30. [42]

    How to build the virtual cell with artificial intelligence: Priorities and opportunities

    Cell, 2024

  31. [43]

    scPerturb: harmonized single-cell perturbation data

    Nature Methods, 2024

  32. [44]

    Phenotypic drug discovery: recent successes, lessons learned and new directions

    Nature Reviews Drug Discovery, 2022

  33. [45]

    Perturbench: Benchmarking machine learning models for cellular perturbation analysis

    arXiv preprint arXiv:2408.10609, 2025

  34. [46]

    Cumulated gain-based evaluation of IR techniques

    ACM Transactions on Information Systems (TOIS), 2002

  35. [47]

    Comparing partitions

    Journal of Classification, 1985

  36. [48]

    Scaling large language models for next-generation single-cell analysis

    bioRxiv, 2026

  37. [49]

    A coefficient of agreement for nominal scales

    Educational and Psychological Measurement, 1960

  38. [50]

    Joint upper & expected value normalization for evaluation of retrieval systems: A case study with Learning-to-Rank methods

    Information Processing & Management, 2023

  39. [51]

    A theoretical analysis of NDCG type ranking measures

    Conference on Learning Theory, 2013

  40. [52]

    The BioGRID database: A comprehensive biomedical resource of curated protein, genetic, and chemical interactions

    Protein Science, 2021

  41. [53]

    Bixbench: a comprehensive benchmark for LLM-based agents in computational biology

    arXiv preprint arXiv:2503.00096

  42. [54]

    Benchmarking large language models on multiple tasks in bioinformatics NLP with prompting

    arXiv preprint arXiv:2503.04013

  43. [55]

    GPQA: A graduate-level Google-proof Q&A benchmark

    First Conference on Language Modeling, 2024

  44. [56]

    Humanity's Last Exam

    arXiv preprint arXiv:2501.14249

  45. [57]

    Virtual CRISPR: Can LLMs Predict CRISPR Screen Results?

    Proceedings of the 24th Workshop on Biomedical Language Processing, 2025

  46. [58]

    Contextualizing biological perturbation experiments through language

    arXiv preprint arXiv:2502.21290

  47. [59]

    scBaseCount: an AI agent-curated, uniformly processed, and continually expanding single cell data repository

    bioRxiv, 2025

  48. [60]

    Tahoe-100m: A giga-scale single-cell perturbation atlas for context-dependent gene function and cellular modeling

    bioRxiv, 2025

  49. [61]

    Mapping information-rich genotype-phenotype landscapes with genome-scale Perturb-seq

    Cell, 2022

  50. [62]

    Virtual Cell Challenge: Toward a Turing test for the virtual cell

    Cell, 2025

  51. [63]

    AI-Guided CRISPR Screen Accelerates Discovery of New Drug Targets

    bioRxiv, 2026

  52. [64]

    Rxrx1: A dataset for evaluating experimental batch correction methods

    Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

  53. [65]

    Generative Artificial Intelligence for Biology: Toward Unifying Models, Algorithms, and Modalities

  54. [66]

    Virtual Cells Need Context, Not Just Scale

    bioRxiv, 2026

  55. [67]

    Predicting transcriptional outcomes of novel multigene perturbations with GEARS

    Nature Biotechnology, 2024

  56. [68]

    Active learning framework leveraging transcriptomics identifies modulators of disease phenotypes

    Science, 2025

  57. [69]

    Deep-learning-based gene perturbation effect prediction does not yet outperform simple linear baselines

    Nature Methods, 2025

  58. [70]

    Defining and benchmarking open problems in single-cell analysis

    Nature Biotechnology, 2025

  59. [71]

    A new era of intelligence with gemini 3

    Google, Nov 2025

  60. [72]

    OpenAI GPT-5 System Card

    arXiv preprint arXiv:2601.03267

  61. [73]

    GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning

    arXiv preprint arXiv:2507.19457

  62. [74]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    arXiv preprint arXiv:2402.03300

  63. [75]

    gpt-oss-120b & gpt-oss-20b Model Card

    arXiv preprint arXiv:2508.10925

  64. [76]

    Biomni: A general-purpose biomedical AI agent

    bioRxiv, 2025

  65. [77]

    Reciprocal rank fusion outperforms condorcet and individual rank learning methods

    Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval, 2009

  66. [78]

    We Built an Open Source Autonomous Drug Discovery Agent

  67. [79]

    Gene-embedding-based prediction and functional evaluation of perturbation expression responses with PRESAGE

    bioRxiv, 2025

  68. [80]

    The PageRank citation ranking: Bringing order to the web

    1999

  69. [81]

    The probabilistic relevance framework: BM25 and beyond

    2009

  70. [82]

    A human protein-protein interaction network: a resource for annotating the proteome

    Cell, 2005

  71. [83]

    Qwen3 Technical Report

    arXiv preprint arXiv:2505.09388

  72. [84]

    DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models

    arXiv preprint arXiv:2512.02556

  73. [85]

    GLM-5: from Vibe Coding to Agentic Engineering

    arXiv preprint arXiv:2602.15763

  74. [86]

    Kimi K2.5: Visual Agentic Intelligence

    arXiv preprint arXiv:2602.02276

  75. [87]

    Qwen3-coder-next technical report

    arXiv preprint arXiv:2603.00729, 2026

  76. [88]

    Olmo 3

    arXiv preprint arXiv:2512.13961

  77. [89]

    Deep sets

    Advances in Neural Information Processing Systems, 30, 2017

  78. [90]

    Cell2Sentence: teaching large language models the language of biology

    bioRxiv

  79. [91]

    AlphaEvolve: A coding agent for scientific and algorithmic discovery

    arXiv preprint arXiv:2506.13131

  80. [92]

    Anthropic, 2025

Showing first 80 references.