pith. machine review for the scientific record.

arxiv: 2605.10876 · v1 · submitted 2026-05-11 · 💻 cs.LG · cs.AI · q-bio.QM

Recognition: no theorem link

AssayBench: An Assay-Level Virtual Cell Benchmark for LLMs and Agents

Alexander Wu, Carl Edwards, David Richmond, Edward De Brouwer, Ehsan Hajiramezanali, Gabriele Scalia, Graham Heimberg, Jan-Christian Hütter, Jenna Collier, Meena Subramaniam, Sara Mostafavi, Xiner Li

Pith reviewed 2026-05-12 04:43 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · q-bio.QM
keywords AssayBench · CRISPR screens · phenotypic screening · gene rank prediction · virtual cell · LLM benchmark · adjusted nDCG · in silico screening

The pith

Zero-shot generalist LLMs outperform biology-specific models and baselines on a benchmark of 1,920 CRISPR screens for gene-rank prediction.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds AssayBench from 1,920 public CRISPR screens across five phenotype classes and turns each screen into a gene-ranking task scored by adjusted nDCG. It then tests whether current LLMs and agents can predict which genes, when perturbed, would produce a given cellular phenotype. A sympathetic reader cares because this setup directly targets the in silico phenotypic screen that would let a virtual cell model replace many wet-lab experiments. The results show that off-the-shelf generalist LLMs already beat both domain-specific LLMs and trainable baselines in the zero-shot setting, while fine-tuning, ensembling, and prompt tuning raise performance further, though all methods remain well below an estimated performance ceiling.
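To make the query-and-parse loop such a benchmark implies tangible, here is a minimal sketch. The prompt shape, function names, and parsing rules are illustrative assumptions, not the paper's actual harness:

```python
def build_ranking_prompt(assay_description, criterion, genes):
    # Hypothetical prompt shape; the paper's actual prompts are its own.
    return (
        f"Screen: {assay_description}\n"
        f"Criterion: {criterion}\n"
        "Rank ALL of the following genes from strongest to weakest expected "
        "effect, one symbol per line:\n" + "\n".join(sorted(genes))
    )

def parse_ranking(model_output, genes):
    """Parse a model's free-text answer into a ranking over the known gene
    universe; genes the model never mentions are appended in sorted order so
    every screen yields a complete ranking that a metric can score."""
    universe = set(genes)
    seen, ranking = set(), []
    for token in model_output.replace(",", "\n").split():
        symbol = token.strip().upper()
        if symbol in universe and symbol not in seen:
            seen.add(symbol)
            ranking.append(symbol)
    ranking.extend(sorted(universe - seen))
    return ranking
```

Completing unmentioned genes deterministically keeps rankings comparable across models that answer with different verbosity.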

Core claim

AssayBench formulates phenotypic screen prediction as gene-rank prediction on 1,920 CRISPR assays and introduces adjusted nDCG to compare results across heterogeneous readouts. Zero-shot generalist LLMs exceed biology-specific LLMs and trainable baselines on this task; optimization techniques such as fine-tuning, ensembling, and prompt optimization raise scores further, yet all evaluated methods stay far from the empirically estimated ceiling.

What carries the argument

AssayBench, which converts each CRISPR screen into a gene-ranking problem scored by adjusted nDCG to enable consistent comparison across diverse phenotypic assays.
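As a concrete illustration of the scoring idea, here is a small nDCG@k sketch with a random-ranking adjustment. The rescaling below (subtracting the expected DCG of a uniformly random ranking) is an assumption chosen so that random scores near 0 and perfect scores 1; the paper defines its own adjustment term for heterogeneous assays.

```python
import math

def dcg(rels):
    # Discounted cumulative gain: graded relevance discounted by log2 of position.
    return sum(r / math.log2(i + 2) for i, r in enumerate(rels))

def adjusted_ndcg_at_k(ranked_genes, relevance, k=100):
    """nDCG@k rescaled against the expected score of a random ranking.

    `relevance` maps each gene in the screen to its observed hit strength.
    The adjustment here is illustrative, not the paper's exact formula.
    """
    rels = [relevance.get(g, 0.0) for g in ranked_genes[:k]]
    ideal = sorted(relevance.values(), reverse=True)[:k]
    # Expected DCG of a random ranking: mean relevance at every discounted slot.
    mean_rel = sum(relevance.values()) / len(relevance)
    expected = mean_rel * sum(1 / math.log2(i + 2) for i in range(len(ideal)))
    actual, best = dcg(rels), dcg(ideal)
    return (actual - expected) / (best - expected)
```

Under this rescaling a perfect ranking scores exactly 1 and a ranking that buries the hits can go below 0, which makes scores from screens with very different hit-rate profiles roughly comparable.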

If this is right

  • LLMs become a practical starting point for in silico phenotypic screens that could guide drug-discovery workflows.
  • AssayBench supplies a concrete, public yardstick for measuring incremental progress toward virtual-cell models.
  • Optimization methods already known for language tasks transfer directly to this biological ranking setting.
  • Continued improvement on the benchmark would narrow the gap between current models and the performance ceiling observed in the data.
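The ensembling lever can be made concrete with reciprocal rank fusion (Cormack et al., 2009, which the paper cites), a simple rule for merging several models' gene rankings. This is a sketch of one standard fusion method; the paper tunes its ensemble functions with Bayesian optimization rather than fixed RRF.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked gene lists: each list contributes 1/(k + rank)
    per gene, and genes are re-sorted by their summed score. `k=60` is the
    constant suggested by Cormack et al.; it damps any single top rank."""
    scores = {}
    for ranking in rankings:
        for pos, gene in enumerate(ranking):
            scores[gene] = scores.get(gene, 0.0) + 1.0 / (k + pos + 1)
    return sorted(scores, key=scores.get, reverse=True)
```

Because RRF uses only rank positions, it needs no score calibration across models, which is convenient when fusing LLM outputs with heuristic baselines.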

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • If generalist models keep improving on AssayBench, they may reduce the number of preliminary wet-lab screens required before moving to targeted experiments.
  • The benchmark could be extended to other perturbation modalities, such as small-molecule screens, once suitable public data become available.
  • Success on gene ranking does not yet prove the model understands causal mechanisms, so follow-up work would need to test whether top-ranked genes actually produce the predicted phenotype in new contexts.

Load-bearing premise

That turning these 1,920 public CRISPR screens into gene-rank prediction tasks with adjusted nDCG gives a faithful stand-in for the full phenotypic-screening capability a working virtual cell would need.

What would settle it

A new, held-out CRISPR screen performed in an unseen cell type or condition where the genes ranked highest by the best AssayBench model show no measurable phenotypic effect when actually perturbed.

Figures

Figures reproduced from arXiv: 2605.10876 by Alexander Wu, Carl Edwards, David Richmond, Edward De Brouwer, Ehsan Hajiramezanali, Gabriele Scalia, Graham Heimberg, Jan-Christian Hütter, Jenna Collier, Meena Subramaniam, Sara Mostafavi, Xiner Li.

Figure 1
Figure 1: Overview of the ASSAYBENCH benchmark creation. (A) Starting from 1971 human CRISPR screens, we perform data quality filtering, replicate merging, and data augmentation to obtain 1920 high-quality screens. (B) Phenotype composition of the database and its four splits. A realistic but challenging temporal split was used. (C) Given a description of the screen and a gene-ranking criterion, a model must provide … view at source ↗
Figure 2
Figure 2: (Left) AnDCG@k for the main models, colored by model category. (Right) Comparison of GEMINI 3 PRO performance with a technical-replicate baseline (Section 5.2; N = 32 technical-replicate screens). 5.2 Top-performing models remain far from the performance ceiling. Biology is inherently stochastic and experiments introduce further technical variability, raising the question of what performance ceiling can reasonabl… view at source ↗
Figure 3
Figure 3 stratifies AnDCG@100 of different models by phenotype. Predictive performance was highest for the viability screens, likely because their hit genes are enriched for conserved cellular dependencies that recur across screens. This recurrence also potentially explains why the phenotype-based frequency baseline is particularly strong for this class of screens. Other phenotypes, such as host-pathogen response o… view at source ↗
Figure 4
Figure 4: (Left) Scaling trend analysis on the Qwen3.5 family. Larger models lead to higher … view at source ↗
Figure 5
Figure 5: Biological bias of different models across gene sets. ASSAYBENCH is, to our knowledge, the first large-scale benchmark for phenotypic screen prediction. It also provides a testbed for evaluating LLMs and agents as surrogates for virtual cells, supporting progress in this area. A key design choice is to cast each assay as a single ranking problem rather than issuing one query per gene: with an average of … view at source ↗
Figure 6
Figure 6: Validation vs. test performance of ensemble functions produced by Bayesian optimization. view at source ↗
Figure 7
Figure 7: Qwen ∼8B model performance over time. view at source ↗
read the original abstract

Recent advances in machine learning and large-scale biological data collections have revived the prospect of building a virtual cell, a computational model of cellular behavior that could accelerate biological discovery. One of the most compelling promises of this vision is the ability to perform in silico phenotypic screens, in which a model predicts the effects of cellular perturbations in unseen biological contexts. This task combines heterogeneous textual inputs with diverse phenotypic outputs, making it particularly well-suited to LLMs and agentic systems. Yet, no standard benchmark currently exists for this task, as existing efforts focus on narrower molecular readouts that are only indirectly aligned with the phenotypic endpoints driving many real-world drug discovery workflows. In this work, we present AssayBench, a benchmark for phenotypic screen prediction, built from 1,920 publicly available CRISPR screens spanning five broad classes of cellular phenotypes. We formulate the screen prediction task as a gene rank prediction for each screen and introduce the adjusted nDCG, a continuous metric for comparing performance across heterogeneous assays. Our extensive evaluation shows that existing methods remain far from empirically estimated performance ceilings and zero-shot generalist LLMs outperform biology-specific LLMs and trainable baselines. Optimization techniques such as fine-tuning, ensembling, and prompt optimization can further improve LLM performance on this task. Overall, AssayBench offers a practical testbed for measuring progress toward in silico phenotypic screening and, more broadly, virtual cell models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces AssayBench, a benchmark for phenotypic screen prediction in virtual cell models, constructed from 1,920 publicly available CRISPR screens spanning five phenotype classes. It formulates the task as per-assay gene ranking and introduces adjusted nDCG as the evaluation metric. Extensive experiments show that zero-shot generalist LLMs outperform biology-specific LLMs and trainable baselines on this task, with further gains possible via fine-tuning, ensembling, and prompt optimization, while all methods remain far from empirically estimated performance ceilings.

Significance. If the benchmark is shown to be a valid proxy, this work would provide a much-needed standardized testbed for in silico phenotypic screening, a key capability for virtual cell models in drug discovery. The use of public data, the introduction of adjusted nDCG for heterogeneous assays, and the clear performance ordering are strengths that could guide future LLM and agent development in biology. The empirical focus on phenotypic rather than molecular endpoints aligns well with real-world needs.

major comments (2)
  1. [Abstract] The central claim that AssayBench offers a 'practical testbed for measuring progress toward in silico phenotypic screening' and virtual cell models rests on the assumption that gene-rank prediction with adjusted nDCG on these 1,920 CRISPR screens is a faithful proxy for the broader task of predicting perturbation effects in unseen contexts with heterogeneous phenotypic outputs. No validation, correlation analysis, or downstream utility study is provided to show that performance on this ranking task predicts real phenotypic screen outcomes.
  2. [Evaluation] Evaluation and Methods sections: The reported outperformance of zero-shot generalist LLMs and the performance gaps to ceilings cannot be fully assessed without explicit details on screen selection criteria, data splits to avoid leakage across the 1,920 assays, and the exact computation of adjusted nDCG (including any hyperparameters or post-processing). This makes it impossible to rule out post-hoc tuning or verify robustness of the ordering.

minor comments (2)
  1. [Abstract] The abstract introduces 'adjusted nDCG' without a one-sentence definition or reference, which reduces accessibility for readers outside information retrieval.
  2. A summary table listing the five phenotype classes, number of screens per class, and example endpoints would improve clarity on the benchmark's diversity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed and constructive comments on our manuscript. We have carefully considered each point and made revisions to improve clarity and address concerns regarding the benchmark's validity and reproducibility.

read point-by-point responses
  1. Referee: [Abstract] The central claim that AssayBench offers a 'practical testbed for measuring progress toward in silico phenotypic screening' and virtual cell models rests on the assumption that gene-rank prediction with adjusted nDCG on these 1,920 CRISPR screens is a faithful proxy for the broader task of predicting perturbation effects in unseen contexts with heterogeneous phenotypic outputs. No validation, correlation analysis, or downstream utility study is provided to show that performance on this ranking task predicts real phenotypic screen outcomes.

    Authors: We agree that a direct empirical validation linking performance on AssayBench to downstream real-world outcomes would strengthen the case for it as a proxy. Our work establishes AssayBench as a benchmark derived directly from 1,920 real phenotypic CRISPR screens, where the task is to rank genes according to their observed effects on the phenotype. This formulation captures the essence of in silico phenotypic screening. We have partially revised the manuscript by qualifying the language in the abstract and adding a new subsection in the Discussion that explicitly discusses the assumptions underlying the proxy, acknowledges the lack of correlation studies, and outlines plans for future work to validate against additional experimental data. revision: partial

  2. Referee: [Evaluation] Evaluation and Methods sections: The reported outperformance of zero-shot generalist LLMs and the performance gaps to ceilings cannot be fully assessed without explicit details on screen selection criteria, data splits to avoid leakage across the 1,920 assays, and the exact computation of adjusted nDCG (including any hyperparameters or post-processing). This makes it impossible to rule out post-hoc tuning or verify robustness of the ordering.

    Authors: We thank the referee for highlighting the need for greater transparency. We have revised the Methods and Evaluation sections to provide explicit details: screen selection criteria include requirements for sufficient statistical power (e.g., number of guides, replicates) and coverage across the five phenotype classes from public sources; data splits are performed at the assay level using stratified random partitioning with no shared genes or biological contexts between splits to prevent leakage; the adjusted nDCG formula is now fully specified with the adjustment term for varying assay lengths and relevance scores based on differential expression statistics, along with all hyperparameters and post-processing steps. An appendix with implementation details and additional ablation studies has been added to verify the robustness. revision: yes

Circularity Check

0 steps flagged

No significant circularity: empirical benchmark on external public data with independent performance comparisons

full rationale

The paper defines AssayBench from 1,920 publicly available external CRISPR screens, formulates the task as gene-rank prediction, and introduces adjusted nDCG as a new metric for heterogeneous assays. Central claims consist of empirical comparisons showing zero-shot generalist LLMs outperforming biology-specific LLMs and trainable baselines on held-out screens, with further gains from fine-tuning, ensembling, and prompt optimization. No derivation reduces by construction to fitted internal parameters, self-citations, or tautological renaming; the evaluation is self-contained against external benchmarks and does not invoke uniqueness theorems or load-bearing prior results from the same authors to justify the core findings.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper relies on standard assumptions that public CRISPR screens are reliable phenotypic readouts and that gene-rank prediction is a suitable proxy for phenotypic screening; no new free parameters, axioms, or invented entities are introduced beyond ordinary benchmark construction.

pith-pipeline@v0.9.0 · 5594 in / 1203 out tokens · 28247 ms · 2026-05-12T04:43:33.080712+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

91 extracted references · 91 canonical work pages · 12 internal anchors

  1. [1]

    Theoretical maximum capacity as benchmark for empty vehicle redistribution in personal rapid transit

    Transportation Research Record: Journal of the Transportation Research Board, 2010

  2. [4]

    Genome-wide single-cell perturbation screens with viperturb-seq

    Alexandra Bradu, John D Blair, Isabella N Grabski, Isabella Mascio, Junsuk Lee, Cecilia McCormick, and Rahul Satija. Genome-wide single-cell perturbation screens with viperturb-seq. bioRxiv, pages 2026--02, 2026

  3. [5]

    How to build the virtual cell with artificial intelligence: Priorities and opportunities

    Charlotte Bunne, Yusuf Roohani, Yanay Rosen, Ankit Gupta, Xikun Zhang, Marcel Roed, Theo Alexandrov, Mohammed AlQuraishi, Patricia Brennan, Daniel B Burkhardt, et al. How to build the virtual cell with artificial intelligence: Priorities and opportunities. Cell, 187(25): 7045--7063, 2024

  4. [6]

    Rational design of synthetic proteins using a genome-scale crispr screen

    Wells H Burrell, Simon J Mueller, Zharko Daniloski, P Duffy Doyle Jr, Anne B Rovsing, Christopher James, Max Drabkin, Chien-Yu Chou, Hei Yu Annika So, Lyla Katgara, et al. Rational design of synthetic proteins using a genome-scale crispr screen. bioRxiv, pages 2026--02, 2026

  5. [7]

    Reciprocal rank fusion outperforms condorcet and individual rank learning methods

    Gordon V Cormack, Charles LA Clarke, and Stefan Buettcher. Reciprocal rank fusion outperforms condorcet and individual rank learning methods. In Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval, pages 758--759, 2009

  6. [8]

    scgpt: toward building a foundation model for single-cell multi-omics using generative ai

    Haotian Cui, Chloe Wang, Hassaan Maan, Kuan Pang, Fengning Luo, Nan Duan, and Bo Wang. scGPT: toward building a foundation model for single-cell multi-omics using generative AI. Nature Methods, 21(8): 1470--1480, 2024

  7. [9]

    Systematic discovery of crispr-boosted car t cell immunotherapies

    Paul Datlinger, Eugenia V Pankevich, Cosmas D Arnold, Nicole Pranckevicius, Jenny Lin, Daria Romanovskaia, Moritz Schaefer, Francesco Piras, Anne-Christine Orts, Amelie Nemc, et al. Systematic discovery of CRISPR-boosted CAR T cell immunotherapies. Nature, 646(8086): 963--972, 2025

  8. [10]

    A new era of intelligence with gemini 3, Nov 2025

    Google. A new era of intelligence with gemini 3, Nov 2025. URL https://blog.google/products-and-platforms/products/gemini/gemini-3/

  9. [11]

    Biomni: A general-purpose biomedical ai agent

    Kexin Huang, Serena Zhang, Hanchen Wang, Yuanhao Qu, Yingzhou Lu, Yusuf Roohani, Ryan Li, Lin Qiu, Gavin Li, Junze Zhang, et al. Biomni: A general-purpose biomedical ai agent. biorxiv, 2025

  10. [13]

    Virus-like particles enable targeted gene engineering and pooled crispr screening in primary human myeloid cells

    Hyuncheol Jung, Pascal Devant, Carter Ching, Mineto Ota, Jennifer Hamilton, Zachary Steinhart, Wayne Ngo, Luis Sandoval, Jae Hyung Jung, Da Xu, et al. Virus-like particles enable targeted gene engineering and pooled crispr screening in primary human myeloid cells. bioRxiv, pages 2025--12, 2025

  11. [15]

    Gene-embedding-based prediction and functional evaluation of perturbation expression responses with presage

    Russell Littman, Jacob Levine, Sepideh Maleki, Yongju Lee, Vladimir Ermakov, Lin Qiu, Alexander Wu, Kexin Huang, Romain Lopez, Gabriele Scalia, et al. Gene-embedding-based prediction and functional evaluation of perturbation expression responses with presage. bioRxiv, pages 2025--06, 2025

  12. [18]

    The biogrid database: A comprehensive biomedical resource of curated protein, genetic, and chemical interactions

    Rose Oughtred, Jennifer Rust, Christie Chang, Bobby-Joe Breitkreutz, Chris Stark, Andrew Willems, Lorrie Boucher, Genie Leung, Nadine Kolas, Frederick Zhang, et al. The BioGRID database: A comprehensive biomedical resource of curated protein, genetic, and chemical interactions. Protein Science, 30(1): 187--200, 2021

  13. [19]

    scperturb: harmonized single-cell perturbation data

    Stefan Peidli, Tessa D Green, Ciyue Shen, Torsten Gross, Joseph Min, Samuele Garda, Bo Yuan, Linus J Schumacher, Jake P Taylor-King, Debora S Marks, et al. scPerturb: harmonized single-cell perturbation data. Nature Methods, 21(3): 531--540, 2024

  14. [21]

    Disentanglement of single-cell data with biolord

    Zoe Piran, Niv Cohen, Yedid Hoshen, and Mor Nitzan. Disentanglement of single-cell data with biolord. Nature Biotechnology, 42(11): 1678--1683, 2024

  15. [22]

    Qwen3.5: Towards native multimodal agents, February 2026

    Qwen Team. Qwen3.5: Towards native multimodal agents, February 2026. URL https://qwen.ai/blog?id=qwen3.5

  16. [23]

    Gpqa: A graduate-level google-proof q&a benchmark

    David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R Bowman. Gpqa: A graduate-level google-proof q&a benchmark. In First conference on language modeling, 2024

  17. [24]

    Mapping information-rich genotype-phenotype landscapes with genome-scale perturb-seq

    Joseph M Replogle, Reuben A Saunders, Angela N Pogson, Jeffrey A Hussmann, Alexander Lenail, Alina Guna, Lauren Mascibroda, Eric J Wagner, Karen Adelman, Gila Lithwick-Yanai, et al. Mapping information-rich genotype-phenotype landscapes with genome-scale Perturb-seq. Cell, 185(14): 2559--2575, 2022

  18. [25]

    Scaling large language models for next-generation single-cell analysis

    Syed Asad Rizvi, Daniel Levine, Aakash Patel, Shiyang Zhang, Eric Wang, Curtis Jamison Perry, Ivan Vrkic, Nicole Mayerli Constante, Zirui Fu, Sizhuang He, et al. Scaling large language models for next-generation single-cell analysis. BioRxiv, pages 2025--04, 2026

  19. [26]

    Predicting transcriptional outcomes of novel multigene perturbations with gears

    Yusuf Roohani, Kexin Huang, and Jure Leskovec. Predicting transcriptional outcomes of novel multigene perturbations with GEARS. Nature Biotechnology, 42(6): 927--935, 2024

  20. [27]

    Virtual cell challenge: Toward a turing test for the virtual cell

    Yusuf H Roohani, Tony J Hua, Po-Yuan Tung, Lexi R Bounds, Feiqiao B Yu, Alexander Dobin, Noam Teyssier, Abhinav Adduri, Alden Woodrow, Brian S Plosky, et al. Virtual Cell Challenge: Toward a Turing test for the virtual cell. Cell, 188(13): 3370--3374, 2025

  21. [29]

    Openevolve: an open-source evolutionary coding agent, 2025

    Asankhaya Sharma. Openevolve: an open-source evolutionary coding agent, 2025. URL https://github.com/algorithmicsuperintelligence/openevolve

  22. [31]

    Virtual CRISPR: Can LLMs Predict CRISPR Screen Results?

    Steven Song, Abdalla Abdrabou, Asmita Dabholkar, Kastan Day, Pavan Dharmoju, Jason Perera, Volodymyr Kindratenko, and Aly Khan. Virtual crispr: Can llms predict crispr screen results? In Proceedings of the 24th Workshop on Biomedical Language Processing, pages 354--364, 2025

  23. [32]

    Rxrx1: A dataset for evaluating experimental batch correction methods

    Maciej Sypetkowski, Morteza Rezanejad, Saber Saberian, Oren Kraus, John Urbanik, James Taylor, Ben Mabey, Mason Victors, Jason Yosinski, Alborz Rezazadeh Sereshkeh, et al. Rxrx1: A dataset for evaluating experimental batch correction methods. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4285--4294, 2023

  24. [33]

    Phenotypic drug discovery: recent successes, lessons learned and new directions

    Fabien Vincent, Arsenio Nueda, Jonathan Lee, Monica Schenone, Marco Prunotto, and Mark Mercola. Phenotypic drug discovery: recent successes, lessons learned and new directions. Nature Reviews Drug Discovery, 21(12): 899--914, 2022

  25. [34]

    Genetic screens in human cells using the crispr-cas9 system

    Tim Wang, Jenny J Wei, David M Sabatini, and Eric S Lander. Genetic screens in human cells using the CRISPR-Cas9 system. Science, 343(6166): 80--84, 2014

  26. [35]

    A theoretical analysis of ndcg type ranking measures

    Yining Wang, Liwei Wang, Yuanzhi Li, Di He, and Tie-Yan Liu. A theoretical analysis of ndcg type ranking measures. In Conference on learning theory, pages 25--54. PMLR, 2013

  27. [38]

    scbasecount: an ai agent-curated, uniformly processed, and continually expanding single cell data repository

    Nicholas D Youngblut, Christopher Carpenter, Jaanak Prashar, Chiara Ricci-Tam, Rajesh Ilango, Noam Teyssier, Silvana Konermann, Patrick D Hsu, Alexander Dobin, David P Burke, et al. scbasecount: an ai agent-curated, uniformly processed, and continually expanding single cell data repository. bioRxiv, pages 2025--02, 2025

  28. [39]

    Deep sets

    Manzil Zaheer, Satwik Kottur, Siamak Ravanbakhsh, Barnabas Poczos, Russ R Salakhutdinov, and Alexander J Smola. Deep sets. Advances in neural information processing systems, 30, 2017

  29. [41]

    Genome-scale perturb-seq in primary human cd4+ t cells maps context-specific regulators of t cell programs and human immune traits

    Ronghui Zhu, Emma Dann, Jun Yan, Justine Reyes Retana, Ryunosuke Goto, Reese C Guitche, Lillian K Petersen, Mineto Ota, Jonathan K Pritchard, and Alexander Marson. Genome-scale perturb-seq in primary human cd4+ t cells maps context-specific regulators of t cell programs and human immune traits. bioRxiv, pages 2025--12, 2025

  30. [42]

    How to build the virtual cell with artificial intelligence: Priorities and opportunities

    Cell, 2024

  31. [43]

    scPerturb: harmonized single-cell perturbation data

    Nature Methods, 2024

  32. [44]

    Phenotypic drug discovery: recent successes, lessons learned and new directions

    Nature Reviews Drug Discovery, 2022

  33. [45]

    Perturbench: Benchmarking machine learning models for cellular perturbation analysis

    arXiv preprint arXiv:2408.10609, 2025

  34. [46]

    Cumulated gain-based evaluation of IR techniques

    ACM Transactions on Information Systems (TOIS), 2002

  35. [47]

    Comparing partitions

    Journal of Classification, 1985

  36. [48]

    Scaling large language models for next-generation single-cell analysis

    bioRxiv, 2026

  37. [49]

    A coefficient of agreement for nominal scales

    Educational and Psychological Measurement, 1960

  38. [50]

    Joint upper & expected value normalization for evaluation of retrieval systems: A case study with Learning-to-Rank methods

    Information Processing & Management, 2023

  39. [51]

    A theoretical analysis of NDCG type ranking measures

    Conference on Learning Theory, 2013

  40. [52]

    The BioGRID database: A comprehensive biomedical resource of curated protein, genetic, and chemical interactions

    Protein Science, 2021

  41. [53]

    Bixbench: a comprehensive benchmark for LLM-based agents in computational biology

    arXiv preprint arXiv:2503.00096

  42. [54]

    Benchmarking large language models on multiple tasks in bioinformatics NLP with prompting

    arXiv preprint arXiv:2503.04013

  43. [55]

    GPQA: A graduate-level Google-proof Q&A benchmark

    First Conference on Language Modeling, 2024

  44. [56]

    Humanity's Last Exam

    arXiv preprint arXiv:2501.14249

  45. [57]

    Virtual CRISPR: Can LLMs Predict CRISPR Screen Results?

    Proceedings of the 24th Workshop on Biomedical Language Processing, 2025

  46. [58]

    Contextualizing biological perturbation experiments through language

    arXiv preprint arXiv:2502.21290

  47. [59]

    scBaseCount: an AI agent-curated, uniformly processed, and continually expanding single cell data repository

    bioRxiv, 2025

  48. [60]

    Tahoe-100m: A giga-scale single-cell perturbation atlas for context-dependent gene function and cellular modeling

    bioRxiv, 2025

  49. [61]

    Mapping information-rich genotype-phenotype landscapes with genome-scale Perturb-seq

    Cell, 2022

  50. [62]

    Virtual Cell Challenge: Toward a Turing test for the virtual cell

    Cell, 2025

  51. [63]

    AI-Guided CRISPR Screen Accelerates Discovery of New Drug Targets

    bioRxiv, 2026

  52. [64]

    Rxrx1: A dataset for evaluating experimental batch correction methods

    Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

  53. [65]

    Generative Artificial Intelligence for Biology: Toward Unifying Models, Algorithms, and Modalities

  54. [66]

    Virtual Cells Need Context, Not Just Scale

    bioRxiv, 2026

  55. [67]

    Predicting transcriptional outcomes of novel multigene perturbations with GEARS

    Nature Biotechnology, 2024

  56. [68]

    Active learning framework leveraging transcriptomics identifies modulators of disease phenotypes

    Science, 2025

  57. [69]

    Deep-learning-based gene perturbation effect prediction does not yet outperform simple linear baselines

    Nature Methods, 2025

  58. [70]

    Defining and benchmarking open problems in single-cell analysis

    Nature Biotechnology, 2025

  59. [71]

    A new era of intelligence with gemini 3

    Google, Nov 2025

  60. [72]

    OpenAI GPT-5 System Card

    arXiv preprint arXiv:2601.03267

  61. [73]

    GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning

    arXiv preprint arXiv:2507.19457

  62. [74]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    arXiv preprint arXiv:2402.03300

  63. [75]

    gpt-oss-120b & gpt-oss-20b Model Card

    arXiv preprint arXiv:2508.10925

  64. [76]

    Biomni: A general-purpose biomedical AI agent

    bioRxiv, 2025

  65. [77]

    Reciprocal rank fusion outperforms condorcet and individual rank learning methods

    Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval, 2009

  66. [78]

    We Built an Open Source Autonomous Drug Discovery Agent

  67. [79]

    Gene-embedding-based prediction and functional evaluation of perturbation expression responses with PRESAGE

    bioRxiv, 2025

  68. [80]

    The PageRank citation ranking: Bringing order to the web

    1999

  69. [81]

    The probabilistic relevance framework: BM25 and beyond

    2009

  70. [82]

    A human protein-protein interaction network: a resource for annotating the proteome

    Cell, 2005

  71. [83]

    Qwen3 Technical Report

    arXiv preprint arXiv:2505.09388

  72. [84]

    DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models

    arXiv preprint arXiv:2512.02556

  73. [85]

    GLM-5: from Vibe Coding to Agentic Engineering

    arXiv preprint arXiv:2602.15763

  74. [86]

    Kimi K2.5: Visual Agentic Intelligence

    arXiv preprint arXiv:2602.02276

  75. [87]

    Qwen3-coder-next technical report

    arXiv preprint arXiv:2603.00729, 2026

  76. [88]

    Olmo 3

    arXiv preprint arXiv:2512.13961

  77. [89]

    Deep sets

    Advances in Neural Information Processing Systems, 30, 2017

  78. [90]

    Cell2Sentence: teaching large language models the language of biology

    bioRxiv

  79. [91]

    AlphaEvolve: A coding agent for scientific and algorithmic discovery

    arXiv preprint arXiv:2506.13131

  80. [92]

    Anthropic, 2025

Showing first 80 references.