Do Larger Models Really Win in Drug Discovery? A Benchmark Assessment of Model Scaling in AI-Driven Molecular Property and Activity Prediction

Jinjiang Guo

arxiv: 2604.26498 · v2 · pith:3JNQYK2Snew · submitted 2026-04-29 · 💻 cs.LG · q-bio.QM

Pith reviewed 2026-05-19 17:37 UTC · model grok-4.3

keywords: molecular property prediction · drug discovery · model scaling · graph neural networks · large language models · cheminformatics · benchmark · AI for chemistry

The pith

Classical ML models outperform larger pretrained and LLM approaches in most molecular prediction tasks for drug discovery

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper challenges the idea that bigger models will automatically win in AI-driven drug discovery by running a large-scale benchmark. It evaluates classical machine learning, graph neural networks, pretrained sequence models, and LLM-based baselines on 26 endpoints spanning ADME properties, toxicity, and bioactivity. The tests use 78 endpoint-split combinations with random, Murcko scaffold, and structure-separated 5-fold cross-validation to simulate easy retrospective checks through to hard novel-chemotype scenarios. Across 156 comparisons, compact classical models win the large majority of cases, showing that performance depends on matching model family to task and split difficulty rather than on increasing scale.

Core claim

Across 78 endpoint and split entries for molecular properties, toxicity, safety liabilities and biological activity, classical ML such as RF(ECFP4) and ExtraTrees(RDKit) win 116 out of 156 fold mean comparisons, GNNs such as GIN and Ligandformer win 25, pretrained sequence models such as MoLFormer and ChemBERTa2 win 12, and LLM based SAR baselines win three. ML dominates random split interpolation but loses part of this advantage under harder splits; GNN and sequence models also decline but gain relative ground, whereas LLM based SAR is weaker in absolute terms yet less sensitive to the split axis. SAR knowledge derived from training folds improves many GPT5.5-SAR and Opus4.7-SAR metrics but does not make rule based reasoning a universal substitute for supervised predictors.
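As a quick arithmetic check on the tally above, the four family counts can be summed and converted to shares. The sketch below uses only the counts quoted in the claim; the dictionary keys are labels chosen here for illustration.

```python
# Win counts per model family, as quoted in the core claim.
wins = {
    "classical ML": 116,
    "GNN": 25,
    "pretrained sequence": 12,
    "LLM-based SAR": 3,
}

total = sum(wins.values())  # should match the 156 comparisons reported
shares = {family: count / total for family, count in wins.items()}

for family, share in shares.items():
    print(f"{family:>20}: {wins[family]:>3} / {total} = {share:.1%}")
```

The counts do sum to 156, giving shares of roughly 74.4%, 16.0%, 7.7% and 1.9%. The total is twice the 78 endpoint-split entries, which would be consistent with two comparisons per entry, though the text reproduced here does not spell out how the 156 are defined.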

What carries the argument

The tiered cross-validation benchmark using random, Murcko scaffold and structure-separated 5-fold splits on 26 endpoints grouped into ADME, toxicity and bioactivity classes to compare four model families under increasing generalization demands
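To make the scaffold tier concrete, here is a minimal sketch of a group-aware 5-fold split in plain Python. It assumes each molecule already carries a precomputed group label, such as a Murcko scaffold SMILES produced by a cheminformatics toolkit. The function name `grouped_kfold`, the greedy balancing rule and the toy labels are illustrative assumptions, not the paper's procedure, and the structure-separated tier would need a similarity-based grouping that is not attempted here.

```python
from collections import defaultdict


def grouped_kfold(group_keys, n_folds=5):
    """Assign item indices to folds so that no group spans two folds.

    group_keys[i] is a precomputed label for molecule i, for example a
    Murcko scaffold SMILES (hypothetical input; not computed here).
    """
    groups = defaultdict(list)
    for idx, key in enumerate(group_keys):
        groups[key].append(idx)

    folds = [[] for _ in range(n_folds)]
    # Place the largest groups first, each into the currently smallest fold.
    for key in sorted(groups, key=lambda k: (-len(groups[k]), k)):
        smallest = min(range(n_folds), key=lambda f: len(folds[f]))
        folds[smallest].extend(groups[key])
    return folds


# Toy scaffold labels, for illustration only.
scaffolds = (
    ["c1ccccc1"] * 4
    + ["c1ccncc1"] * 3
    + ["C1CCCCC1"] * 2
    + ["c1ccoc1", "c1ccsc1", "C1CCNCC1"]
)
folds = grouped_kfold(scaffolds, n_folds=5)

for i, fold in enumerate(folds):
    print(i, sorted({scaffolds(j) if False else scaffolds[j] for j in fold}))
```

Each fold then serves once as the held-out set while the other four are used for training. A random split is the same loop with every molecule treated as its own group, which is why it allows close analogs on both sides of the train/test boundary.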

If this is right

Classical ML models achieve highest accuracy on easier random splits but their lead narrows on scaffold and structure-separated splits
GNNs and pretrained sequence models lose ground in absolute terms on harder splits yet improve their relative ranking against classical ML
LLM-based SAR baselines deliver lower absolute performance but remain more stable when split difficulty increases
Incorporating SAR knowledge from the training folds raises LLM metrics without turning rule-based reasoning into a universal replacement for supervised predictors
Overall success depends on the fit between model family, task type and validation scenario rather than on model scale

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Teams running routine high-throughput screens could default to fast classical models to reduce compute costs without sacrificing accuracy
LLMs may still add value in very low-data regimes for generating SAR hypotheses even if they trail on raw prediction metrics
Future work should test these patterns on time-based or truly prospective splits to check whether the observed family rankings hold in live discovery campaigns

Load-bearing premise

The 78 endpoint and split entries using random, Murcko scaffold and structure-separated 5-fold CV adequately represent the spectrum of real-world drug discovery challenges from closed-library retrospective evaluation to novel chemotype library expansion

What would settle it

A follow-up study on a fresh set of endpoints or on prospectively collected compounds where larger pretrained or LLM models consistently beat classical ML across all three split types would show the scaling assumption holds after all

Figures

Figures reproduced from arXiv: 2604.26498 by Jinjiang Guo.

**Figure 2.** Molecular representation pathways compared in the benchmark. Fingerprints and de…

**Figure 3.** Structure-similarity-separated five-fold cross-validation workflow. Molecules are stan…

**Figure 4.** Proportional summary of model-family wins across ADMET, Tox21 and anti-infective…

**Figure 5.** Effect of train-fold-derived SAR knowledge on LLM-SAR performance across task groups…

Original abstract

The rapid growth of molecular foundation models and large language models has encouraged a scale centred view of AI in drug discovery, in which larger pretrained models are expected to supersede compact cheminformatics models and graph neural networks (GNNs) trained for individual tasks. We test this assumption across 26 endpoints for molecular properties, toxicity, safety liabilities and biological activity, grouped into ADME, toxicity and bioactivity classes. The benchmark contains 78 endpoint and split entries spanning random, Murcko scaffold and structure separated 5-fold CV. Ordered from easiest to hardest, these splits approximate retrospective evaluation on a closed library, scaffold expansion in hit to lead, and library expansion on novel chemotypes. Each entry includes ML, GNN, pretrained molecular sequence and LLM based SAR families. Across 156 fold mean comparisons, classical ML such as RF(ECFP4) and ExtraTrees(RDKit) win 116, GNNs such as GIN and Ligandformer win 25, pretrained sequence models such as MoLFormer and ChemBERTa2 win 12, and LLM based SAR baselines win three. ML dominates random split interpolation but loses part of this advantage under harder splits; GNN and sequence models also decline but gain relative ground, whereas LLM based SAR is weaker in absolute terms yet less sensitive to the split axis. Paired bootstrap analyses support family level trends more strongly than individual model rankings. SAR knowledge derived from training folds improves many GPT5.5-SAR and Opus4.7-SAR metrics but does not make rule based reasoning a universal substitute for supervised predictors. Compact specialized models remain highly effective for molecular property and activity prediction. Larger models add value for SAR interpretation and reasoning in low data settings, but predictive performance depends on the fit among model, task and validation scenario, not on scale alone.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

Classical ML models win most head-to-heads here, but the splits don't fully test prospective shifts that matter in real drug discovery.

The core finding is that classical ML models such as random forests on ECFP4 fingerprints and ExtraTrees on RDKit descriptors beat GNNs, pretrained sequence models, and LLM-based SAR baselines in 116 of 156 fold-mean comparisons across 26 endpoints. The paper runs the same tasks under random, Murcko-scaffold, and structure-separated 5-fold CV to stand in for closed-library, scaffold-expansion, and novel-chemotype scenarios. Classical ML dominates the easy splits and loses some ground on harder ones; GNNs and sequence models also decline in absolute terms but gain relative ground, and the LLM-based SAR baselines are weaker overall yet less sensitive to the split. Paired bootstrap results back the family-level ordering more than any single model ranking. SAR knowledge derived from the training folds improves many GPT5.5-SAR and Opus4.7-SAR metrics but does not replace supervised predictors outright.

That multi-family, multi-split tally on ADME, toxicity, and bioactivity endpoints is the concrete new piece; prior scaling debates have not lined up exactly these four families against the same 78 endpoint-split combinations. The work is useful for anyone choosing models for routine property prediction rather than open-ended generation.

Methods transparency is thin in the abstract (no error bars, no exclusion rules, no hyperparameter protocol), so the win counts are hard to stress-test without the full tables. The concern about temporal ordering is on point: scaffold and structure splits still allow analog leakage that a strict assay-date cut would block, and that could shrink the classical ML advantage in genuine prospective use.

The paper is aimed at computational drug-discovery groups deciding between scaling and specialization. It is coherent on its own terms and engages the scaling claim directly with data, so it should go to peer review even if the methods section needs expansion and a temporal-split ablation would strengthen it.

Referee Report

2 major / 2 minor

Summary. The manuscript benchmarks classical ML models (e.g., RF(ECFP4), ExtraTrees(RDKit)), GNNs (e.g., GIN, Ligandformer), pretrained sequence models (e.g., MoLFormer, ChemBERTa2), and LLM-based SAR baselines across 26 endpoints in ADME, toxicity, and bioactivity categories. Using 78 endpoint/split entries with random, Murcko scaffold, and structure-separated 5-fold CV, it reports classical ML winning 116 of 156 fold-mean comparisons, GNNs winning 25, pretrained models 12, and LLMs 3. The central claim is that compact specialized models remain highly effective for predictive performance, while larger models add value mainly for SAR interpretation in low-data settings, with performance depending on model-task-validation fit rather than scale alone.

Significance. If the empirical results hold under scrutiny, the work is significant for providing a large-scale, multi-family comparison that challenges scale-centric assumptions in AI for drug discovery. It supplies concrete win-rate data and notes relative robustness of LLM-SAR to split difficulty, which could inform model selection. The use of paired bootstrap analyses for family trends and the ordering of splits by difficulty are positive features that increase the benchmark's reference value.
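For readers unfamiliar with the paired bootstrap mentioned here, the following is a minimal sketch of the general technique in plain Python. It resamples paired scores (the same benchmark entries evaluated by two model families) and reports the mean difference with a percentile interval. The unit of resampling, the number of resamples, the function name `paired_bootstrap_diff` and the toy scores are all assumptions made for illustration; the manuscript's exact protocol is not described in the text available here.

```python
import random
from statistics import mean


def paired_bootstrap_diff(scores_a, scores_b, n_boot=5000, seed=0):
    """Paired bootstrap of the mean score difference (A minus B).

    scores_a[i] and scores_b[i] must refer to the same benchmark entry.
    Returns the observed mean difference, a 95% percentile interval, and
    the fraction of resamples in which A has the higher mean.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("paired inputs must have equal length")
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)

    boot = []
    for _ in range(n_boot):
        # Resample entries with replacement, keeping each pair intact.
        boot.append(mean([diffs[rng.randrange(n)] for _ in range(n)]))
    boot.sort()

    lo = boot[int(0.025 * n_boot)]
    hi = boot[int(0.975 * n_boot) - 1]
    frac_a_better = sum(d > 0 for d in boot) / n_boot
    return mean(diffs), (lo, hi), frac_a_better


# Toy per-entry scores for two families (illustrative values only).
family_a = [0.90, 0.85, 0.80, 0.88, 0.76]
family_b = [0.86, 0.84, 0.81, 0.80, 0.75]
observed, (lo, hi), frac = paired_bootstrap_diff(family_a, family_b)
print(f"mean diff = {observed:.3f}, 95% interval = [{lo:.3f}, {hi:.3f}], P(A > B) = {frac:.2f}")
```

Because the pairs are kept together, variation shared by both families on a hard or easy entry cancels out of the difference, which is one reason a paired analysis can support family-level trends even when individual rankings are noisy.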

major comments (2)

[Results] Results section (win-count reporting): The abstract and results state classical ML wins 116 of 156 comparisons, yet no per-family variance, confidence intervals, or explicit tie-handling rules are supplied alongside the bootstrap analyses; without these, the strength of the family-level dominance claim cannot be fully evaluated from the numbers alone.
[Methods] Methods (validation splits): The structure-separated 5-fold CV is used to approximate novel-chemotype library expansion, but the manuscript provides no quantitative checks for residual chemical similarity or analog leakage between folds, nor any comparison against strict temporal splits by assay or patent date; this leaves open whether the observed classical-ML advantage would persist under the distribution shifts typical of prospective drug-discovery validation.

minor comments (2)

[Abstract] Abstract: The terms 'GPT5.5-SAR' and 'Opus4.7-SAR' appear without prior definition or reference to the underlying LLM versions and prompting strategy.
[Tables] Tables: Ensure every results table lists the exact number of comparisons contributing to each win count so readers can verify the 156 total and the per-class breakdowns.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive comments and for recognizing the benchmark's scope and reference value. We address each major comment below with clarifications and indicate where revisions will be made.

Referee: [Results] Results section (win-count reporting): The abstract and results state classical ML wins 116 of 156 comparisons, yet no per-family variance, confidence intervals, or explicit tie-handling rules are supplied alongside the bootstrap analyses; without these, the strength of the family-level dominance claim cannot be fully evaluated from the numbers alone.

Authors: The win counts are presented as descriptive aggregates, while the paired bootstrap analyses were used to evaluate family-level trends. We agree that explicit reporting of per-family variance, confidence intervals, and tie-handling rules would strengthen interpretability. In the revised manuscript we will add bootstrap-derived confidence intervals for each family's win rate and specify the tie-handling procedure (ties assigned proportionally to the tied families).

revision: yes

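One way the promised tie handling and win-rate intervals could look is sketched below in plain Python. It reads "assigned proportionally" as an equal split of one win among the tied families, and it assumes a metric where higher is better; both are interpretations made here, as are the function names and the toy scores.

```python
import random


def tally_wins(comparisons, tol=0.0):
    """Count wins per family, splitting ties equally.

    comparisons: list of dicts mapping family name -> score (higher is better).
    Families within `tol` of the best score share one win between them.
    """
    totals = {}
    for scores in comparisons:
        best = max(scores.values())
        tied = [fam for fam, s in scores.items() if best - s <= tol]
        for fam in scores:
            totals.setdefault(fam, 0.0)
        for fam in tied:
            totals[fam] += 1.0 / len(tied)
    return totals


def bootstrap_win_rate_ci(comparisons, family, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for one family's win rate."""
    rng = random.Random(seed)
    n = len(comparisons)
    rates = []
    for _ in range(n_boot):
        sample = [comparisons[rng.randrange(n)] for _ in range(n)]
        rates.append(tally_wins(sample).get(family, 0.0) / n)
    rates.sort()
    lo = rates[int((alpha / 2) * n_boot)]
    hi = rates[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


# Toy comparisons (illustrative values only); the second entry is a tie.
comparisons = [
    {"ML": 0.90, "GNN": 0.85, "SEQ": 0.80, "LLM": 0.70},
    {"ML": 0.80, "GNN": 0.80, "SEQ": 0.75, "LLM": 0.60},
    {"ML": 0.70, "GNN": 0.72, "SEQ": 0.71, "LLM": 0.65},
    {"ML": 0.88, "GNN": 0.81, "SEQ": 0.79, "LLM": 0.66},
]
totals = tally_wins(comparisons)
lo, hi = bootstrap_win_rate_ci(comparisons, "ML")
print(totals, (lo, hi))
```

With equal splitting the per-family totals always sum to the number of comparisons, which makes a reported total such as 156 easy to verify. For error metrics where lower is better, the scores would need to be negated before tallying.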
Referee: [Methods] Methods (validation splits): The structure-separated 5-fold CV is used to approximate novel-chemotype library expansion, but the manuscript provides no quantitative checks for residual chemical similarity or analog leakage between folds, nor any comparison against strict temporal splits by assay or patent date; this leaves open whether the observed classical-ML advantage would persist under the distribution shifts typical of prospective drug-discovery validation.

Authors: We will incorporate quantitative checks for residual similarity, including mean and distribution of ECFP4 Tanimoto similarities between training and test folds, to document the degree of analog leakage. However, the source datasets do not contain consistent assay or patent dates for all 26 endpoints, precluding a uniform temporal-split comparison. We will note this data limitation explicitly and discuss the structure-separated split as a practical proxy for prospective validation.

revision: partial
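The similarity check described in this response can be sketched in a few lines of plain Python. The example treats each fingerprint as a set of on-bit indices and reports both the all-pairs mean and the nearest-training-neighbour similarity for each test molecule. The bit sets are toy values standing in for ECFP4 fingerprints that would come from a cheminformatics toolkit, and the 0.4 threshold and function names are assumptions made for illustration.

```python
from statistics import mean


def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two sets of fingerprint on-bits."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def nearest_train_similarity(train_fps, test_fps):
    """For each test fingerprint, the highest similarity to any training fingerprint."""
    return [max(tanimoto(t, tr) for tr in train_fps) for t in test_fps]


# Toy on-bit sets standing in for ECFP4 fingerprints (illustrative only).
train = [{1, 2, 3, 4}, {10, 11, 12}]
test = [{1, 2, 3, 5}, {20, 21}]

all_pairs = [tanimoto(t, tr) for t in test for tr in train]
nearest = nearest_train_similarity(train, test)

threshold = 0.4  # hypothetical cut-off for flagging a close analog
flagged = sum(s >= threshold for s in nearest) / len(nearest)

print("mean all-pairs similarity:", mean(all_pairs))
print("nearest-neighbour similarities:", nearest)
print("fraction of test molecules with a close training analog:", flagged)
```

The nearest-neighbour view is often more informative for leakage than the all-pairs mean, because a single close analog in the training fold is enough to make a test molecule easy, while the mean is dominated by the many dissimilar pairs.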

standing simulated objections not resolved

Direct comparison to strict temporal splits by assay or patent date, because consistent temporal metadata is unavailable across the full set of public datasets used.

Circularity Check

0 steps flagged

No circularity: empirical benchmark rests on direct held-out comparisons

The paper reports model performance via explicit 5-fold CV on 78 endpoint/split combinations, counting wins across 156 comparisons and supporting trends with paired bootstrap. No equations, fitted parameters, or derivations are present; results are computed directly from held-out test folds rather than being redefined or predicted from the training statistics themselves. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The central claim therefore remains an independent empirical observation rather than a self-referential reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper is an empirical benchmark study. It introduces no new mathematical derivations, fitted constants, or postulated physical entities. The main untested premise is that the chosen endpoints and splits stand in for real drug-discovery scenarios.

axioms (1)

domain assumption: The random, Murcko scaffold, and structure-separated 5-fold CV splits approximate retrospective closed-library evaluation, scaffold expansion in hit-to-lead, and library expansion on novel chemotypes.
Explicitly stated in the abstract as the ordering from easiest to hardest splits.

pith-pipeline@v0.9.0 · 5873 in / 1389 out tokens · 60076 ms · 2026-05-19T17:37:02.926593+00:00 · methodology

[1] [1]

Feinberg, Joseph Gomes, Caleb Geniesse, Aneesh S

Zhenqin Wu, Bharath Ramsundar, Evan N. Feinberg, Joseph Gomes, Caleb Geniesse, Aneesh S. Pappu, Karl Leswing, and Vijay Pande. Moleculenet: A benchmark for molec- ular machine learning.Chemical Science, 9:513–530, 2018. doi: 10.1039/C7SC02664A

work page doi:10.1039/c7sc02664a 2018

[2] [2]

Huang, T

Kexin Huang, Tianfan Fu, Wenhao Gao, Yue Zhao, Yusuf Roohani, Jure Leskovec, Connor W. Coley, Cao Xiao, Jimeng Sun, and Marinka Zitnik. Therapeutics data commons: Machine learning datasets and tasks for drug discovery and development. InProceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks, 2021. URLhttps:// arxiv.org/a...

work page arXiv 2021

[3] [3]

Bronskill, Krzysztof Maziarz, Henryk Misztela, Julien Lanini, Marwin Segler, Nadine Schneider, and Marc Brockschmidt

Megan Stanley, John F. Bronskill, Krzysztof Maziarz, Henryk Misztela, Julien Lanini, Marwin Segler, Nadine Schneider, and Marc Brockschmidt. Fs-mol: A few-shot learning dataset of molecules. InNeurIPS Datasets and Benchmarks Track, 2021. URLhttps://openreview. net/forum?id=701FtuyLlAd

work page 2021

[4] [4]

Limitations of representation learning in small molecule property prediction.Nature Communications, 14:6394, 2023

Ana Laura Dias, Latimah Bustillo, and Tiago Rodrigues. Limitations of representation learning in small molecule property prediction.Nature Communications, 14:6394, 2023. doi: 10.1038/ s41467-023-41967-3. URLhttps://www.nature.com/articles/s41467-023-41967-3

work page 2023

[5] [5]

Jun Xia, Lecheng Zhang, Xiao Zhu, and Stan Z. Li. Why deep models often cannot beat non-deep counterparts on molecular property prediction?, 2023. URLhttps://arxiv.org/ abs/2306.17702

work page arXiv 2023

[6] [6]

Benchmarking ma- chine learning in admet predictions: The practical impact of feature representations in ligand- based models.Journal of Cheminformatics, 17:108, 2025

Gintautas Kamuntavicius, Tanya Paquet, Orestis Bastas, Dainius Salkauskas, Alvaro Prat, Hisham Abdel Aty, Aurimas Pabrinkis, Povilas Norvaisas, and Roy Tal. Benchmarking ma- chine learning in admet predictions: The practical impact of feature representations in ligand- based models.Journal of Cheminformatics, 17:108, 2025. doi: 10.1186/s13321-025-01041-0....

work page doi:10.1186/s13321-025-01041-0 2025

[7] [7]

ChemBERTa: large -scale self -supervised pretraining fo r molecular property prediction

Seyone Chithrananda, Gabriel Grand, and Bharath Ramsundar. Chemberta: Large-scale self- supervised pretraining for molecular property prediction, 2020. URLhttps://arxiv.org/ abs/2010.09885

work page arXiv 2020

[8] [8]

ChemBERTa- 2: Towards chemical foundation models.arXiv preprint arXiv:2209.01712, 2022

Walid Ahmad, Eric Simon, Seyone Chithrananda, Gabriel Grand, and Bharath Ramsundar. Chemberta-2: Towards chemical foundation models, 2022. URLhttps://arxiv.org/abs/ 2209.01712

work page arXiv 2022

[9] [9]

Large-scale chemical language representations capture molecular structure and properties.Nature Machine Intelligence, 2022

Jerret Ross, Brian Belgodere, Vijil Chenthamarakshan, Inkit Padhi, Youssef Mroueh, and Payel Das. Large-scale chemical language representations capture molecular structure and properties.Nature Machine Intelligence, 2022. URLhttps://www.nature.com/articles/ s42256-022-00580-7

work page 2022

[10] [10]

Tice, Christopher P

Raymond R. Tice, Christopher P. Austin, Robert J. Kavlock, and John R. Bucher. Tox21 challenge to build predictive models of nuclear receptor and stress response pathways as medi- ated by exposure to environmental chemicals and drugs.Frontiers in Environmental Science, 16

work page

[11] [11]

URLhttps://www.frontiersin.org/journals/environmental-science/articles/ 10.3389/fenvs.2015.00085/full

work page doi:10.3389/fenvs.2015.00085/full 2015

[12] [12]

Qsar modeling of tox21 challenge stress response and nuclear receptor signaling toxicity assays.Frontiers in Environmental Science, 2016

Andreas Mayr, Gunter Klambauer, Thomas Unterthiner, and Sepp Hochreiter. Qsar modeling of tox21 challenge stress response and nuclear receptor signaling toxicity assays.Frontiers in Environmental Science, 2016. URLhttps://www.frontiersin.org/articles/10.3389/ fenvs.2016.00003/full

work page arXiv 2016

[13] [13]

Lemenze, Emily C

Poonam Chitale, Alexander D. Lemenze, Emily C. Fogarty, Avi Shah, Courtney Grady, Aubrey R. Odom-Mabey, W. Evan Johnson, Jason H. Yang, A. Murat Eren, Roland Brosch, Pradeep Kumar, and David Alland. A comprehensive update to the mycobac- terium tuberculosis h37rv reference genome.Nature Communications, 13:7068, 2022. doi: 10.1038/s41467-022-34853-x

work page doi:10.1038/s41467-022-34853-x 2022

[14] [14]

Wallace, Vineet Kumar, Ursula Pieper, Andrej Sali, Jeremy R

Francisco Mart ’inez-Jim ’enez, George Papadatos, Li Yang, Iain M. Wallace, Vineet Kumar, Ursula Pieper, Andrej Sali, Jeremy R. Brown, John P. Overington, and Marc A. Marti-Renom. Target prediction for an open access set of compounds active against mycobacterium tuberculosis.PLoS Computational Biology, 9(10):e1003253, 2013. doi: 10.1371/journal.pcbi.1003253

[15] Thomas Lane, Daniel P. Russo, Kimberley M. Zorn, Alex M. Clark, Alexandru Korotcov, Valery Tkachenko, Robert C. Reynolds, Alexander L. Perryman, Joel S. Freundlich, and Sean Ekins. Comparing and validating machine learning models for Mycobacterium tuberculosis drug discovery. Molecular Pharmaceutics, 15(10):4346–4360, 2018. doi: 10.1021/acs.molpharmaceu...

[16] Shiroh Iwanaga, Rie Kubota, Tsubasa Nishi, Sumalee Kamchonwongpaisan, Somdet Srichairatanakool, Naoaki Shinzawa, Din Syafruddin, Masao Yuda, and Chairat Uthaipibull. Genome-wide functional screening of drug-resistance genes in Plasmodium falciparum. Nature Communications, 13:6163, 2022. doi: 10.1038/s41467-022-33804-w

[17] Leo Breiman. Random forests. Machine Learning, 45(1):5–32, 2001. doi: 10.1023/A:1010933404324

[18] Pierre Geurts, Damien Ernst, and Louis Wehenkel. Extremely randomized trees. Machine Learning, 63(1):3–42, 2006. doi: 10.1007/s10994-006-6226-1

[19] Jerome H. Friedman. Greedy function approximation: A gradient boosting machine. The Annals of Statistics, 29(5):1189–1232, 2001. doi: 10.1214/aos/1013203451

[20] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine Learning, 20:273–297, 1995. doi: 10.1007/BF00994018

[21] Tianqi Chen and Carlos Guestrin. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 785–794, 2016. doi: 10.1145/2939672.2939785

[22] David Rogers and Mathew Hahn. Extended-connectivity fingerprints. Journal of Chemical Information and Modeling, 50(5):742–754, 2010. doi: 10.1021/ci100050t

[23] Sereina Riniker and Gregory A. Landrum. Similarity maps – a visualization strategy for molecular fingerprints and machine-learning methods. Journal of Cheminformatics, 5:43, 2013. doi: 10.1186/1758-2946-5-43

[24] Justin Gilmer, Samuel S. Schoenholz, Patrick F. Riley, Oriol Vinyals, and George E. Dahl. Neural message passing for quantum chemistry. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 1263–1272, 2017. URL https://proceedings.mlr.press/v70/gilmer17a.html

[25] Xiaomin Fang, Lihang Liu, et al. Geometric deep learning for molecular property prediction: A review. Nature Machine Intelligence, 2023

[26] Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations, 2017. URL https://arxiv.org/abs/1609.02907

[27] Petar Velickovic, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. Graph attention networks. In International Conference on Learning Representations, 2018. URL https://arxiv.org/abs/1710.10903

[28] Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are graph neural networks? In International Conference on Learning Representations, 2019. URL https://arxiv.org/abs/1810.00826

[29] Jinjiang Guo, Qi Liu, Han Guo, and Xi Lu. Ligandformer: A graph neural network for predicting compound property with robust interpretation, 2022. URL https://arxiv.org/abs/2202.10873

[30] Taicheng Guo, Kehan Guo, Bozhao Nan, Zixing Liang, Zhichun Guo, Nitesh V. Chawla, Olaf Wiest, and Xiangliang Zhang. What can large language models do in chemistry? A comprehensive benchmark on eight tasks. arXiv preprint arXiv:2305.18365, 2023. doi: 10.48550/arXiv.2305.18365

[31] David Weininger. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. Journal of Chemical Information and Computer Sciences, 28(1):31–36, 1988. doi: 10.1021/ci00057a005

[32] Guy W. Bemis and Mark A. Murcko. The properties of known drugs. 1. Molecular frameworks. Journal of Medicinal Chemistry, 39(15):2887–2893, 1996. doi: 10.1021/jm9602928

[33] Alexander Tropsha. Best practices for QSAR model development, validation, and exploitation. Molecular Informatics, 29(6–7):476–488, 2010. doi: 10.1002/minf.201000061

[34] José Jiménez-Luna, Francesca Grisoni, and Gisbert Schneider. Drug discovery with explainable artificial intelligence. Nature Machine Intelligence, 2:573–584, 2020. doi: 10.1038/s42256-020-00236-4
