Do Larger Models Really Win in Drug Discovery? A Benchmark Assessment of Model Scaling in AI-Driven Molecular Property and Activity Prediction
Pith reviewed 2026-05-07 11:31 UTC · model grok-4.3
The pith
Compact classical and graph models outperform larger pretrained ones on most molecular prediction tasks in a large benchmark.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Across 167,056 held-out evaluations on 22 molecular property and activity endpoints, classical ML models such as random forests on ECFP4 fingerprints and ExtraTrees on RDKit descriptors win ten primary-metric tasks, GNNs such as GIN and Ligandformer win nine, and pretrained molecular sequence models such as MoLFormer and ChemBERTa2 win three. Rule-based SAR reasoning with large language models does not win under the prespecified primary metrics, although it shows some gains when supplied with training-fold knowledge. These outcomes indicate that compact, specialized models remain highly effective, performance differences among model classes are often modest and endpoint-dependent, and larger or more general models do not provide a universal predictive advantage.
What carries the argument
Structure-similarity-separated five-fold cross-validation benchmark on 22 endpoints comparing classical ML, GNNs, pretrained sequence models, and rule-based LLM baselines.
If this is right
- Performance gains from model class are modest and vary strongly with the biological endpoint.
- Compact specialized models can match or exceed larger general models for standard predictive tasks.
- Pretrained molecular models do not deliver a universal advantage in supervised property prediction.
- Large models may still be useful for zero-shot reasoning and SAR interpretation rather than direct prediction.
- Model selection in drug discovery should prioritize alignment of representation, inductive bias, and data regime over scale.
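The load-bearing machinery here is the structure-similarity-separated split: whole clusters of similar molecules are assigned to the same fold so that near-duplicates never straddle a train/test boundary. A minimal sketch of that idea, using toy substructure-key sets in place of real ECFP4 fingerprints; the threshold and fingerprints below are illustrative, not the paper's:

```python
def tanimoto(a, b):
    """Tanimoto similarity of two fingerprint key sets: |A ∩ B| / |A ∪ B|."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def similarity_clusters(fps, threshold=0.4):
    """Greedy single-linkage clustering: a molecule joins the first cluster
    containing any member more similar than the threshold."""
    clusters = []
    for i, fp in enumerate(fps):
        for cluster in clusters:
            if any(tanimoto(fp, fps[j]) > threshold for j in cluster):
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters

def cluster_kfold(fps, k=5, threshold=0.4):
    """Assign whole clusters to folds (largest first, into the smallest fold)
    so similar structures never split across train and test."""
    folds = [[] for _ in range(k)]
    for cluster in sorted(similarity_clusters(fps, threshold),
                          key=len, reverse=True):
        min(folds, key=len).extend(cluster)
    return folds

# Toy fingerprints: each set stands in for hashed ECFP4 substructure keys.
fps = [{1, 2, 3, 4}, {1, 2, 3, 5}, {7, 8, 9}, {7, 8, 10}, {20, 21}, {30, 31}]
folds = cluster_kfold(fps, k=3)
```

With these toy inputs the two similar pairs land in separate folds from each other and from the singletons, which is exactly the leakage-avoidance property the benchmark's protocol relies on.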
Where Pith is reading between the lines
- Drug-discovery teams could reduce compute costs by preferring compact models for routine property prediction while reserving larger models for hypothesis generation.
- Future benchmarks should include more internal pharmaceutical data and real-world deployment metrics to test whether the observed pattern holds outside public datasets.
- The modest performance gaps suggest that hybrid approaches combining classical fingerprints with lightweight graph layers may be sufficient for many endpoints.
Load-bearing premise
The chosen endpoints, similarity-separated validation protocol, and selected model implementations fairly represent typical molecular prediction problems without systematically favoring any model class.
What would settle it
A follow-up study on the same or expanded endpoints that uses a different validation split or includes more diverse external test sets and finds larger pretrained models winning a clear majority of tasks under the same primary metrics.
read the original abstract
The rapid growth of molecular foundation models and general-purpose large language models has encouraged a scale-centric view of artificial intelligence in drug discovery, in which larger pretrained models are expected to supersede compact cheminformatics models and task-specific graph neural networks (GNNs). We test this assumption on 22 molecular property and activity endpoints, including public ADMET and Tox21 benchmarks and two internal anti-infective activity datasets. Across 167,056 held-out task–molecule evaluations under structure-similarity-separated five-fold cross-validation (37,756 ADMET, 77,946 Tox21, 49,266 anti-TB and 2,088 antimalaria), classical machine-learning (ML) models such as RF(ECFP4) and ExtraTrees(RDKit descriptors) win ten primary-metric tasks, GNNs such as GIN and Ligandformer win nine, and pretrained molecular sequence models such as MoLFormer and ChemBERTa2 win three. Rule-based SAR reasoning baselines, represented by GPT5.5-SAR and Opus4.7-SAR, do not win under the prespecified primary metrics, although train-fold-derived SAR knowledge provides measurable but uneven gains for SAR reasoning and interpretation. These results indicate that compact, specialized models remain highly effective for molecular property and activity prediction. The performance differences among classical ML, GNN and pretrained sequence models are often modest and endpoint-dependent, whereas larger or more general models do not provide a universal predictive advantage. Large models may still add value for zero-shot reasoning, SAR interpretation and hypothesis generation, but the results suggest that predictive performance depends on the alignment among molecular representation, inductive bias, data regime, endpoint biology and validation protocol.
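The headline comparison reduces to counting which model family takes the prespecified primary metric on each endpoint. A toy tally in that shape; the endpoint names and scores below are invented for illustration and are not the paper's numbers:

```python
# Hypothetical primary-metric scores (higher is better) per endpoint.
# All names and values here are illustrative placeholders.
scores = {
    "herg":       {"classical": 0.81, "gnn": 0.79, "pretrained": 0.77},
    "solubility": {"classical": 0.74, "gnn": 0.78, "pretrained": 0.73},
    "tox21_ar":   {"classical": 0.70, "gnn": 0.69, "pretrained": 0.72},
}

def win_counts(scores):
    """Count, per model family, how many endpoints it wins on the primary metric."""
    counts = {}
    for endpoint, by_family in scores.items():
        winner = max(by_family, key=by_family.get)
        counts[winner] = counts.get(winner, 0) + 1
    return counts

print(win_counts(scores))  # in this toy table, each family wins one endpoint
```

The paper's 10/9/3 split is this tally computed over 22 real endpoints; the toy version just makes explicit that "wins" compresses away the (often small) margins the abstract flags as modest and endpoint-dependent.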
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript benchmarks classical ML (RF(ECFP4), ExtraTrees(RDKit)), GNNs (GIN, Ligandformer), and pretrained sequence models (MoLFormer, ChemBERTa2) plus rule-based SAR baselines on 22 molecular property/activity endpoints (public ADMET, Tox21, and two internal anti-infective sets). Using structure-similarity-separated 5-fold CV across 167,056 held-out evaluations, it reports classical ML winning 10 primary-metric tasks, GNNs winning 9, and sequence models winning 3, concluding that larger or more general models do not confer a universal predictive advantage and that compact specialized models remain highly effective, with differences being modest and endpoint-dependent.
Significance. If the results are robust, the work supplies a large-scale empirical counterpoint to scale-centric assumptions in AI-driven drug discovery. The volume of evaluations (37k ADMET + 78k Tox21 + 51k internal) and inclusion of both public and proprietary endpoints provide concrete data on when representation-inductive-bias alignment matters more than model size. The explicit separation of predictive performance from zero-shot/SAR-interpretation uses of large models is a useful distinction for practitioners.
major comments (2)
- [Abstract] Abstract and validation protocol: the central claim that classical ML wins 10 tasks versus 3 for pretrained sequence models rests on the structure-similarity-separated 5-fold CV. Because the split metric is likely computed from fingerprints or descriptors that directly overlap with the input representations of RF(ECFP4) and ExtraTrees(RDKit), the held-out folds may be systematically easier for classical models than for SMILES-sequence models whose pretraining does not optimize for the same local substructure similarity. This coupling risks confounding the reported win distribution with validation design rather than intrinsic scaling behavior.
- [Methods] Methods (CV and model details): the manuscript should specify the exact similarity metric and threshold used for fold separation, report per-endpoint performance tables with confidence intervals or standard deviations across folds, and include at least one control experiment (e.g., random splits or similarity-agnostic splits) to test whether the observed advantage for fingerprint-based models persists.
minor comments (1)
- [Abstract] Abstract: the phrase 'train-fold-derived SAR knowledge provides measurable but uneven gains' would benefit from a quantitative summary or supplementary table showing the magnitude of those gains across endpoints.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our benchmark study. We address the concerns about the validation protocol and methods below, with revisions to improve transparency and reporting.
read point-by-point responses
Referee: [Abstract] Abstract and validation protocol: the central claim that classical ML wins 10 tasks versus 3 for pretrained sequence models rests on the structure-similarity-separated 5-fold CV. Because the split metric is likely computed from fingerprints or descriptors that directly overlap with the input representations of RF(ECFP4) and ExtraTrees(RDKit), the held-out folds may be systematically easier for classical models than for SMILES-sequence models whose pretraining does not optimize for the same local substructure similarity. This coupling risks confounding the reported win distribution with validation design rather than intrinsic scaling behavior.
Authors: The structure-similarity split follows standard practice in molecular machine learning to simulate realistic generalization to novel chemical matter and avoid leakage from near-duplicates. We have revised the Methods to explicitly state that fold separation uses Tanimoto similarity on ECFP4 fingerprints (threshold now specified). While this representation overlaps with one classical baseline, the same partitions are used for all models, and pretrained sequence models are expected to capture substructure patterns from their large-scale pretraining. The endpoint-dependent results and modest effect sizes indicate that the win distribution reflects inductive bias alignment rather than an artifact of the split. We added a clarifying sentence to the abstract and a short discussion paragraph on validation design. revision: partial
Referee: [Methods] Methods (CV and model details): the manuscript should specify the exact similarity metric and threshold used for fold separation, report per-endpoint performance tables with confidence intervals or standard deviations across folds, and include at least one control experiment (e.g., random splits or similarity-agnostic splits) to test whether the observed advantage for fingerprint-based models persists.
Authors: We agree on the need for greater detail. The revised Methods now specifies the exact metric (Tanimoto similarity on ECFP4 fingerprints) and the threshold used for fold separation. We have added supplementary tables reporting mean performance and standard deviation across the five folds for every endpoint, model, and primary/secondary metric. For the control, we performed an additional analysis using random splits; under this protocol all models improve but the relative ordering (classical ML and GNNs competitive with or ahead of sequence models on most tasks) is preserved. These results are reported in a new subsection and support that the main findings are not driven solely by the similarity-based split. revision: yes
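The supplementary reporting the authors promise — mean and standard deviation across the five folds for each endpoint and model — is straightforward with the standard library. A sketch with invented placeholder fold scores (the endpoint/model keys and values are illustrative only):

```python
from statistics import mean, stdev

# Hypothetical per-fold primary-metric scores for one endpoint and two models.
fold_scores = {
    ("herg", "RF(ECFP4)"): [0.82, 0.79, 0.81, 0.80, 0.78],
    ("herg", "MoLFormer"): [0.77, 0.80, 0.76, 0.79, 0.75],
}

def summarize(fold_scores):
    """Mean and sample standard deviation across CV folds, rounded for tables."""
    return {key: (round(mean(v), 3), round(stdev(v), 3))
            for key, v in fold_scores.items()}

for (endpoint, model), (m, s) in summarize(fold_scores).items():
    print(f"{endpoint:6s} {model:12s} {m:.3f} ± {s:.3f}")
```

Reporting the fold spread alongside the mean is what lets readers judge whether a "win" on an endpoint exceeds fold-to-fold noise, which is the crux of the referee's request.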
Circularity Check
No circularity: purely empirical benchmark with direct data-driven comparisons
full rationale
The paper reports model performance rankings across 22 endpoints under a fixed structure-similarity-separated 5-fold CV protocol. No derivations, equations, or fitted parameters are present that could reduce to self-definition or prediction-by-construction. Central claims rest on tabulated win counts (RF/ECFP4 wins 10, GNNs win 9, sequence models win 3) obtained from explicit held-out evaluations, not from any ansatz, uniqueness theorem, or self-citation chain. The validation design is stated explicitly and applied uniformly; any potential representation-split alignment is an external methodological concern, not a reduction of the reported results to their own inputs.
Reference graph
Works this paper leans on
- [1] Zhenqin Wu, Bharath Ramsundar, Evan N. Feinberg, Joseph Gomes, Caleb Geniesse, Aneesh S. Pappu, Karl Leswing, and Vijay Pande. MoleculeNet: A benchmark for molecular machine learning. Chemical Science, 9:513–530, 2018. doi:10.1039/C7SC02664A
- [2] Kexin Huang, Tianfan Fu, Wenhao Gao, Yue Zhao, Yusuf Roohani, Jure Leskovec, Connor W. Coley, Cao Xiao, Jimeng Sun, and Marinka Zitnik. Therapeutics Data Commons: Machine learning datasets and tasks for drug discovery and development. In Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks, 2021. URL https://arxiv.org/a...
- [3] Megan Stanley, John F. Bronskill, Krzysztof Maziarz, Henryk Misztela, Julien Lanini, Marwin Segler, Nadine Schneider, and Marc Brockschmidt. FS-Mol: A few-shot learning dataset of molecules. In NeurIPS Datasets and Benchmarks Track, 2021. URL https://openreview.net/forum?id=701FtuyLlAd
- [4] Ana Laura Dias, Latimah Bustillo, and Tiago Rodrigues. Limitations of representation learning in small molecule property prediction. Nature Communications, 14:6394, 2023. doi:10.1038/s41467-023-41967-3. URL https://www.nature.com/articles/s41467-023-41967-3
- [5]
- [6] Gintautas Kamuntavicius, Tanya Paquet, Orestis Bastas, Dainius Salkauskas, Alvaro Prat, Hisham Abdel Aty, Aurimas Pabrinkis, Povilas Norvaisas, and Roy Tal. Benchmarking machine learning in ADMET predictions: The practical impact of feature representations in ligand-based models. Journal of Cheminformatics, 17:108, 2025. doi:10.1186/s13321-025-01041-0
- [7] Seyone Chithrananda, Gabriel Grand, and Bharath Ramsundar. ChemBERTa: Large-scale self-supervised pretraining for molecular property prediction, 2020. URL https://arxiv.org/abs/2010.09885
- [8] Walid Ahmad, Eric Simon, Seyone Chithrananda, Gabriel Grand, and Bharath Ramsundar. ChemBERTa-2: Towards chemical foundation models, 2022. URL https://arxiv.org/abs/2209.01712
- [9] Jerret Ross, Brian Belgodere, Vijil Chenthamarakshan, Inkit Padhi, Youssef Mroueh, and Payel Das. Large-scale chemical language representations capture molecular structure and properties. Nature Machine Intelligence, 2022. URL https://www.nature.com/articles/s42256-022-00580-7
- [10] Raymond R. Tice, Christopher P. Austin, Robert J. Kavlock, and John R. Bucher. Tox21 challenge to build predictive models of nuclear receptor and stress response pathways as mediated by exposure to environmental chemicals and drugs. Frontiers in Environmental Science.
- [11] URL https://www.frontiersin.org/journals/environmental-science/articles/10.3389/fenvs.2015.00085/full
- [12] Andreas Mayr, Günter Klambauer, Thomas Unterthiner, and Sepp Hochreiter. QSAR modeling of Tox21 challenge stress response and nuclear receptor signaling toxicity assays. Frontiers in Environmental Science, 2016. URL https://www.frontiersin.org/articles/10.3389/fenvs.2016.00003/full
- [13] Poonam Chitale, Alexander D. Lemenze, Emily C. Fogarty, Avi Shah, Courtney Grady, Aubrey R. Odom-Mabey, W. Evan Johnson, Jason H. Yang, A. Murat Eren, Roland Brosch, Pradeep Kumar, and David Alland. A comprehensive update to the Mycobacterium tuberculosis H37Rv reference genome. Nature Communications, 13:7068, 2022. doi:10.1038/s41467-022-34853-x
- [14] Francisco Martínez-Jiménez, George Papadatos, Li Yang, Iain M. Wallace, Vineet Kumar, Ursula Pieper, Andrej Sali, Jeremy R. Brown, John P. Overington, and Marc A. Marti-Renom. Target prediction for an open access set of compounds active against Mycobacterium tuberculosis. PLoS Computational Biology, 9(10):e1003253, 2013. doi:10.1371/journal.pcbi.1003253
- [15] Thomas Lane, Daniel P. Russo, Kimberley M. Zorn, Alex M. Clark, Alexandru Korotcov, Valery Tkachenko, Robert C. Reynolds, Alexander L. Perryman, Joel S. Freundlich, and Sean Ekins. Comparing and validating machine learning models for Mycobacterium tuberculosis drug discovery. Molecular Pharmaceutics, 15(10):4346–4360, 2018. doi:10.1021/acs.molpharmaceu...
- [16] Shiroh Iwanaga, Rie Kubota, Tsubasa Nishi, Sumalee Kamchonwongpaisan, Somdet Srichairatanakool, Naoaki Shinzawa, Din Syafruddin, Masao Yuda, and Chairat Uthaipibull. Genome-wide functional screening of drug-resistance genes in Plasmodium falciparum. Nature Communications, 13:6163, 2022. doi:10.1038/s41467-022-33804-w
- [17] Leo Breiman. Random forests. Machine Learning, 45(1):5–32, 2001. doi:10.1023/A:1010933404324
- [18] Pierre Geurts, Damien Ernst, and Louis Wehenkel. Extremely randomized trees. Machine Learning, 63(1):3–42, 2006. doi:10.1007/s10994-006-6226-1
- [19] Jerome H. Friedman. Greedy function approximation: A gradient boosting machine. The Annals of Statistics, 29(5):1189–1232, 2001. doi:10.1214/aos/1013203451
- [20] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine Learning, 20:273–297, 1995. doi:10.1007/BF00994018
- [21] Tianqi Chen and Carlos Guestrin. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 785–794, 2016. doi:10.1145/2939672.2939785
- [22] David Rogers and Mathew Hahn. Extended-connectivity fingerprints. Journal of Chemical Information and Modeling, 50(5):742–754, 2010. doi:10.1021/ci100050t
- [23] Sereina Riniker and Gregory A. Landrum. Similarity maps – a visualization strategy for molecular fingerprints and machine-learning methods. Journal of Cheminformatics, 5:43, 2013. doi:10.1186/1758-2946-5-43
- [24] Justin Gilmer, Samuel S. Schoenholz, Patrick F. Riley, Oriol Vinyals, and George E. Dahl. Neural message passing for quantum chemistry. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 1263–1272, 2017. URL https://proceedings.mlr.press/v70/gilmer17a.html
- [25] Xiaomin Fang, Lihang Liu, et al. Geometric deep learning for molecular property prediction: A review. Nature Machine Intelligence, 2023
- [26] Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations, 2017. URL https://arxiv.org/abs/1609.02907
- [27] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. Graph attention networks. In International Conference on Learning Representations, 2018. URL https://arxiv.org/abs/1710.10903
- [28] Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are graph neural networks? In International Conference on Learning Representations, 2019. URL https://arxiv.org/abs/1810.00826
- [29] Jinjiang Guo, Qi Liu, Han Guo, and Xi Lu. Ligandformer: A graph neural network for predicting compound property with robust interpretation, 2022. URL https://arxiv.org/abs/2202.10873
- [30] Taicheng Guo, Kehan Guo, Bozhao Nan, Zixing Liang, Zhichun Guo, Nitesh V. Chawla, Olaf Wiest, and Xiangliang Zhang. What can large language models do in chemistry? A comprehensive benchmark on eight tasks. arXiv preprint arXiv:2305.18365, 2023. doi:10.48550/arXiv.2305.18365
- [31] David Weininger. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. Journal of Chemical Information and Computer Sciences, 28(1):31–36, 1988. doi:10.1021/ci00057a005
- [32] Guy W. Bemis and Mark A. Murcko. The properties of known drugs. 1. Molecular frameworks. Journal of Medicinal Chemistry, 39(15):2887–2893, 1996. doi:10.1021/jm9602928
- [33] Alexander Tropsha. Best practices for QSAR model development, validation, and exploitation. Molecular Informatics, 29(6–7):476–488, 2010. doi:10.1002/minf.201000061
- [34] José Jiménez-Luna, Francesca Grisoni, and Gisbert Schneider. Drug discovery with explainable artificial intelligence. Nature Machine Intelligence, 2:573–584, 2020. doi:10.1038/s42256-020-00236-4