Closed-loop Auto Research for Molecular Property Prediction: Discovering and Certifying Generalizable Improvements

Chenyan Xiong; Guolin Ke; Jingjie Ning; Ji Zeng; Xiaochuan Li

arxiv: 2606.22731 · v1 · pith:PK2BUOS4new · submitted 2026-06-22 · 💻 cs.AI · cs.MA

Pith reviewed 2026-06-26 09:10 UTC · model grok-4.3

classification 💻 cs.AI cs.MA

keywords molecular property prediction · language model agents · automated research · held-out certification · external data acquisition · feature and model search · benchmark evaluation

The pith

A routed pipeline selects each molecular endpoint's best validation axis and delivers positive held-out gains of 0.011 to 0.042 across three benchmark suites.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines whether language-model agents that edit molecular features, model code, and external data can produce improvements that survive evaluation on test labels the search process never sees. It isolates three search axes under a file-level ablation that credits each gain to one change, runs the process on 36 endpoints from TDC, Polaris, and MoleculeNet, and then measures the chosen configurations on held-out tests. A pipeline that routes every endpoint to its strongest validation axis records small positive test gains, while single-axis searches sometimes collapse and a standard AutoML baseline fails to match the agent's model edits. The work isolates the concrete lesson that discovery on a validation proxy must be followed by separate certification on unseen labels.

Core claim

A routed pipeline that assigns each endpoint to its best validation axis (features, models, or external evidence) produces held-out test gains of 0.013 on TDC, 0.011 on Polaris, and 0.042 on MoleculeNet; the transferable axis varies by suite, model-search gains drop from 0.041 on validation to 0.003 on test, curated external data can lift specific endpoints such as CYP2C9-substrate by 0.17 when passed through an overlap filter, and an AutoML control reaches only 0.006 against the agent's 0.042.
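The routing step described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the axis names, the `route_endpoint` and `routed_test_gain` helpers, the fallback to the baseline when no axis has a positive validation gain, and the example numbers are all hypothetical. The one property it is meant to show is that test gains are read only after the axis has been chosen from validation gains.

```python
from typing import Dict

# Hypothetical axis labels; the paper names the axes but not these identifiers.
AXES = ("features", "models", "external_evidence")


def route_endpoint(val_gain: Dict[str, float]) -> str:
    """Pick the axis with the largest validation gain.

    Assumption: if no axis improves on validation, keep the baseline.
    """
    best_axis = max(AXES, key=lambda axis: val_gain.get(axis, float("-inf")))
    if val_gain.get(best_axis, float("-inf")) > 0.0:
        return best_axis
    return "baseline"


def routed_test_gain(
    val_gains: Dict[str, Dict[str, float]],
    test_gains: Dict[str, Dict[str, float]],
) -> float:
    """Mean held-out gain when every endpoint is routed on validation only."""
    if not val_gains:
        raise ValueError("no endpoints to route")
    total = 0.0
    for endpoint, gains in val_gains.items():
        axis = route_endpoint(gains)
        # The baseline contributes zero gain by definition.
        if axis != "baseline":
            total += test_gains[endpoint][axis]
    return total / len(val_gains)


# Made-up numbers for two hypothetical endpoints.
example_val = {
    "endpoint_a": {"features": 0.01, "models": 0.04, "external_evidence": 0.02},
    "endpoint_b": {"features": -0.01, "models": -0.02, "external_evidence": -0.005},
}
example_test = {
    "endpoint_a": {"features": 0.0, "models": 0.01, "external_evidence": 0.03},
    "endpoint_b": {"features": 0.02, "models": 0.0, "external_evidence": 0.01},
}
example_mean_gain = routed_test_gain(example_val, example_test)
```

In the made-up example, `endpoint_a` is routed to `models` even though `external_evidence` would have scored higher on test. That is the price of never consulting test labels during selection.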

What carries the argument

The routed pipeline with file-level ablation lock that attributes each performance change to exactly one axis (features, models, or external evidence) over a fixed baseline.
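One way to picture a file-level ablation lock is as a check that each trial's changed files all belong to a single axis. The sketch below assumes a hypothetical repository layout in which the top-level directory determines the axis. The directory names and the `axis_of_change` helper are illustrative, and the paper's actual lock definition is not given in the text reviewed here.

```python
from pathlib import PurePosixPath
from typing import Iterable

# Hypothetical mapping from top-level directory to search axis.
AXIS_OF_DIR = {
    "features": "features",
    "models": "models",
    "external_data": "external_evidence",
}


def axis_of_change(changed_files: Iterable[str]) -> str:
    """Return the single axis a trial touched, or raise if the lock is violated."""
    axes = set()
    for path in changed_files:
        parts = PurePosixPath(path).parts
        top = parts[0] if parts else ""
        if top not in AXIS_OF_DIR:
            raise ValueError(f"{path!r} is outside every editable axis")
        axes.add(AXIS_OF_DIR[top])
    if len(axes) != 1:
        raise ValueError(f"a trial must touch exactly one axis, got {sorted(axes)}")
    return axes.pop()
```

Under this layout, a trial that edits `models/gbm.py` and `models/ensemble.py` is credited to the model axis, while a trial that also edits `features/fingerprints.py` is rejected, so any gain it produced could not be attributed to one axis.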

If this is right

The axis that transfers differs by benchmark suite: external data on TDC, models on Polaris, and both features and models on MoleculeNet.
Individual searches can produce large validation gains that largely disappear on held-out test labels.
Curated external data improves performance on particular endpoints once the overlap filter is applied.
The language-model agent's code edits outperform a matched automated machine learning control that does not intervene at the source-code level.
The pipeline remains competitive with an 84M-parameter pretrained 3D model on the shared training split.
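The two non-transfer cases quoted in the abstract can be made concrete with simple arithmetic. The sketch below uses only the reported validation and test gains (0.041 to 0.003 for model search, 0.022 to -0.019 for curated data). The `transfer_report` helper and the signature labels "shrinks", "reverses", and "transfers" are illustrative names, not the paper's terminology.

```python
import math


def transfer_report(val_gain: float, test_gain: float) -> dict:
    """Summarise how much of a validation gain survives on held-out test."""
    gap = val_gain - test_gain
    # Fraction of the validation gain retained on test (undefined if no gain).
    retained = test_gain / val_gain if val_gain > 0 else math.nan
    if test_gain < 0:
        signature = "reverses"   # gain turns into a loss on test
    elif test_gain < val_gain:
        signature = "shrinks"    # gain stays non-negative but smaller
    else:
        signature = "transfers"  # test gain matches or exceeds validation
    return {"gap": round(gap, 3), "retained": retained, "signature": signature}


# Values as reported in the abstract.
model_search = transfer_report(val_gain=0.041, test_gain=0.003)
curated_data = transfer_report(val_gain=0.022, test_gain=-0.019)
```

For model search, roughly 7% of the validation gain survives on test. For curated data, the sign flips, which is the stronger of the two failure patterns.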

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same separation between proxy-driven discovery and held-out certification could be applied to automated research loops in other scientific domains that optimize against validation proxies.
Stronger leakage diagnostics beyond file-overlap statistics may be required before external-data gains can be trusted to generalize.
Extending the agent action space to include hyperparameter schedules or multi-task training objectives might change which axis transfers most reliably.

Load-bearing premise

The overlap-based contamination filter is treated as adequate to block leakage when external data files are admitted, even though the paper states the filter is necessary but not sufficient.
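A file-level overlap check of the kind this premise depends on might look like the following. It is a sketch under assumptions: structures are taken to be already reduced to comparable identifier strings (for example canonical SMILES produced upstream), the `max_overlap` threshold is a hypothetical parameter, and the abstract does not specify the paper's exact admission rule. Exact-match overlap is also blind to near-duplicates and shared scaffolds, which is one reason such a filter can be necessary without being sufficient.

```python
from typing import Iterable, List


def overlap_with_test(external_ids: Iterable[str], test_ids: Iterable[str]) -> float:
    """Fraction of test structures that also appear in an external file."""
    test_set = set(test_ids)
    if not test_set:
        return 0.0
    return len(test_set & set(external_ids)) / len(test_set)


def admit_external_file(
    external_ids: Iterable[str],
    test_ids: Iterable[str],
    max_overlap: float = 0.0,  # hypothetical threshold, not taken from the paper
) -> bool:
    """Admit the file only if its overlap with the test set is within the limit."""
    return overlap_with_test(external_ids, test_ids) <= max_overlap


def drop_test_structures(external_ids: Iterable[str], test_ids: Iterable[str]) -> List[str]:
    """Row-level alternative: keep external rows whose structure is not in test."""
    test_set = set(test_ids)
    return [s for s in external_ids if s not in test_set]


# Toy identifiers standing in for canonicalised structures.
toy_test = ["CCO", "CCN", "c1ccccc1", "CC(=O)O"]
leaky_file = ["CCO", "CCN", "c1ccccc1", "CCCl"]
clean_file = ["CCCl", "CCBr"]
```

Here `leaky_file` covers three of the four toy test structures and is rejected, while `clean_file` shares none and is admitted.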

What would settle it

Re-running the identical routed pipeline on a fresh collection of endpoints whose external data sources share zero structures with the test sets and checking whether the reported positive held-out gains remain or vanish.

Original abstract

Closed-loop Auto Research extends automated machine learning from fixed-dataset fitting to changing the research workflow, with language-model agents editing representations and model code and acquiring external evidence. Molecular property prediction spans many small endpoints. We ask whether this action space yields improvements generalizing beyond the validation signal selecting them. We isolate three Auto Research axes, features, models, and external evidence, under a file-level ablation lock attributing each gain to one axis over a strong baseline. Across 36 endpoints in three benchmark suites we score each selected configuration once on a held-out test whose labels the search never read. A routed pipeline taking each endpoint's best validation axis reaches positive held-out gains of 0.013, 0.011, and 0.042, the transferable axis differing by suite, data on TDC, model on Polaris, feature and model on MoleculeNet. The largest model-search gain falls from 0.041 on validation to 0.003 on test, while curated data reaches 0.022 but negative 0.019 on test, two non-transfer signatures. Curated external data raises held-out CYP2C9-substrate performance by 0.17 and half-life by 0.08, admitted through a contamination filter rejecting same-source files overlapping 64 to 89 percent of test structures, necessary but not sufficient for transfer. A matched-trial automated machine learning control did not reproduce the agent's code-level model intervention, reaching 0.006 against 0.042, and the pipeline stays competitive with an 84M-parameter pretrained 3D model on the shared training split. The experiments stay within molecular property prediction, but separating discovery from held-out certification is a domain-agnostic lesson for any closed-loop system optimising a proxy for a held-out quantity.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The closed-loop agent finds modest held-out gains on some axes but external data improvements carry real leakage risk from the filter the authors call insufficient.

The main things to know are that this setup produces routed held-out gains of 0.013, 0.011, and 0.042 across the three suites, with the transferable axis shifting by suite, and that model search and curated data show clear non-transfer signatures like the drop from 0.041 to 0.003.

The new element is the combination of LM agents that edit both code and external data files, followed by file-level ablations that attribute each gain to one axis and then certify on held-out tests the search never reads. They isolate features, models, and external evidence across 36 endpoints, report the non-transfer cases explicitly, and include a matched AutoML control that does not reproduce the agent's code-level result. The larger data gains on CYP2C9-substrate and half-life are also presented with the filter details.

This does a reasonable job of trying to separate discovery from certification and of documenting where the proxy signal fails to predict test performance. The domain-agnostic framing about closed-loop systems is straightforward.

The soft spot is the contamination filter. Rejecting same-source files with 64-89% overlap is a start, but the paper itself states it is necessary but not sufficient. That leaves open the possibility that shared substructures, scaffolds, or assay conditions still leak through and drive the external data gains. The overall effect sizes are small, so even modest leakage would change the interpretation.

This paper is for researchers building or evaluating automated discovery pipelines in chemistry or similar proxy-optimization settings. Readers who care about empirical checks on transfer will find concrete numbers and failure cases worth examining. It deserves a serious referee because the ablation design and the non-transfer results are substantive enough to check in detail, even if the leakage concern needs tightening.

Referee Report

2 major / 2 minor

Summary. The manuscript describes a closed-loop Auto Research system in which language-model agents modify molecular representations, model code, and acquire external data for property prediction tasks. Using an ablation lock to attribute gains to specific axes (features, models, external evidence) and evaluating on held-out tests across 36 endpoints in TDC, Polaris, and MoleculeNet, it reports that a routed selection of the best validation axis yields positive test gains of 0.013 (TDC, data), 0.011 (Polaris, model), and 0.042 (MoleculeNet, feature/model). Some interventions fail to transfer (e.g., model-search gain drops from 0.041 validation to 0.003 test; curated data from 0.022 to -0.019), and external data improves specific endpoints (CYP2C9 by 0.17, half-life by 0.08) after a contamination filter. A matched AutoML control fails to reproduce the agent's model intervention.

Significance. If the held-out gains prove robust, the work supplies a useful case study of separating discovery from certification in automated workflows, with explicit non-transfer examples and an AutoML control providing informative negative results. The domain-agnostic emphasis on proxy optimization versus held-out evaluation is a constructive contribution to closed-loop AutoML research.

major comments (2)

[External evidence results] Abstract and corresponding results section: The held-out gains from curated external data (0.17 on CYP2C9-substrate, 0.08 on half-life) are central to demonstrating value in the external-evidence axis. However, the manuscript explicitly states that the contamination filter (rejecting same-source files with 64–89% test-structure overlap) is “necessary but not sufficient for transfer.” This directly raises the possibility that residual leakage (shared substructures, scaffolds, or assay conditions) accounts for the improvements rather than the Auto Research process, weakening the attribution of generalizable gains.
[Methods] Data splits, ablation lock, and filter implementation: The central transfer claims rest on the precise definition of the file-level ablation lock, the exact computation of the 64–89% overlap threshold, and the full list of admitted external sources. The provided text does not supply these details or the code, making it impossible to confirm that post-hoc choices or undetected leakage do not affect the reported positive held-out gains of 0.013/0.011/0.042.

minor comments (2)

[Abstract] Explicitly map the three gains (0.013, 0.011, 0.042) to the three suites and their transferable axes for immediate clarity.
[Abstract] The statement that “the pipeline stays competitive with an 84M-parameter pretrained 3D model” should report the exact metric value and training-split comparison to allow direct evaluation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and for identifying the need to strengthen claims around external data transfer and to improve methodological transparency. We respond to each major comment below.

Referee: [External evidence results] Abstract and corresponding results section: The held-out gains from curated external data (0.17 on CYP2C9-substrate, 0.08 on half-life) are central to demonstrating value in the external-evidence axis. However, the manuscript explicitly states that the contamination filter (rejecting same-source files with 64–89% test-structure overlap) is “necessary but not sufficient for transfer.” This directly raises the possibility that residual leakage (shared substructures, scaffolds, or assay conditions) accounts for the improvements rather than the Auto Research process, weakening the attribution of generalizable gains.

Authors: We agree that residual leakage via shared substructures or assay conditions cannot be ruled out by the file-level filter alone, and that this limits strong attribution of the 0.17 and 0.08 gains specifically to the Auto Research process. The manuscript already qualifies the filter as “necessary but not sufficient,” and the non-transfer results on other axes are presented precisely to illustrate the difficulty of generalization. In revision we will add an explicit limitations paragraph on this point, temper the abstract language around the external-evidence gains, and include additional post-hoc checks (scaffold overlap statistics and a random-substructure ablation) if they can be completed without new data access. revision: yes
Referee: [Methods] Data splits, ablation lock, and filter implementation: The central transfer claims rest on the precise definition of the file-level ablation lock, the exact computation of the 64–89% overlap threshold, and the full list of admitted external sources. The provided text does not supply these details or the code, making it impossible to confirm that post-hoc choices or undetected leakage do not affect the reported positive held-out gains of 0.013/0.011/0.042.

Authors: The referee is correct that the current text omits the exact definition of the file-level ablation lock, the overlap-threshold algorithm, and the enumerated external sources. These omissions prevent independent verification. We will expand the Methods section with pseudocode for the ablation lock and filter, the precise overlap metric (Tanimoto on Morgan fingerprints at radius 2), the 64–89 % range derivation, and the list of admitted sources. The full implementation will be released with the camera-ready version. revision: yes
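The overlap metric named in the simulated response, Tanimoto similarity on Morgan fingerprints, reduces to a set computation once each molecule is represented by the set of its on-bit indices. The sketch below assumes fingerprints were generated elsewhere by a cheminformatics toolkit and arrive as sets of integers. The `near_duplicate_fraction` helper and its 0.9 threshold are hypothetical, and because the rebuttal is simulated, none of this should be read as the authors' confirmed procedure.

```python
from typing import FrozenSet, Sequence

# A fingerprint is modelled as the set of its on-bit indices (assumption).
Fingerprint = FrozenSet[int]


def tanimoto(a: Fingerprint, b: Fingerprint) -> float:
    """Tanimoto (Jaccard) similarity: shared on-bits over total distinct on-bits."""
    union = len(a | b)
    # Convention chosen here: two empty fingerprints have similarity 0.
    return len(a & b) / union if union else 0.0


def near_duplicate_fraction(
    test_fps: Sequence[Fingerprint],
    external_fps: Sequence[Fingerprint],
    threshold: float = 0.9,  # hypothetical cutoff
) -> float:
    """Fraction of test molecules with an external neighbour at or above threshold."""
    if not test_fps:
        return 0.0
    hits = sum(
        1
        for t in test_fps
        if any(tanimoto(t, e) >= threshold for e in external_fps)
    )
    return hits / len(test_fps)


# Toy fingerprints; real ones would come from a Morgan generator at radius 2.
toy_test_fps = [frozenset({1, 2, 3}), frozenset({7, 8})]
toy_external_fps = [frozenset({1, 2, 3}), frozenset({9})]
```

A similarity-based check of this kind would catch near-duplicates that an exact-match filter misses, which bears directly on the referee's concern about residual leakage.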

Circularity Check

0 steps flagged

No significant circularity; held-out certification is independent

The paper's central results rest on a held-out test set whose labels are never accessed during the search or axis selection process, with explicit reporting of non-transfer cases (e.g., model-search gain dropping from 0.041 to 0.003, curated data from 0.022 to -0.019). The routed pipeline selects on validation but certifies on unseen test data, and the contamination filter is openly described as 'necessary but not sufficient.' No equations, self-definitional reductions, fitted inputs renamed as predictions, or load-bearing self-citations appear in the derivation. The evaluation chain is self-contained against external benchmarks and does not reduce to its inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The work relies on standard supervised learning assumptions and the unstated premise that the LM agent's code edits are causally responsible for observed differences versus the matched AutoML control.

Reference graph

Works this paper leans on

44 extracted references · 15 canonical work pages

[1]

Auto Research with Specialist Agents Develops Effective and Non-Trivial Training Recipes

Ning J, Li X, Zeng J, Kang H, Xiong C. Auto Research with Specialist Agents Develops Effective and Non-Trivial Training Recipes. arXiv preprint. 2026;arXiv:2605.05724

Pith/arXiv arXiv 2026
[2]

ADMET property prediction through combinations of molecular fingerprints

Notwell JH, Wood MW. ADMET property prediction through combinations of molecular fingerprints. arXiv preprint. 2023;arXiv:2310.00174

arXiv 2023
[3]

Therapeutics Data Commons: Machine Learning Datasets and Tasks for Drug Discovery and Development

Huang K, Fu T, Gao W, Zhao Y, Roohani Y, Leskovec J, et al. Therapeutics Data Commons: Machine Learning Datasets and Tasks for Drug Discovery and Development. In: Proceedings of the Neural Information Processing Systems (NeurIPS) Track on Datasets and Benchmarks; 2021.

2021
[4]

Artificial intelligence foundation for therapeutic science

Huang K, Fu T, Gao W, Zhao Y, Roohani Y, Leskovec J, et al. Artificial intelligence foundation for therapeutic science. Nature Chemical Biology. 2022;18:1033–1036. https://doi.org/10.1038/s41589-022-01131-2

work page doi:10.1038/s41589-022-01131-2 2022
[5]

MoleculeNet: a benchmark for molecular machine learning

Wu Z, Ramsundar B, Feinberg EN, Gomes J, Geniesse C, Pappu AS, et al. MoleculeNet: a benchmark for molecular machine learning. Chemical Science. 2018;9(2):513–530. https://doi.org/10.1039/C7SC02664A

work page doi:10.1039/c7sc02664a 2018
[6]

Prospective Validation of Machine Learning Algorithms for Absorption, Distribution, Metabolism, and Excretion Prediction: An Industrial Perspective

Fang C, Wang Y, Grater R, Kapadnis S, Black C, Trapa P, et al. Prospective Validation of Machine Learning Algorithms for Absorption, Distribution, Metabolism, and Excretion Prediction: An Industrial Perspective. Journal of Chemical Information and Modeling. 2023;63(11):3263–3274. https://doi.org/10.1021/acs.jcim.3c00160

work page doi:10.1021/acs.jcim.3c00160 2023
[7]

Biogen adme-fang-v1

Polaris. Biogen adme-fang-v1. Accessed 2026. https://polarishub.io/datasets/biogen/adme-fang-v1

2026
[8]

ChemBERTa: Large-Scale Self-Supervised Pretraining for Molecular Property Prediction

Chithrananda S, Grand G, Ramsundar B. ChemBERTa: Large-Scale Self-Supervised Pretraining for Molecular Property Prediction. arXiv preprint. 2020;arXiv:2010.09885

arXiv 2020
[9]

Self-Supervised Graph Transformer on Large-Scale Molecular Data

Rong Y, Bian Y, Xu T, Xie W, Wei Y, Huang W, et al. Self-Supervised Graph Transformer on Large-Scale Molecular Data. In: Advances in Neural Information Processing Systems (NeurIPS). vol. 33; 2020

2020
[10]

Molecular contrastive learning of representations via graph neural networks

Wang Y, Wang J, Cao Z, Farimani AB. Molecular contrastive learning of representations via graph neural networks. Nature Machine Intelligence. 2022;4:279–287. https://doi.org/10.1038/s42256-022-00447-x

work page doi:10.1038/s42256-022-00447-x 2022
[11]

Uni-Mol: A Universal 3D Molecular Representation Learning Framework

Zhou G, Gao Z, Ding Q, Zheng H, Xu H, Wei Z, et al. Uni-Mol: A Universal 3D Molecular Representation Learning Framework. In: International Conference on Learning Representations (ICLR); 2023

2023
[12]

CatBoost: unbiased boosting with categorical features

Prokhorenkova L, Gusev G, Vorobev A, Dorogush AV, Gulin A. CatBoost: unbiased boosting with categorical features. In: Advances in Neural Information Processing Systems (NeurIPS); 2018

2018
[13]

Extended-Connectivity Fingerprints

Rogers D, Hahn M. Extended-Connectivity Fingerprints. Journal of Chemical Information and Modeling. 2010;50(5):742–754. https://doi.org/10.1021/ci100050t

2010
[14]

QSAR: How Good Is It in Practice? Comparison of Descriptor Sets on an Unbiased Cross Section of Corporate Data Sets

Gedeck P, Rohde B, Bartels C. QSAR: How Good Is It in Practice? Comparison of Descriptor Sets on an Unbiased Cross Section of Corporate Data Sets. Journal of Chemical Information and Modeling. 2006;46(5):1924–1936. https://doi.org/10.1021/ci050423u

work page doi:10.1021/ci050423u 2006
[15]

ErG: 2D Pharmacophore Descriptions for Scaffold Hopping

Stiefl N, Watson IA, Baumann K, Zaliani A. ErG: 2D Pharmacophore Descriptions for Scaffold Hopping. Journal of Chemical Information and Modeling. 2006;46(1):208–220. https://doi.org/10.1021/ci050457y

work page doi:10.1021/ci050457y 2006
[16]

Augmenting large language models with chemistry tools

Bran AM, Cox S, Schilter O, Baldassari C, White AD, Schwaller P. Augmenting large language models with chemistry tools. Nature Machine Intelligence. 2024;6:525–535. https://doi.org/10.1038/s42256-024-00832-8

work page doi:10.1038/s42256-024-00832-8 2024
[17]

Autonomous chemical research with large language models

Boiko DA, MacKnight R, Kline B, Gomes G. Autonomous chemical research with large language models. Nature. 2023;624:570–578. https://doi.org/10.1038/s41586-023-06792-0

2023
[18]

The Virtual Lab of AI agents designs new SARS-CoV-2 nanobodies

Swanson K, Wu W, Bulaong NL, Pak JE, Zou J. The Virtual Lab of AI agents designs new SARS-CoV-2 nanobodies. Nature. 2025;646:716–723. https://doi.org/10.1038/s41586-025-09442-9

work page doi:10.1038/s41586-025-09442-9 2025
[19]

Accelerating scientific discovery with Co-Scientist

Gottweis J, Weng WH, Daryin A, et al. Accelerating scientific discovery with Co-Scientist. Nature. 2026. https://doi.org/10.1038/s41586-026-10644-y

work page doi:10.1038/s41586-026-10644-y 2026
[20]

DrugAgent: Automating AI-aided Drug Discovery Programming through LLM Multi-Agent Collaboration

Liu S, Lu Y, Chen S, Hu X, Zhao J, Lu Y, et al. DrugAgent: Automating AI-aided Drug Discovery Programming through LLM Multi-Agent Collaboration. arXiv preprint. 2024;arXiv:2411.15692

arXiv 2024
[21]

MolAgent: Biomolecular Property Estimation in the Agentic Era

Gómez-Tamayo JC, Tavernier J, Aerts R, Dyubankova N, Van Rompaey D, Menon S, et al. MolAgent: Biomolecular Property Estimation in the Agentic Era. Journal of Chemical Information and Modeling. 2025;65(20):10808–10818. https://doi.org/10.1021/acs.jcim.5c01938

work page doi:10.1021/acs.jcim.5c01938 2025
[22]

Large language models for scientific discovery in molecular property prediction

Zheng Y, Koh HY, Ju J, Nguyen ATN, May LT, Webb GI, et al. Large language models for scientific discovery in molecular property prediction. Nature Machine Intelligence. 2025;7(3):437–447. https://doi.org/10.1038/s42256-025-00994-z

work page doi:10.1038/s42256-025-00994-z 2025
[23]

MLAgentBench: Evaluating Language Agents on Machine Learning Experimentation

Huang Q, Vora J, Liang P, Leskovec J. MLAgentBench: Evaluating Language Agents on Machine Learning Experimentation. In: Proceedings of the 41st International Conference on Machine Learning. vol. 235 of Proceedings of Machine Learning Research; 2024. p. 20271–20309

2024
[24]

MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering

Chan JS, Chowdhury N, Jaffe O, Aung J, Sherburn D, Mays E, et al. MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering. In: International Conference on Learning Representations (ICLR); 2025

2025
[25]

AIDE: AI-Driven Exploration in the Space of Code

Jiang Z, Schmidt D, Srikanth D, Xu D, Kaplan I, Jacenko D, et al. AIDE: AI-Driven Exploration in the Space of Code. arXiv preprint. 2025;arXiv:2502.13138

Pith/arXiv arXiv 2025
[26]

Towards end-to-end automation of AI research

Lu C, Lu C, Lange RT, et al. Towards end-to-end automation of AI research. Nature. 2026;651:914–919. https://doi.org/10.1038/s41586-026-10265-5

work page doi:10.1038/s41586-026-10265-5 2026
[27]

The AI Scientist-v2: Workshop-Level Automated Scientific Discovery via Agentic Tree Search

Yamada Y, Akiba T, et al. The AI Scientist-v2: Workshop-Level Automated Scientific Discovery via Agentic Tree Search. arXiv preprint. 2025;arXiv:2504.08066

Pith/arXiv arXiv 2025
[28]

FIRE-Bench: Evaluating Agents on the Rediscovery of Scientific Insights

Wang Z, Zhang X, Goyal A, Pratt S, Ji J, Wu J, et al. FIRE-Bench: Evaluating Agents on the Rediscovery of Scientific Insights. arXiv preprint. 2026;arXiv:2602.02905

arXiv 2026
[29]

ResearchGym: Evaluating Language Model Agents on Real-World AI Research

Garikaparthi R, Charkhgard H, Asawa P, Deshpande C, Wang K, Hsu CCY, et al. ResearchGym: Evaluating Language Model Agents on Real-World AI Research. arXiv preprint. 2026;arXiv:2602.15112

arXiv 2026
[30]

ResearchClawBench: Benchmarking Autonomous Agents on End-to-End Paper-Level Research Tasks

Xu M, Yang Y, Li Y, Huang Y, Lin X, Du SS, et al. ResearchClawBench: Benchmarking Autonomous Agents on End-to-End Paper-Level Research Tasks. arXiv preprint. 2026;arXiv:2606.07591

Pith/arXiv arXiv 2026
[31]

SciAgentArena: Benchmarking Multi-Domain Scientific Agents Across Scales

Liu S, Ma S, Zhang H, Yin Y, Zhao Y, Dai J, et al. SciAgentArena: Benchmarking Multi-Domain Scientific Agents Across Scales. arXiv preprint. 2026;arXiv:2606.12736

Pith/arXiv arXiv 2026
[32]

Practical Bayesian Optimization of Machine Learning Algorithms

Snoek J, Larochelle H, Adams RP. Practical Bayesian Optimization of Machine Learning Algorithms. In: Advances in Neural Information Processing Systems. vol. 25; 2012. p. 2951–2959

2012
[33]

Auto-WEKA: Combined Selection and Hyperparameter Optimization of Classification Algorithms

Thornton C, Hutter F, Hoos HH, Leyton-Brown K. Auto-WEKA: Combined Selection and Hyperparameter Optimization of Classification Algorithms. In: Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; 2013. p. 847–855

2013
[34]

Auto-ADMET: An Effective and Interpretable AutoML Method for Chemical ADMET Property Prediction

de Sá AGC, Ascher DB. Auto-ADMET: An Effective and Interpretable AutoML Method for Chemical ADMET Property Prediction. arXiv preprint. 2025;arXiv:2502.16378

arXiv 2025
[35]

On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation

Cawley GC, Talbot NLC. On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation. Journal of Machine Learning Research. 2010;11:2079–2107

2010
[36]

Preserving Statistical Validity in Adaptive Data Analysis

Dwork C, Feldman V, Hardt M, Pitassi T, Reingold O, Roth A. Preserving Statistical Validity in Adaptive Data Analysis. In: Proceedings of the 47th Annual ACM Symposium on Theory of Computing; 2015. p. 117–126

2015
[37]

The Reusable Holdout: Preserving Validity in Adaptive Data Analysis

Dwork C, Feldman V, Hardt M, Pitassi T, Reingold O, Roth A. The Reusable Holdout: Preserving Validity in Adaptive Data Analysis. Science. 2015;349(6248):636–638. https://doi.org/10.1126/science.aaa9375

work page doi:10.1126/science.aaa9375 2015
[38]

Lo-Hi: Practical ML Drug Discovery Benchmark

Steshin S. Lo-Hi: Practical ML Drug Discovery Benchmark. In: Advances in Neural Information Processing Systems (NeurIPS) Datasets and Benchmarks Track.
[39]

Data splitting to avoid information leakage with DataSAIL

Joeres R, Blumenthal DB, Kalinina OV. Data splitting to avoid information leakage with DataSAIL. Nature Communications. 2025;16:3337. https://doi.org/10.1038/s41467-025-58606-8

work page doi:10.1038/s41467-025-58606-8 2025
[40]

Leakage and the Reproducibility Crisis in Machine-Learning-Based Science

Kapoor S, Narayanan A. Leakage and the Reproducibility Crisis in Machine-Learning-Based Science. Patterns. 2023;4(9):100804. https://doi.org/10.1016/j.patter.2023.100804

work page doi:10.1016/j.patter.2023.100804 2023
[41]

XGBoost: A Scalable Tree Boosting System

Chen T, Guestrin C. XGBoost: A Scalable Tree Boosting System. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; 2016. p. 785–794

2016
[42]

LightGBM: A Highly Efficient Gradient Boosting Decision Tree

Ke G, Meng Q, Finley T, Wang T, Chen W, Ma W, et al. LightGBM: A Highly Efficient Gradient Boosting Decision Tree. In: Advances in Neural Information Processing Systems. vol. 30; 2017

2017
[43]

FLAML: A Fast and Lightweight AutoML Library

Wang C, Wu Q, Weimer M, Zhu E. FLAML: A Fast and Lightweight AutoML Library. arXiv preprint. 2019;arXiv:1911.04706

arXiv 2019
[44]

RDKit: Open-source cheminformatics

Landrum G, et al. RDKit: Open-source cheminformatics. Accessed 2026. https://www.rdkit.org

2026

S1 Search trajectories. Figure S1 reports the best-so-far aggregate normalised improvement on the TDC ADMET suite for each isolated axis across the budget of one hundred trials per axis, with one additional trial on the feature axis. The curve for each axis is ...

[1] [1]

Auto Research with Specialist Agents Develops Effective and Non-Trivial Training Recipes

Ning J, Li X, Zeng J, Kang H, Xiong C. Auto Research with Specialist Agents Develops Effective and Non-Trivial Training Recipes. arXiv preprint. 2026;arXiv:2605.05724

Pith/arXiv arXiv 2026

[2] [2]

ADMET property prediction through combinations of molecular fingerprints

Notwell JH, Wood MW. ADMET property prediction through combinations of molecular fingerprints. arXiv preprint. 2023;arXiv:2310.00174

arXiv 2023

[3] [3]

Therapeutics Data Commons: Machine Learning Datasets and Tasks for Drug Discovery and Development

Huang K, Fu T, Gao W, Zhao Y, Roohani Y, Leskovec J, et al. Therapeutics Data Commons: Machine Learning Datasets and Tasks for Drug Discovery and Development. In: Proceedings of the Neural Information Processing Systems (NeurIPS) Track on Datasets and Benchmarks; 2021. . 18

2021

[4] [4]

Artifi- cial intelligence foundation for therapeutic science

Huang K, Fu T, Gao W, Zhao Y, Roohani Y, Leskovec J, et al. Artifi- cial intelligence foundation for therapeutic science. Nature Chemical Biology. 2022;18:1033–1036. https://doi.org/10.1038/s41589-022-01131-2

work page doi:10.1038/s41589-022-01131-2 2022

[5] [5]

N.; Gomes, J.; Geniesse, C.; Pappu, A

Wu Z, Ramsundar B, Feinberg EN, Gomes J, Geniesse C, Pappu AS, et al. MoleculeNet: a benchmark for molecular machine learning. Chemical Science. 2018;9(2):513–530. https://doi.org/10.1039/C7SC02664A

work page doi:10.1039/c7sc02664a 2018

[6] [6]

Prospective Vali- dation of Machine Learning Algorithms for Absorption, Distribution, Metabolism, and Excretion Prediction: An Industrial Perspective

Fang C, Wang Y, Grater R, Kapadnis S, Black C, Trapa P, et al. Prospective Vali- dation of Machine Learning Algorithms for Absorption, Distribution, Metabolism, and Excretion Prediction: An Industrial Perspective. Journal of Chemical Infor- mation and Modeling. 2023;63(11):3263–3274. https://doi.org/10.1021/acs.jcim. 3c00160

work page doi:10.1021/acs.jcim 2023

[7] [7]

Accessed 2026

Polaris.: Biogen adme-fang-v1. Accessed 2026. https://polarishub.io/datasets/ biogen/adme-fang-v1

2026

[8] [8]

ChemBERTa: Large-Scale Self- Supervised Pretraining for Molecular Property Prediction

Chithrananda S, Grand G, Ramsundar B. ChemBERTa: Large-Scale Self- Supervised Pretraining for Molecular Property Prediction. arXiv preprint. 2020;arXiv:2010.09885

arXiv 2020

[9] [9]

Self-Supervised Graph Transformer on Large-Scale Molecular Data

Rong Y, Bian Y, Xu T, Xie W, Wei Y, Huang W, et al. Self-Supervised Graph Transformer on Large-Scale Molecular Data. In: Advances in Neural Information Processing Systems (NeurIPS). vol. 33; 2020

2020

[10] [10]

Molecular contrastive learning of representations via graph neural networks , url =

Wang Y, Wang J, Cao Z, Farimani AB. Molecular contrastive learning of represen- tations via graph neural networks. Nature Machine Intelligence. 2022;4:279–287. https://doi.org/10.1038/s42256-022-00447-x

work page doi:10.1038/s42256-022-00447-x 2022

[11] [11]

Uni-Mol: A Universal 3D Molecular Representation Learning Framework

Zhou G, Gao Z, Ding Q, Zheng H, Xu H, Wei Z, et al. Uni-Mol: A Universal 3D Molecular Representation Learning Framework. In: International Conference on Learning Representations (ICLR); 2023

2023

[12] [12]

CatBoost: unbi- ased boosting with categorical features

Prokhorenkova L, Gusev G, Vorobev A, Dorogush AV, Gulin A. CatBoost: unbi- ased boosting with categorical features. In: Advances in Neural Information Processing Systems (NeurIPS); 2018

2018

[13] [13]

Extended-Connectivity Fingerprints

Rogers D, Hahn M. Extended-Connectivity Fingerprints. Journal of Chem- ical Information and Modeling. 2010;50(5):742–754. https://doi.org/10.1021/ ci100050t

2010

[14] Gedeck P, Rohde B, Bartels C. QSAR: How Good Is It in Practice? Comparison of Descriptor Sets on an Unbiased Cross Section of Corporate Data Sets. Journal of Chemical Information and Modeling. 2006;46(5):1924–1936. https://doi.org/10.1021/ci050423u

[15] Stiefl N, Watson IA, Baumann K, Zaliani A. ErG: 2D Pharmacophore Descriptions for Scaffold Hopping. Journal of Chemical Information and Modeling. 2006;46(1):208–220. https://doi.org/10.1021/ci050457y

[16] Bran AM, Cox S, Schilter O, Baldassari C, White AD, Schwaller P. Augmenting large language models with chemistry tools. Nature Machine Intelligence. 2024;6:525–535. https://doi.org/10.1038/s42256-024-00832-8

[17] Boiko DA, MacKnight R, Kline B, Gomes G. Autonomous chemical research with large language models. Nature. 2023;624:570–578. https://doi.org/10.1038/s41586-023-06792-0

[18] Swanson K, Wu W, Bulaong NL, Pak JE, Zou J. The Virtual Lab of AI agents designs new SARS-CoV-2 nanobodies. Nature. 2025;646:716–723. https://doi.org/10.1038/s41586-025-09442-9

[19] Gottweis J, Weng WH, Daryin A, et al. Accelerating scientific discovery with Co-Scientist. Nature. 2026. https://doi.org/10.1038/s41586-026-10644-y

[20] Liu S, Lu Y, Chen S, Hu X, Zhao J, Lu Y, et al. DrugAgent: Automating AI-aided Drug Discovery Programming through LLM Multi-Agent Collaboration. arXiv preprint. 2024;arXiv:2411.15692

[21] Gómez-Tamayo JC, Tavernier J, Aerts R, Dyubankova N, Van Rompaey D, Menon S, et al. MolAgent: Biomolecular Property Estimation in the Agentic Era. Journal of Chemical Information and Modeling. 2025;65(20):10808–10818. https://doi.org/10.1021/acs.jcim.5c01938

[22] Zheng Y, Koh HY, Ju J, Nguyen ATN, May LT, Webb GI, et al. Large language models for scientific discovery in molecular property prediction. Nature Machine Intelligence. 2025;7(3):437–447. https://doi.org/10.1038/s42256-025-00994-z

[23] Huang Q, Vora J, Liang P, Leskovec J. MLAgentBench: Evaluating Language Agents on Machine Learning Experimentation. In: Proceedings of the 41st International Conference on Machine Learning. vol. 235 of Proceedings of Machine Learning Research; 2024. p. 20271–20309

[24] Chan JS, Chowdhury N, Jaffe O, Aung J, Sherburn D, Mays E, et al. MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering. In: International Conference on Learning Representations (ICLR); 2025

[25] Jiang Z, Schmidt D, Srikanth D, Xu D, Kaplan I, Jacenko D, et al. AIDE: AI-Driven Exploration in the Space of Code. arXiv preprint. 2025;arXiv:2502.13138

[26] Lu C, Lu C, Lange RT, et al. Towards end-to-end automation of AI research. Nature. 2026;651:914–919. https://doi.org/10.1038/s41586-026-10265-5

[27] Yamada Y, Akiba T, et al. The AI Scientist-v2: Workshop-Level Automated Scientific Discovery via Agentic Tree Search. arXiv preprint. 2025;arXiv:2504.08066

[28] Wang Z, Zhang X, Goyal A, Pratt S, Ji J, Wu J, et al. FIRE-Bench: Evaluating Agents on the Rediscovery of Scientific Insights. arXiv preprint. 2026;arXiv:2602.02905

[29] Garikaparthi R, Charkhgard H, Asawa P, Deshpande C, Wang K, Hsu CCY, et al. ResearchGym: Evaluating Language Model Agents on Real-World AI Research. arXiv preprint. 2026;arXiv:2602.15112

[30] Xu M, Yang Y, Li Y, Huang Y, Lin X, Du SS, et al. ResearchClawBench: Benchmarking Autonomous Agents on End-to-End Paper-Level Research Tasks. arXiv preprint. 2026;arXiv:2606.07591

[31] Liu S, Ma S, Zhang H, Yin Y, Zhao Y, Dai J, et al. SciAgentArena: Benchmarking Multi-Domain Scientific Agents Across Scales. arXiv preprint. 2026;arXiv:2606.12736

[32] Snoek J, Larochelle H, Adams RP. Practical Bayesian Optimization of Machine Learning Algorithms. In: Advances in Neural Information Processing Systems. vol. 25; 2012. p. 2951–2959

[33] Thornton C, Hutter F, Hoos HH, Leyton-Brown K. Auto-WEKA: Combined Selection and Hyperparameter Optimization of Classification Algorithms. In: Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; 2013. p. 847–855

[34] de Sá AGC, Ascher DB. Auto-ADMET: An Effective and Interpretable AutoML Method for Chemical ADMET Property Prediction. arXiv preprint. 2025;arXiv:2502.16378

[35] Cawley GC, Talbot NLC. On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation. Journal of Machine Learning Research. 2010;11:2079–2107

[36] Dwork C, Feldman V, Hardt M, Pitassi T, Reingold O, Roth A. Preserving Statistical Validity in Adaptive Data Analysis. In: Proceedings of the 47th Annual ACM Symposium on Theory of Computing; 2015. p. 117–126

[37] Dwork C, Feldman V, Hardt M, Pitassi T, Reingold O, Roth A. The Reusable Holdout: Preserving Validity in Adaptive Data Analysis. Science. 2015;349(6248):636–638. https://doi.org/10.1126/science.aaa9375

[38] Steshin S. Lo-Hi: Practical ML Drug Discovery Benchmark. In: Advances in Neural Information Processing Systems (NeurIPS) Datasets and Benchmarks Track

[39] Joeres R, Blumenthal DB, Kalinina OV. Data splitting to avoid information leakage with DataSAIL. Nature Communications. 2025;16:3337. https://doi.org/10.1038/s41467-025-58606-8

[40] Kapoor S, Narayanan A. Leakage and the Reproducibility Crisis in Machine-Learning-Based Science. Patterns. 2023;4(9):100804. https://doi.org/10.1016/j.patter.2023.100804

[41] Chen T, Guestrin C. XGBoost: A Scalable Tree Boosting System. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; 2016. p. 785–794

[42] Ke G, Meng Q, Finley T, Wang T, Chen W, Ma W, et al. LightGBM: A Highly Efficient Gradient Boosting Decision Tree. In: Advances in Neural Information Processing Systems. vol. 30; 2017

[43] Wang C, Wu Q, Weimer M, Zhu E. FLAML: A Fast and Lightweight AutoML Library. arXiv preprint. 2019;arXiv:1911.04706

[44] Landrum G, et al. RDKit: Open-source cheminformatics. Accessed 2026. https://www.rdkit.org

S1 Search trajectories

Figure S1 reports the best-so-far aggregate normalised improvement on the TDC ADMET suite for each isolated axis across the budget of one hundred trials per axis, with one additional trial on the feature axis. The curve for each axis is ...