Closed-loop Auto Research for Molecular Property Prediction: Discovering and Certifying Generalizable Improvements
Pith reviewed 2026-06-26 09:10 UTC · model grok-4.3
The pith
A routed pipeline selects each molecular endpoint's best validation axis and delivers positive held-out gains of 0.013 to 0.042 across three benchmark suites.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
A routed pipeline that assigns each endpoint to its best validation axis (features, models, or external evidence) produces held-out test gains of 0.013 on TDC, 0.011 on Polaris, and 0.042 on MoleculeNet. The transferable axis varies by suite. The largest model-search gain drops from 0.041 on validation to 0.003 on test. Curated external data, once passed through an overlap filter, can lift specific endpoints such as CYP2C9-substrate by 0.17. An AutoML control reaches only 0.006 against the agent's 0.042.
What carries the argument
The routed pipeline with file-level ablation lock that attributes each performance change to exactly one axis (features, models, or external evidence) over a fixed baseline.
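The routing step can be illustrated with a minimal sketch. It assumes per-endpoint gains over the baseline are already computed for each axis; the function names, the fallback to the baseline when no axis improves on validation, and the mean aggregation are hypothetical choices for illustration, not details confirmed by the paper.

```python
from typing import Dict

# The three axes named in the paper; everything else here is illustrative.
AXES = ("features", "models", "external_evidence")


def route_endpoint(val_gain: Dict[str, float]) -> str:
    """Return the axis with the largest validation gain over the baseline.

    Assumption: if no axis improves on validation, keep the baseline.
    """
    best_axis = max(AXES, key=lambda a: val_gain.get(a, float("-inf")))
    return best_axis if val_gain.get(best_axis, 0.0) > 0.0 else "baseline"


def routed_test_gain(
    val_gains: Dict[str, Dict[str, float]],
    test_gains: Dict[str, Dict[str, float]],
) -> float:
    """Mean held-out gain of the routed configurations.

    Test gains are looked up only for the axis chosen on validation,
    so the selection never depends on held-out labels.
    """
    total = 0.0
    for endpoint, gains in val_gains.items():
        axis = route_endpoint(gains)
        total += 0.0 if axis == "baseline" else test_gains[endpoint][axis]
    return total / len(val_gains)
```

Because each configuration changes exactly one axis relative to the baseline, the gain recorded under an axis name can be attributed to that axis alone, which is the role the file-level ablation lock plays in the argument.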
If this is right
- The axis that transfers differs by benchmark suite: external data on TDC, models on Polaris, and both features and models on MoleculeNet.
- Individual searches can produce large validation gains that largely disappear on held-out test labels.
- Curated external data improves performance on particular endpoints once the overlap filter is applied.
- The language-model agent's code edits outperform a matched automated machine learning control that does not intervene at the source-code level.
- The pipeline remains competitive with an 84M-parameter pretrained 3D model on the shared training split.
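The non-transfer pattern in the second bullet can be made concrete with the two validation/test pairs quoted in the abstract. The helper names below are hypothetical; only the four numbers come from the source.

```python
def transfer_gap(val_gain: float, test_gain: float) -> float:
    """Portion of the validation gain that did not carry over to held-out test."""
    return val_gain - test_gain


def retained_fraction(val_gain: float, test_gain: float) -> float:
    """Share of the validation gain that survives on held-out test."""
    if val_gain == 0:
        raise ValueError("validation gain is zero; the ratio is undefined")
    return test_gain / val_gain


# (validation gain, held-out test gain) as reported in the abstract.
REPORTED = {
    "model_search": (0.041, 0.003),
    "curated_data": (0.022, -0.019),
}

for name, (val, held_out) in REPORTED.items():
    gap = transfer_gap(val, held_out)
    kept = retained_fraction(val, held_out)
    print(f"{name}: gap={gap:.3f}, retained={kept:.0%}")
```

On these figures the model-search axis keeps roughly 7% of its validation gain, while the curated-data axis changes sign, so its retained fraction is negative.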
Where Pith is reading between the lines
- The same separation between proxy-driven discovery and held-out certification could be applied to automated research loops in other scientific domains that optimize against validation proxies.
- Stronger leakage diagnostics beyond file-overlap statistics may be required before external-data gains can be trusted to generalize.
- Extending the agent action space to include hyperparameter schedules or multi-task training objectives might change which axis transfers most reliably.
Load-bearing premise
The overlap-based contamination filter is treated as adequate to block leakage when external data files are admitted, even though the paper states the filter is necessary but not sufficient.
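A minimal sketch of an overlap-based filter follows, assuming each structure is already reduced to a canonical identifier string (for example a canonical SMILES or InChIKey produced upstream). The function names and the `max_overlap` threshold are hypothetical; the source reports only that rejected same-source files overlapped 64 to 89 percent of test structures, not the cut-off used.

```python
from typing import Iterable, Set


def overlap_with_test(external_ids: Iterable[str], test_ids: Iterable[str]) -> float:
    """Fraction of test structures that also appear in the external file."""
    test_set: Set[str] = set(test_ids)
    if not test_set:
        raise ValueError("test set is empty")
    return len(test_set & set(external_ids)) / len(test_set)


def admit_external_file(
    external_ids: Iterable[str],
    test_ids: Iterable[str],
    max_overlap: float = 0.0,  # hypothetical threshold, not taken from the paper
) -> bool:
    """Admit the file only if its overlap with the test set is within the threshold."""
    return overlap_with_test(external_ids, test_ids) <= max_overlap
```

An exact-identifier check of this kind cannot see near-duplicates, shared scaffolds, or shared assay conditions, which is one way to read the paper's own caveat that the filter is necessary but not sufficient.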
What would settle it
Re-running the identical routed pipeline on a fresh collection of endpoints whose external data sources share zero structures with the test sets and checking whether the reported positive held-out gains remain or vanish.
Original abstract
Closed-loop Auto Research extends automated machine learning from fixed-dataset fitting to changing the research workflow, with language-model agents editing representations and model code and acquiring external evidence. Molecular property prediction spans many small endpoints. We ask whether this action space yields improvements generalizing beyond the validation signal selecting them. We isolate three Auto Research axes, features, models, and external evidence, under a file-level ablation lock attributing each gain to one axis over a strong baseline. Across 36 endpoints in three benchmark suites we score each selected configuration once on a held-out test whose labels the search never read. A routed pipeline taking each endpoint's best validation axis reaches positive held-out gains of 0.013, 0.011, and 0.042, the transferable axis differing by suite, data on TDC, model on Polaris, feature and model on MoleculeNet. The largest model-search gain falls from 0.041 on validation to 0.003 on test, while curated data reaches 0.022 but negative 0.019 on test, two non-transfer signatures. Curated external data raises held-out CYP2C9-substrate performance by 0.17 and half-life by 0.08, admitted through a contamination filter rejecting same-source files overlapping 64 to 89 percent of test structures, necessary but not sufficient for transfer. A matched-trial automated machine learning control did not reproduce the agent's code-level model intervention, reaching 0.006 against 0.042, and the pipeline stays competitive with an 84M-parameter pretrained 3D model on the shared training split. The experiments stay within molecular property prediction, but separating discovery from held-out certification is a domain-agnostic lesson for any closed-loop system optimising a proxy for a held-out quantity.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript describes a closed-loop Auto Research system in which language-model agents modify molecular representations and model code and acquire external data for property prediction tasks. Using an ablation lock to attribute gains to specific axes (features, models, external evidence) and evaluating on held-out tests across 36 endpoints in TDC, Polaris, and MoleculeNet, it reports that a routed selection of the best validation axis yields positive test gains of 0.013 (TDC, data), 0.011 (Polaris, model), and 0.042 (MoleculeNet, feature/model). Some interventions fail to transfer (e.g., the model-search gain drops from 0.041 on validation to 0.003 on test; the curated-data gain from 0.022 to -0.019), and external data improves specific endpoints (CYP2C9-substrate by 0.17, half-life by 0.08) after a contamination filter. A matched AutoML control fails to reproduce the agent's model intervention.
Significance. If the held-out gains prove robust, the work supplies a useful case study of separating discovery from certification in automated workflows, with explicit non-transfer examples and an AutoML control providing informative negative results. The domain-agnostic emphasis on proxy optimization versus held-out evaluation is a constructive contribution to closed-loop AutoML research.
major comments (2)
- [External evidence results] External evidence results (abstract and corresponding results section): The held-out gains from curated external data (0.17 on CYP2C9-substrate, 0.08 on half-life) are central to demonstrating value in the external-evidence axis. However, the manuscript explicitly states that the contamination filter (rejecting same-source files with 64–89 % test-structure overlap) is “necessary but not sufficient for transfer.” This directly raises the possibility that residual leakage (shared substructures, scaffolds, or assay conditions) accounts for the improvements rather than the Auto Research process, weakening the attribution of generalizable gains.
- [Methods] Methods (data splits, ablation lock, and filter implementation): The central transfer claims rest on the precise definition of the file-level ablation lock, the exact computation of the 64–89 % overlap threshold, and the full list of admitted external sources. The provided text does not supply these details or the code, making it impossible to confirm that post-hoc choices or undetected leakage do not affect the reported positive held-out gains of 0.013/0.011/0.042.
minor comments (2)
- [Abstract] Abstract: Explicitly map the three gains (0.013, 0.011, 0.042) to the three suites and their transferable axes for immediate clarity.
- [Abstract] Abstract: The statement that “the pipeline stays competitive with an 84M-parameter pretrained 3D model” should report the exact metric value and training-split comparison to allow direct evaluation.
Simulated Author's Rebuttal
We thank the referee for the careful reading and for identifying the need to strengthen claims around external data transfer and to improve methodological transparency. We respond to each major comment below.
Point-by-point responses
-
Referee: [External evidence results] External evidence results (abstract and corresponding results section): The held-out gains from curated external data (0.17 on CYP2C9-substrate, 0.08 on half-life) are central to demonstrating value in the external-evidence axis. However, the manuscript explicitly states that the contamination filter (rejecting same-source files with 64–89 % test-structure overlap) is “necessary but not sufficient for transfer.” This directly raises the possibility that residual leakage (shared substructures, scaffolds, or assay conditions) accounts for the improvements rather than the Auto Research process, weakening the attribution of generalizable gains.
Authors: We agree that residual leakage via shared substructures or assay conditions cannot be ruled out by the file-level filter alone, and that this limits strong attribution of the 0.17 and 0.08 gains specifically to the Auto Research process. The manuscript already qualifies the filter as “necessary but not sufficient,” and the non-transfer results on other axes are presented precisely to illustrate the difficulty of generalization. In revision we will add an explicit limitations paragraph on this point, temper the abstract language around the external-evidence gains, and include additional post-hoc checks (scaffold overlap statistics and a random-substructure ablation) if they can be completed without new data access. revision: yes
-
Referee: [Methods] Methods (data splits, ablation lock, and filter implementation): The central transfer claims rest on the precise definition of the file-level ablation lock, the exact computation of the 64–89 % overlap threshold, and the full list of admitted external sources. The provided text does not supply these details or the code, making it impossible to confirm that post-hoc choices or undetected leakage do not affect the reported positive held-out gains of 0.013/0.011/0.042.
Authors: The referee is correct that the current text omits the exact definition of the file-level ablation lock, the overlap-threshold algorithm, and the enumerated external sources. These omissions prevent independent verification. We will expand the Methods section with pseudocode for the ablation lock and filter, the precise overlap metric (Tanimoto on Morgan fingerprints at radius 2), the 64–89 % range derivation, and the list of admitted sources. The full implementation will be released with the camera-ready version. revision: yes
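The rebuttal names Tanimoto similarity on Morgan fingerprints at radius 2 as the overlap metric. As a hedged illustration of the similarity step only, the sketch below represents each fingerprint as a set of on-bit indices; generating the fingerprints (for instance with a cheminformatics toolkit) is left out, and the function names are hypothetical rather than the authors' implementation.

```python
from typing import AbstractSet, Iterable


def tanimoto(a: AbstractSet[int], b: AbstractSet[int]) -> float:
    """Tanimoto (Jaccard) similarity between two sets of on-bit indices."""
    union = len(a | b)
    if union == 0:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / union


def max_similarity_to_test(
    candidate: AbstractSet[int],
    test_fingerprints: Iterable[AbstractSet[int]],
) -> float:
    """Highest Tanimoto similarity between one external molecule and any test molecule."""
    return max((tanimoto(candidate, fp) for fp in test_fingerprints), default=0.0)
```

A nearest-neighbour score like this could flag external molecules that are close analogues of test structures even when their identifiers differ, which is the kind of residual leakage the referee raises.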
Circularity Check
No significant circularity; held-out certification is independent
Full rationale
The paper's central results rest on a held-out test set whose labels are never accessed during the search or axis selection process, with explicit reporting of non-transfer cases (e.g., model-search gain dropping from 0.041 to 0.003, curated data from 0.022 to -0.019). The routed pipeline selects on validation but certifies on unseen test data, and the contamination filter is openly described as 'necessary but not sufficient.' No equations, self-definitional reductions, fitted inputs renamed as predictions, or load-bearing self-citations appear in the derivation. The evaluation chain is self-contained against external benchmarks and does not reduce to its inputs by construction.
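The independence argument depends on held-out labels being read exactly once per selected configuration. One way to enforce that in code is a guard object that refuses a second scoring call. This is a sketch under that assumption; the class name and the choice of mean absolute error as the metric are hypothetical and not drawn from the paper.

```python
from typing import Callable, Sequence


class HeldOutTest:
    """Wraps held-out labels so that they can be scored only once."""

    def __init__(self, labels: Sequence[float]) -> None:
        self._labels = list(labels)
        self._scored = False

    def score_once(
        self,
        predictions: Sequence[float],
        metric: Callable[[Sequence[float], Sequence[float]], float],
    ) -> float:
        if self._scored:
            raise RuntimeError("held-out labels were already scored once")
        if len(predictions) != len(self._labels):
            raise ValueError("predictions and labels differ in length")
        self._scored = True
        return metric(self._labels, predictions)


def mean_absolute_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Average absolute difference between labels and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)
```

All search and routing decisions would use validation data only; the guard turns any accidental second look at test labels into an explicit error instead of a silent source of selection bias.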
Axiom & Free-Parameter Ledger
S1 Search trajectories
Figure S1 reports the best-so-far aggregate normalised improvement on the TDC ADMET suite for each isolated axis across the budget of one hundred trials per axis, with one additional trial on the feature axis. The curve for each axis is ...