arxiv: 2606.22731 · v1 · pith:PK2BUOS4new · submitted 2026-06-22 · 💻 cs.AI · cs.MA

Closed-loop Auto Research for Molecular Property Prediction: Discovering and Certifying Generalizable Improvements

Pith reviewed 2026-06-26 09:10 UTC · model grok-4.3

classification 💻 cs.AI cs.MA
keywords molecular property prediction · language model agents · automated research · held-out certification · external data acquisition · feature and model search · benchmark evaluation

The pith

A routed pipeline selects each molecular endpoint's best validation axis and delivers positive held-out gains of 0.011 to 0.042 across three benchmark suites.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines whether language-model agents that edit molecular features, model code, and external data can produce improvements that survive evaluation on test labels the search process never sees. It isolates three search axes under a file-level ablation lock that credits each gain to one axis, runs the search on 36 endpoints from TDC, Polaris, and MoleculeNet, and then scores the selected configurations once on held-out tests. A pipeline that routes every endpoint to its strongest validation axis records small positive test gains, while single-axis searches sometimes fail to transfer and a matched AutoML control fails to reproduce the agent's model edits. The concrete lesson is that discovery on a validation proxy must be followed by separate certification on unseen labels.

Core claim

A routed pipeline that assigns each endpoint to its best validation axis (features, models, or external evidence) produces held-out test gains of 0.013 on TDC, 0.011 on Polaris, and 0.042 on MoleculeNet. The transferable axis varies by suite. The largest model-search gain drops from 0.041 on validation to 0.003 on test. Curated external data, admitted through an overlap filter, lifts specific endpoints such as CYP2C9-substrate by 0.17. A matched AutoML control reaches only 0.006 against the agent's 0.042.
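The routing rule in this claim can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the axis names, the fallback to the baseline when no axis has a positive validation gain, and all numbers in the toy dictionaries are assumptions made for the example.

```python
from typing import Dict

# Axis labels are illustrative names for the three search axes.
AXES = ("features", "models", "external_evidence")


def route_endpoint(val_gains: Dict[str, float]) -> str:
    """Pick the axis with the largest validation gain for one endpoint.

    Assumption: if no axis improves on validation, keep the baseline.
    """
    best_axis = max(AXES, key=lambda axis: val_gains.get(axis, float("-inf")))
    if val_gains.get(best_axis, float("-inf")) <= 0.0:
        return "baseline"
    return best_axis


def routed_test_gain(val_gains: Dict[str, Dict[str, float]],
                     test_gains: Dict[str, Dict[str, float]]) -> float:
    """Mean held-out gain when every endpoint is routed on validation only.

    Test gains are read once, after routing, and never influence the choice.
    """
    total = 0.0
    for endpoint, gains in val_gains.items():
        axis = route_endpoint(gains)
        if axis != "baseline":
            total += test_gains[endpoint][axis]
    return total / len(val_gains)


# Toy numbers, invented for illustration (not taken from the paper).
toy_val = {
    "ep_a": {"features": 0.02, "models": 0.05, "external_evidence": -0.01},
    "ep_b": {"features": -0.01, "models": -0.02, "external_evidence": -0.03},
}
toy_test = {
    "ep_a": {"features": 0.01, "models": 0.004, "external_evidence": 0.0},
    "ep_b": {"features": 0.0, "models": 0.0, "external_evidence": 0.0},
}
print(route_endpoint(toy_val["ep_a"]), routed_test_gain(toy_val, toy_test))
```

The point of the design is that `test_gains` appears only in the final aggregation, so the selection step cannot overfit to the labels used for certification.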

What carries the argument

The routed pipeline, together with a file-level ablation lock that attributes each performance change to exactly one axis (features, models, or external evidence) over a fixed baseline.
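One way to picture a file-level ablation lock is a check that every trial edits files belonging to a single axis. The sketch below is hypothetical: the file names and the file-to-axis mapping are invented for illustration, since the paper's actual layout is not described in the text reviewed here.

```python
from typing import Iterable, Optional

# Hypothetical mapping from editable files to axes (not from the paper).
FILE_TO_AXIS = {
    "features.py": "features",
    "model.py": "models",
    "external_data.csv": "external_evidence",
}


def locked_axis(changed_files: Iterable[str]) -> Optional[str]:
    """Return the single axis a trial touched, or None if nothing changed.

    Raises ValueError if a file is outside the editable set or if the
    trial edits files from more than one axis, since either case would
    make the gain impossible to attribute to one axis.
    """
    axes = set()
    for path in changed_files:
        if path not in FILE_TO_AXIS:
            raise ValueError(f"{path} is outside the editable set")
        axes.add(FILE_TO_AXIS[path])
    if len(axes) > 1:
        raise ValueError(f"trial touches several axes: {sorted(axes)}")
    return axes.pop() if axes else None


print(locked_axis(["model.py"]))
```

Rejecting mixed-axis trials outright, rather than splitting credit between axes, is the simplest reading of "exactly one axis" and is an assumption of this sketch.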

If this is right

  • The axis that transfers differs by benchmark suite: external data on TDC, models on Polaris, and both features and models on MoleculeNet.
  • Individual searches can produce large validation gains that largely disappear on held-out test labels.
  • Curated external data improves performance on particular endpoints once the overlap filter is applied.
  • The language-model agent's code edits outperform a matched automated machine learning control that does not intervene at the source-code level.
  • The pipeline remains competitive with an 84M-parameter pretrained 3D model on the shared training split.
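The validation-to-test shrinkage described in the second bullet can be made concrete with the two non-transfer cases quoted in the abstract (0.041 to 0.003 for model search, 0.022 to -0.019 for curated data). The helper below and its field names are illustrative assumptions, not a metric defined by the paper.

```python
from typing import Dict, Optional, Union


def transfer_summary(val_gain: float,
                     test_gain: float) -> Dict[str, Union[float, bool, None]]:
    """Summarise how much of a validation gain survives on held-out test."""
    retained: Optional[float] = None
    if val_gain != 0.0:
        retained = test_gain / val_gain  # fraction of the gain that transfers
    return {
        "gap": val_gain - test_gain,      # absolute shrinkage
        "retained": retained,
        "sign_flip": (val_gain > 0.0) != (test_gain > 0.0),
    }


# Values quoted in the abstract.
model_search = transfer_summary(0.041, 0.003)
curated_data = transfer_summary(0.022, -0.019)
print(model_search)
print(curated_data)
```

On these numbers the model-search gain retains roughly 7% of its validation value, while the curated-data gain changes sign, which is why the abstract treats them as two distinct non-transfer signatures.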

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same separation between proxy-driven discovery and held-out certification could be applied to automated research loops in other scientific domains that optimize against validation proxies.
  • Stronger leakage diagnostics beyond file-overlap statistics may be required before external-data gains can be trusted to generalize.
  • Extending the agent action space to include hyperparameter schedules or multi-task training objectives might change which axis transfers most reliably.

Load-bearing premise

The overlap-based contamination filter is treated as adequate to block leakage when external data files are admitted, even though the paper states the filter is necessary but not sufficient.
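A minimal sketch of an overlap-based filter follows. It assumes structures are already canonicalised identifiers (for example canonical SMILES or InChIKeys) compared by exact string match, and it uses a hypothetical `max_overlap` threshold; the paper reports only that rejected same-source files overlapped 64 to 89 percent of test structures, not the cutoff itself.

```python
from typing import Iterable


def overlap_fraction(external_structures: Iterable[str],
                     test_structures: Iterable[str]) -> float:
    """Fraction of test structures that also appear in an external file."""
    test_set = set(test_structures)
    if not test_set:
        raise ValueError("test set is empty")
    return len(test_set & set(external_structures)) / len(test_set)


def admit_external_file(external_structures: Iterable[str],
                        test_structures: Iterable[str],
                        max_overlap: float = 0.0) -> bool:
    """Admit the file only if its overlap with the test set is small enough.

    max_overlap is a hypothetical knob; 0.0 means no shared structure at all.
    """
    return overlap_fraction(external_structures, test_structures) <= max_overlap


# Toy identifiers, invented for illustration.
external = ["CCO", "CCN", "c1ccccc1"]
test = ["CCO", "CCN", "CCC", "CCCl"]
print(overlap_fraction(external, test), admit_external_file(external, test))
```

Exact-match overlap catches duplicated molecules but not shared scaffolds or assay conditions, which is consistent with the filter being described as necessary but not sufficient.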

What would settle it

Re-running the identical routed pipeline on a fresh collection of endpoints whose external data sources share zero structures with the test sets and checking whether the reported positive held-out gains remain or vanish.

Original abstract

Closed-loop Auto Research extends automated machine learning from fixed-dataset fitting to changing the research workflow, with language-model agents editing representations and model code and acquiring external evidence. Molecular property prediction spans many small endpoints. We ask whether this action space yields improvements generalizing beyond the validation signal selecting them. We isolate three Auto Research axes, features, models, and external evidence, under a file-level ablation lock attributing each gain to one axis over a strong baseline. Across 36 endpoints in three benchmark suites we score each selected configuration once on a held-out test whose labels the search never read. A routed pipeline taking each endpoint's best validation axis reaches positive held-out gains of 0.013, 0.011, and 0.042, the transferable axis differing by suite, data on TDC, model on Polaris, feature and model on MoleculeNet. The largest model-search gain falls from 0.041 on validation to 0.003 on test, while curated data reaches 0.022 but negative 0.019 on test, two non-transfer signatures. Curated external data raises held-out CYP2C9-substrate performance by 0.17 and half-life by 0.08, admitted through a contamination filter rejecting same-source files overlapping 64 to 89 percent of test structures, necessary but not sufficient for transfer. A matched-trial automated machine learning control did not reproduce the agent's code-level model intervention, reaching 0.006 against 0.042, and the pipeline stays competitive with an 84M-parameter pretrained 3D model on the shared training split. The experiments stay within molecular property prediction, but separating discovery from held-out certification is a domain-agnostic lesson for any closed-loop system optimising a proxy for a held-out quantity.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript describes a closed-loop Auto Research system in which language-model agents modify molecular representations and model code and acquire external data for property prediction tasks. Using an ablation lock to attribute gains to specific axes (features, models, external evidence) and evaluating on held-out tests across 36 endpoints in TDC, Polaris, and MoleculeNet, it reports that a routed selection of the best validation axis yields positive test gains of 0.013 (TDC, data), 0.011 (Polaris, model), and 0.042 (MoleculeNet, feature/model). Some interventions fail to transfer (e.g., the largest model-search gain drops from 0.041 on validation to 0.003 on test; curated data from 0.022 to -0.019), and external data improves specific endpoints (CYP2C9-substrate by 0.17, half-life by 0.08) after a contamination filter. A matched AutoML control fails to reproduce the agent's model intervention.

Significance. If the held-out gains prove robust, the work supplies a useful case study of separating discovery from certification in automated workflows, with explicit non-transfer examples and an AutoML control providing informative negative results. The domain-agnostic emphasis on proxy optimization versus held-out evaluation is a constructive contribution to closed-loop AutoML research.

major comments (2)
  1. [External evidence results] External evidence results (abstract and corresponding results section): The held-out gains from curated external data (0.17 on CYP2C9-substrate, 0.08 on half-life) are central to demonstrating value in the external-evidence axis. However, the manuscript explicitly states that the contamination filter (rejecting same-source files with 64–89 % test-structure overlap) is “necessary but not sufficient for transfer.” This directly raises the possibility that residual leakage (shared substructures, scaffolds, or assay conditions) accounts for the improvements rather than the Auto Research process, weakening the attribution of generalizable gains.
  2. [Methods] Methods (data splits, ablation lock, and filter implementation): The central transfer claims rest on the precise definition of the file-level ablation lock, the exact computation of the 64–89 % overlap threshold, and the full list of admitted external sources. The provided text does not supply these details or the code, making it impossible to confirm that post-hoc choices or undetected leakage do not affect the reported positive held-out gains of 0.013/0.011/0.042.
minor comments (2)
  1. [Abstract] Abstract: Explicitly map the three gains (0.013, 0.011, 0.042) to the three suites and their transferable axes for immediate clarity.
  2. [Abstract] Abstract: The statement that “the pipeline stays competitive with an 84M-parameter pretrained 3D model” should report the exact metric value and training-split comparison to allow direct evaluation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and for identifying the need to strengthen claims around external data transfer and to improve methodological transparency. We respond to each major comment below.

Point-by-point responses
  1. Referee: [External evidence results] External evidence results (abstract and corresponding results section): The held-out gains from curated external data (0.17 on CYP2C9-substrate, 0.08 on half-life) are central to demonstrating value in the external-evidence axis. However, the manuscript explicitly states that the contamination filter (rejecting same-source files with 64–89 % test-structure overlap) is “necessary but not sufficient for transfer.” This directly raises the possibility that residual leakage (shared substructures, scaffolds, or assay conditions) accounts for the improvements rather than the Auto Research process, weakening the attribution of generalizable gains.

    Authors: We agree that residual leakage via shared substructures or assay conditions cannot be ruled out by the file-level filter alone, and that this limits strong attribution of the 0.17 and 0.08 gains specifically to the Auto Research process. The manuscript already qualifies the filter as “necessary but not sufficient,” and the non-transfer results on other axes are presented precisely to illustrate the difficulty of generalization. In revision we will add an explicit limitations paragraph on this point, temper the abstract language around the external-evidence gains, and include additional post-hoc checks (scaffold overlap statistics and a random-substructure ablation) if they can be completed without new data access. revision: yes

  2. Referee: [Methods] Methods (data splits, ablation lock, and filter implementation): The central transfer claims rest on the precise definition of the file-level ablation lock, the exact computation of the 64–89 % overlap threshold, and the full list of admitted external sources. The provided text does not supply these details or the code, making it impossible to confirm that post-hoc choices or undetected leakage do not affect the reported positive held-out gains of 0.013/0.011/0.042.

    Authors: The referee is correct that the current text omits the exact definition of the file-level ablation lock, the overlap-threshold algorithm, and the enumerated external sources. These omissions prevent independent verification. We will expand the Methods section with pseudocode for the ablation lock and filter, the precise overlap metric (Tanimoto on Morgan fingerprints at radius 2), the 64–89 % range derivation, and the list of admitted sources. The full implementation will be released with the camera-ready version. revision: yes
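The overlap metric named in the simulated response (Tanimoto similarity on fingerprints) has a standard definition: the number of shared on-bits divided by the number of on-bits in either fingerprint. The sketch below assumes fingerprints are already computed and represented as sets of on-bit indices; generating Morgan fingerprints at radius 2 would require a cheminformatics toolkit and is left out. The helper `max_similarity_to_test` is a hypothetical addition for illustration.

```python
from typing import Iterable, Set


def tanimoto(fp_a: Iterable[int], fp_b: Iterable[int]) -> float:
    """Tanimoto (Jaccard) similarity between two sets of on-bit indices."""
    a, b = set(fp_a), set(fp_b)
    union = len(a | b)
    if union == 0:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / union


def max_similarity_to_test(candidate: Set[int],
                           test_fps: Iterable[Set[int]]) -> float:
    """Highest similarity between one external molecule and any test molecule."""
    return max((tanimoto(candidate, fp) for fp in test_fps), default=0.0)


# Toy bit sets, invented for illustration.
print(tanimoto({1, 2, 3}, {2, 3, 4}))
print(max_similarity_to_test({1, 2, 3}, [{2, 3, 4}, {1, 2, 3, 9}]))
```

A similarity-based check of this kind flags near-duplicates that an exact-match filter would miss, at the cost of having to choose a similarity cutoff.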

Circularity Check

0 steps flagged

No significant circularity; held-out certification is independent

Full rationale

The paper's central results rest on a held-out test set whose labels are never accessed during the search or axis selection process, with explicit reporting of non-transfer cases (e.g., model-search gain dropping from 0.041 to 0.003, curated data from 0.022 to -0.019). The routed pipeline selects on validation but certifies on unseen test data, and the contamination filter is openly described as 'necessary but not sufficient.' No equations, self-definitional reductions, fitted inputs renamed as predictions, or load-bearing self-citations appear in the derivation. The evaluation chain is self-contained against external benchmarks and does not reduce to its inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The work relies on standard supervised learning assumptions and the unstated premise that the LM agent's code edits are causally responsible for observed differences versus the matched AutoML control.

pith-pipeline@v0.9.1-grok · 5872 in / 1251 out tokens · 26874 ms · 2026-06-26T09:10:54.659532+00:00 · methodology

