pith. machine review for the scientific record.

arxiv: 2604.21937 · v1 · submitted 2026-04-02 · 💻 cs.AI · cs.MA


MolClaw: An Autonomous Agent with Hierarchical Skills for Drug Molecule Evaluation, Screening, and Optimization


Pith reviewed 2026-05-13 21:23 UTC · model grok-4.3

classification 💻 cs.AI cs.MA
keywords autonomous agents · drug discovery · hierarchical skills · molecular screening · AI agents · benchmarks · workflow orchestration · computational chemistry

The pith

A three-tier hierarchical skill architecture lets an AI agent reliably handle complex multi-step drug molecule workflows.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents MolClaw, an autonomous agent that integrates more than 30 specialized tools for drug molecule evaluation, screening, and optimization. Skills are organized into three tiers: basic tool operations, composed workflows that include quality checks and reflection, and a top-level discipline skill that supplies scientific principles for planning. This setup supports long sequences of tool calls that current agents handle poorly. The authors introduce MolBench, a benchmark with tasks ranging from 8 to over 50 sequential calls, and show that MolClaw reaches state-of-the-art results across metrics. Gains appear mainly on structured workflow tasks and disappear on simpler ad hoc problems, pointing to workflow orchestration as the main limit for AI in this domain.

Core claim

MolClaw unifies over 30 domain resources into 70 skills through a three-tier hierarchy that enables robust long-term interaction: tool-level skills standardize atomic operations, workflow-level skills assemble them into validated pipelines with built-in checks, and a discipline-level skill supplies governing scientific principles. On the introduced MolBench benchmark of molecular screening, optimization, and end-to-end discovery tasks, MolClaw attains state-of-the-art performance, with ablation results showing that improvements concentrate on tasks needing structured workflows and vanish on those solvable by ad hoc scripting.

What carries the argument

The three-tier hierarchical skill architecture that organizes atomic tool operations, validated multi-step pipelines, and field-wide scientific principles to support planning and verification.
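
As a rough illustration of how such a hierarchy might be wired together, the Python sketch below layers the three tiers. Every name in it (`ToolSkill`, `WorkflowSkill`, `DisciplineSkill`, the toy `compute_weight` and `filter_light` tools, and the sample principle) is hypothetical and not taken from the paper. It only shows tool-level calls wrapped by a workflow-level quality check, with a discipline-level object choosing which workflow to run.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class ToolSkill:
    """Tier 1 (hypothetical): one atomic, standardized tool operation."""
    name: str
    run: Callable[[Any], Any]


@dataclass
class WorkflowSkill:
    """Tier 2 (hypothetical): an ordered pipeline with a check after each step."""
    name: str
    steps: List[ToolSkill]
    check: Callable[[str, Any], bool]

    def execute(self, data: Any) -> Any:
        for step in self.steps:
            data = step.run(data)
            # Quality gate: stop the pipeline as soon as a step output looks wrong.
            if not self.check(step.name, data):
                raise ValueError(f"quality check failed after {step.name}")
        return data


@dataclass
class DisciplineSkill:
    """Tier 3 (hypothetical): field-wide principles consulted when planning."""
    principles: List[str] = field(default_factory=list)

    def plan(self, goal: str, workflows: Dict[str, WorkflowSkill]) -> WorkflowSkill:
        # Toy planner: pick the workflow whose name appears in the goal text.
        for name, workflow in workflows.items():
            if name in goal:
                return workflow
        raise KeyError(f"no workflow matches goal: {goal}")


# Toy tools: string length stands in for a molecular property (not real chemistry).
compute_weight = ToolSkill("compute_weight", lambda mols: [(m, len(m) * 10.0) for m in mols])
filter_light = ToolSkill("filter_light", lambda rows: [m for m, w in rows if w <= 50.0])

screening = WorkflowSkill(
    name="screening",
    steps=[compute_weight, filter_light],
    check=lambda step_name, output: isinstance(output, list),
)
discipline = DisciplineSkill(principles=["verify every intermediate result"])

workflow = discipline.plan("run screening on candidates", {"screening": screening})
hits = workflow.execute(["CCO", "CCCCCCCC", "CN"])
print(hits)
```

In this toy setup the agent never calls a tool directly; it asks the top tier for a workflow and lets the middle tier enforce the checks, which is one plausible reading of the layering described above.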

If this is right

  • Agents gain the ability to maintain quality across dozens of sequential tool calls in drug discovery.
  • Performance improvements are tied directly to tasks that require validated workflow composition.
  • Workflow orchestration competence becomes the central capability needed to advance AI-driven drug discovery.
  • End-to-end discovery challenges spanning screening through optimization become feasible for autonomous agents.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same layered skill design could transfer to other scientific fields that combine many specialized tools, such as materials design or synthetic biology.
  • Embedding skills at runtime may reduce dependence on ever-longer prompts for complex domains.
  • Testing the architecture on physical lab robots would show whether the digital workflow gains carry over to real experimental loops.
  • If workflow orchestration is the bottleneck, future agent work should prioritize pipeline validation over raw tool count.

Load-bearing premise

The three-tier hierarchical skill structure is the main reason for better performance rather than prompt engineering, tool selection, or benchmark tuning.

What would settle it

Measure whether removing the workflow-level or discipline-level skills eliminates MolClaw's advantage specifically on the MolBench tasks that need 8 to 50+ sequential tool calls, while leaving performance on simple tasks unchanged.
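
One way to make this test concrete is a difference-in-differences over task categories: compute the gain of the full configuration over the ablated one separately for long-workflow tasks and for simple tasks, then compare the two gains. The sketch below assumes per-run success flags are available. The function names, category labels, and toy records are hypothetical placeholders, not results from the paper.

```python
from statistics import mean
from typing import Dict, List, Tuple

# Each record: (task category, configuration, success flag).
Record = Tuple[str, str, bool]


def success_rate(records: List[Record], category: str, config: str) -> float:
    """Fraction of successful runs for one category/configuration pair."""
    flags = [ok for cat, cfg, ok in records if cat == category and cfg == config]
    if not flags:
        raise ValueError(f"no records for {category}/{config}")
    return mean(1.0 if ok else 0.0 for ok in flags)


def ablation_gains(records: List[Record], full: str = "full",
                   ablated: str = "ablated") -> Dict[str, float]:
    """Gain (full minus ablated success rate) for each task category."""
    categories = sorted({cat for cat, _, _ in records})
    return {
        cat: success_rate(records, cat, full) - success_rate(records, cat, ablated)
        for cat in categories
    }


# Toy placeholder data, NOT measurements from the paper.
toy = [
    ("workflow", "full", True), ("workflow", "full", True),
    ("workflow", "full", True), ("workflow", "full", False),
    ("workflow", "ablated", True), ("workflow", "ablated", False),
    ("workflow", "ablated", False), ("workflow", "ablated", False),
    ("simple", "full", True), ("simple", "full", True),
    ("simple", "ablated", True), ("simple", "ablated", True),
]

gains = ablation_gains(toy)
# Difference-in-differences: how much larger the gain is on workflow tasks.
interaction = gains["workflow"] - gains["simple"]
print(gains, interaction)
```

A clearly positive interaction together with a near-zero gain on simple tasks would match the pattern the paper claims; similar gains in both categories would point toward prompting or tool access instead.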

Figures

Figures reproduced from arXiv: 2604.21937 by Bowen Zhou, Haoran Sun, Haoyang Su, Lei Bai, Lilong Wang, Lisheng Zhang, Qikui Yang, Qingsong Li, Wei Tang, Wenjie Lou, Xiangyu Sun, Xiaosong Wang, Yankai Jiang, Yingnan Han, Yuehui Qian, Zhengwei Xie, Zhenyu Tang.

Figure 1. Technical architecture of MolClaw. The left side illustrates the construction of hierarchical …
Figure 2. Agent execution traces for the three MolBench-E2E tasks. (A) E2E-Q1: coarse-grained conformational sampling. Five tool-level failures (red) were resolved via skill-governed recovery actions (orange), yielding 20 verified all-atom structures. (B) E2E-Q2: QED-driven iterative optimization. One tool fallback (F1), two constraint-driven rejections (F2–F3), and five strategy adaptations (D1–D5) were autonomous…
Figure 3. MolClaw achieves state-of-the-art performance across all MolBench evaluation dimensions. (A) Binding affinity comparison accuracy. MolClaw-CC achieves 81.1%. (B) Docking screening hit count. MolClaw-CC attains 0.80. (C) Molecule editing accuracy. MolClaw-CC reaches 100.0%. (D) Optimization success rate. (E) Property filtering accuracy. (F) Property filtering F1 score. (G) Agent systems grouped comparison a…
Figure 4. Statistical validation confirms the significance and reliability of MolClaw’s performance advantages. (A) Normalized performance heatmap across seven metrics for 13 methods. MolClaw variants highlighted by red borders. (B–E) Wilson score 95% CI forest plots for binding affinity accuracy (B), molecule editing accuracy (C), optimization success rate (D), and property filtering accuracy (E). (F–I) Category-l…
Figure 5. Ablation studies and in-depth statistical analyses reveal the mechanistic basis of MolClaw’s superiority. (A–C) Ablation on Claude Code and OpenClaw platforms: accuracy metrics (A), docking hit count (B), optimization delta (C). Largest skill-driven gain: binding affinity +29.7 pp (P = 0.013, h = 0.64). (D) Rank trajectory across four tasks for top six methods. (E) Average rank (Friedman χ² = 35.35, P = 2…
Figure 6. Coarse-grained conformational sampling of the EGFR kinase domain by OpenAWSEM and GoCa. (A) Superposition of 10 PULCHRA-reconstructed all-atom conformations from the OpenAWSEM ensemble, aligned to the 1M17 crystal structure. (B) Corresponding superposition for the GoCa ensemble. (C) Cα-RMSD to native structure: GoCa 4.54 ± 0.93 Å versus AWSEM 7.78 ± 1.53 Å (P = 7.69 × 10⁻⁴). (D) Radius of gyration: GoCa…
Figure 7. QED-driven iterative optimization of a triazolo-benzodiazepine scaffold by the AI agent. (A) Multi-dimensional property trajectory across five optimization rounds: QED score (target ≥ 0.70), MW, ALogP, Tanimoto similarity (constraint ≥ 0.40), TPSA and rotatable bonds. (B) QED desirability decomposition by component and round (R0–R5). (C) QED–Tanimoto trade-off for all 182 molecules; red stars, selected bes…
Figure 8. Comprehensive evaluation of AI-agent-driven iterative lead optimization of Erlotinib targeting the EGFR kinase domain. (A) Optimization trajectory showing best QuickVina docking score per round; blue dashed line: Erlotinib baseline (−6.9 kcal/mol); red dashed line: −8.9 kcal/mol target. (B) Docking score distributions across Rounds 1–6 (box-and-strip plot, n = 54). (C) Tanimoto similarity heatmap between r…
Figure 9. Schrödinger-style 2D protein–ligand interaction diagrams and 3D pose overlay. (A) Erlotinib baseline (−6.9 kcal/mol): two H-bonds (Thr766, Asp831) and eight hydrophobic contacts. (B) R1 best (−7.4): methoxy shortening + meta-Br; new Met769 H-bond (2.99 Å). (C) R2 best (−8.0): Br→F substitution; Met769 maintained. (D) R3 best (−8.3): F + OH + CH3 on aniline. (E) R4 best (−8.9, target met): 2,6-diF-4-OH anil…
Figure 10. Statistical validation, source attribution, and interaction conservation analysis. (A) Docking scores of all 54 molecules by source: REINVENT4 (blue, n = 20) and agent-designed (red, n = 34). (B) Per-round mean scores ± s.e.m. tested against baseline (Wilcoxon); R1 n.s., R2–R3 ∗∗, R4–R6 ∗∗∗. (C) Early (R1–R3) vs. late (R4–R6) violin plot (p = 1.24 × 10⁻⁴). (D) Agent vs. REINVENT violin plot (p = 0.104, n…
Original abstract

Computational drug discovery, particularly the complex workflows of drug molecule screening and optimization, requires orchestrating dozens of specialized tools in multi-step workflows, yet current AI agents struggle to maintain robust performance and consistently underperform in these high-complexity scenarios. Here we present MolClaw, an autonomous agent that leads drug molecule evaluation, screening, and optimization. It unifies over 30 specialized domain resources through a three-tier hierarchical skill architecture (70 skills in total) that facilitates agent long-term interaction at runtime: tool-level skills standardize atomic operations, workflow-level skills compose them into validated pipelines with quality check and reflection, and a discipline-level skill supplies scientific principles governing planning and verification across all scenarios in the field. Additionally, we introduce MolBench, a benchmark comprising molecular screening, optimization, and end-to-end discovery challenges spanning 8 to 50+ sequential tool calls. MolClaw achieves state-of-the-art performance across all metrics, and ablation studies confirm that gains concentrate on tasks that demand structured workflows while vanishing on those solvable with ad hoc scripting, establishing workflow orchestration competence as the primary capability bottleneck for AI-driven drug discovery.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces MolClaw, an autonomous agent for drug molecule evaluation, screening, and optimization that integrates over 30 specialized resources via a three-tier hierarchical skill architecture (tool-level skills for atomic operations, workflow-level skills for validated pipelines with reflection, and discipline-level skills for scientific principles). It also presents MolBench, a benchmark of molecular tasks requiring 8–50+ sequential tool calls. The central claims are that MolClaw achieves state-of-the-art performance across all metrics on MolBench and that ablation studies show performance gains are concentrated on tasks needing structured workflows while disappearing on ad-hoc-scriptable tasks, identifying workflow orchestration as the key bottleneck for AI in drug discovery.

Significance. If the empirical claims hold after proper controls, the work would provide concrete evidence that explicit hierarchical skill decomposition improves long-horizon reliability in scientific agent workflows, a result with direct implications for AI-assisted drug discovery pipelines. The introduction of MolBench as a standardized, multi-step benchmark would also be a useful community resource. However, the abstract reports no quantitative metrics, baseline details, or statistical tests, and it is unclear whether the ablation isolates the hierarchy's contribution, so the magnitude and attribution of the advance cannot currently be verified.

major comments (3)
  1. Abstract and §4 (Ablation Studies): The claim that 'gains concentrate on tasks that demand structured workflows while vanishing on those solvable with ad hoc scripting' is load-bearing for the central thesis, yet the manuscript provides no description of how the ad-hoc baseline agents were prompted, whether they received the same 30+ resources, or how reflection/error-handling templates were matched. Without these controls, the performance gap cannot be confidently attributed to the three-tier hierarchy rather than differences in prompting or tool access.
  2. Abstract and §3 (Experimental Setup): No quantitative metrics, error bars, baseline names, or statistical tests are reported even in the abstract, despite the SOTA claim. The full methods section must supply exact performance numbers, number of runs, and comparison agents (including their architecture and prompting) before the SOTA assertion can be evaluated.
  3. §2.2 (Hierarchical Skill Architecture): The discipline-level skill is described as supplying 'scientific principles governing planning and verification,' but the manuscript does not show how these principles are encoded or enforced at runtime versus being implicit in the workflow-level reflection steps. This distinction is necessary to isolate the contribution of the explicit hierarchy.
minor comments (2)
  1. Figure 1 and §2.1: The diagram of the three-tier architecture would benefit from explicit arrows showing runtime control flow between levels and from a legend distinguishing static skill definitions from dynamic invocation.
  2. §5 (Related Work): The comparison to prior agent frameworks (e.g., ReAct, Toolformer) should include a table contrasting skill hierarchy depth, number of tools, and benchmark task length to clarify novelty.
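
The figure captions mention Wilson score 95% CIs and an effect size h alongside the +29.7 pp ablation gain, which are the kind of statistics major comment 2 asks to see reported. For readers who want to check such numbers, here is a minimal sketch of both formulas using only the Python standard library. The function names are illustrative, the counts passed to the interval are toy values, and the proportions passed to the effect size assume that the 81.1% binding affinity accuracy and the +29.7 pp gain belong to the same comparison (implying a 51.4% baseline), which this review does not confirm.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 gives about 95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h effect size for the difference between two proportions."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))


# Toy counts for the interval: 8 successes out of 10 trials.
low, high = wilson_interval(8, 10)

# Assumed proportions: 0.811 with skills, 0.811 - 0.297 = 0.514 without.
h = cohens_h(0.811, 0.514)
print(f"Wilson 95% CI: ({low:.3f}, {high:.3f}); Cohen's h: {h:.2f}")
```

Under the stated assumption the effect size comes out close to the h = 0.64 quoted in the Figure 5 caption, which suggests the caption's numbers are at least internally consistent.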

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, providing clarifications based on the manuscript content and outlining specific revisions to improve transparency and rigor.

Point-by-point responses
  1. Referee: Abstract and §4 (Ablation Studies): The claim that 'gains concentrate on tasks that demand structured workflows while vanishing on those solvable with ad hoc scripting' is load-bearing for the central thesis, yet the manuscript provides no description of how the ad-hoc baseline agents were prompted, whether they received the same 30+ resources, or how reflection/error-handling templates were matched. Without these controls, the performance gap cannot be confidently attributed to the three-tier hierarchy rather than differences in prompting or tool access.

    Authors: We agree that explicit controls are essential for attributing gains to the hierarchy. The ad-hoc baselines received identical access to the full set of over 30 resources and used the same tool-calling format; they differed only by omitting workflow-level composition and discipline-level principle injection, relying on standard ReAct-style prompting without structured reflection templates. To strengthen this, we will expand §4 with a new subsection containing example prompts for each baseline, a configuration comparison table, and confirmation that error-handling was matched at the tool level only. This revision will make the isolation of the hierarchy's contribution fully verifiable. revision: yes

  2. Referee: Abstract and §3 (Experimental Setup): No quantitative metrics, error bars, baseline names, or statistical tests are reported even in the abstract, despite the SOTA claim. The full methods section must supply exact performance numbers, number of runs, and comparison agents (including their architecture and prompting) before the SOTA assertion can be evaluated.

    Authors: We acknowledge that the abstract prioritizes high-level claims over numbers due to length limits, but the full §3 and §4 already contain performance tables, baseline names (ReAct, Reflexion, and standard LLM agents), and run counts. We will revise the abstract to report key metrics (e.g., success rates with standard deviations from 5 runs per task) and add explicit statistical tests (paired t-tests, p < 0.05) to §3. All baseline architectures and prompting details will be cross-referenced to the new ablation subsection for completeness. revision: yes

  3. Referee: §2.2 (Hierarchical Skill Architecture): The discipline-level skill is described as supplying 'scientific principles governing planning and verification,' but the manuscript does not show how these principles are encoded or enforced at runtime versus being implicit in the workflow-level reflection steps. This distinction is necessary to isolate the contribution of the explicit hierarchy.

    Authors: The discipline-level skill is implemented as an explicit, callable module containing a fixed set of encoded scientific principles (e.g., ADMET rules, synthetic accessibility heuristics) that are injected via a dedicated runtime function call before planning and verification phases. This operates independently of workflow-level reflection, which only processes execution feedback. We will add pseudocode and a runtime diagram to §2.2 illustrating the distinct call sequence and enforcement mechanism to clearly separate the two levels. revision: yes
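
The call sequence described in the third response can be sketched in a few lines of Python. This is not the authors' pseudocode, which the response only promises for a revision. The names `PRINCIPLES`, `inject_principles`, `reflect`, and `run_episode`, as well as the principle texts, are hypothetical, and the sketch only illustrates the claimed separation: principles are injected by a dedicated call before planning and before verification, while reflection handles execution feedback alone.

```python
from typing import Callable, Dict, List

# Hypothetical principle texts; the paper's actual encoding is not shown in this review.
PRINCIPLES: Dict[str, List[str]] = {
    "planning": ["prefer validated pipelines over ad hoc scripts"],
    "verification": ["reject results that violate stated constraints"],
}


def inject_principles(phase: str, context: List[str]) -> List[str]:
    """Discipline-level call: prepend the fixed principles for one phase."""
    return [f"[principle] {p}" for p in PRINCIPLES[phase]] + context


def reflect(feedback: str, context: List[str]) -> List[str]:
    """Workflow-level reflection: appends execution feedback and nothing else."""
    return context + [f"[feedback] {feedback}"]


def run_episode(task: str, execute: Callable[[str], str]) -> List[str]:
    """Record the order of calls: inject, plan, execute, reflect, inject, verify."""
    trace: List[str] = []
    trace.extend(inject_principles("planning", [f"[task] {task}"]))
    feedback = execute(task)
    trace = reflect(feedback, trace)
    trace.extend(inject_principles("verification", [f"[verify] {task}"]))
    return trace


# Toy executor standing in for a real tool call.
trace = run_episode("dock ligand", lambda t: f"{t}: ok")
for line in trace:
    print(line)
```

Logging a trace like this for real runs would let a referee confirm that principle injection and reflection are distinct calls rather than one mechanism described twice.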

Circularity Check

0 steps flagged

No circularity: new architecture and benchmark evaluated empirically

Full rationale

The paper introduces MolClaw's three-tier hierarchy and MolBench as original contributions, with SOTA claims and ablation results presented as direct empirical outcomes on the new benchmark rather than reductions to prior inputs. No equations, fitted parameters, self-definitional loops, or load-bearing self-citations appear in the derivation of the core claims. Ablation statements about gains vanishing on ad-hoc tasks are framed as experimental observations, not tautological by construction. The chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

Abstract-only review provides no explicit free parameters, axioms, or invented entities beyond the agent and benchmark themselves; full paper would be needed to audit any hidden hyperparameters or domain assumptions about tool reliability.

invented entities (1)
  • MolClaw hierarchical skill architecture (no independent evidence)
    purpose: To enable long-term robust interaction across 30+ domain tools for drug discovery workflows
    The three-tier system (tool, workflow, discipline) is presented as the core innovation; no independent evidence outside the paper is supplied in the abstract.



Reference graph

Works this paper leans on

84 extracted references · 84 canonical work pages · 4 internal anchors

  1. [1]

    Gromacs: High performance molecular simulations through multi-level paral- lelism from laptops to supercomputers.SoftwareX, 1:19–25, 2015

    Mark James Abraham, Teemu Murtola, Roland Schulz, Szilárd Páll, Jeremy C Smith, Berk Hess, and Erik Lindahl. Gromacs: High performance molecular simulations through multi-level paral- lelism from laptops to supercomputers.SoftwareX, 1:19–25, 2015

  2. [2]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Ale- man, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023. 27

  3. [3]

    Alphafold db: Open repository of protein structure predictions

    AlphaFold Database Consortium. Alphafold db: Open repository of protein structure predictions. https://alphafold.ebi.ac.uk/, 2024

  4. [4]

    The claude model family.https://www.anthropic.com/claude, 2024

    Anthropic. The claude model family.https://www.anthropic.com/claude, 2024

  5. [5]

    Claude code: a command-line tool for agentic coding.https://docs.anthropic.c om/en/docs/claude-code, 2025

    Anthropic. Claude code: a command-line tool for agentic coding.https://docs.anthropic.c om/en/docs/claude-code, 2025

  6. [6]

    Liddia: Language-based intelligent drug discovery agent

    Reza Averly, Frazier N Baker, Ian A Watson, and Xia Ning. Liddia: Language-based intelligent drug discovery agent. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 12015–12039, 2025

  7. [7]

    Qwen Technical Report

    Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report.arXiv preprint arXiv:2309.16609, 2023

  8. [8]

    Artificial intelligence in drug discovery: what is realistic, what are illusions? part 1: Ways to make an impact, and why we are not there yet

    Andreas Bender and Isidro Cortés-Ciriano. Artificial intelligence in drug discovery: what is realistic, what are illusions? part 1: Ways to make an impact, and why we are not there yet. Drug discovery today, 26(2):511–524, 2021

  9. [9]

    Quantifying the chemical beauty of drugs.Nature Chemistry, 4:90–98, 2012

    G Richard Bickerton, Gaia V Paolini, Jérémy Besnard, Sorel Muresan, and Andrew L Hopkins. Quantifying the chemical beauty of drugs.Nature Chemistry, 4:90–98, 2012

  10. [10]

    Autonomous chemical research with large language models.Nature, 624(7992):570–578, 2023

    Daniil A Boiko, Robert MacKnight, Ben Kline, and Gabe Gomes. Autonomous chemical research with large language models.Nature, 624(7992):570–578, 2023

  11. [11]

    Prolif: a library to encode molecular interactions as fingerprints.Journal of cheminformatics, 13(1):72, 2021

    Cédric Bouysset and Sébastien Fiorucci. Prolif: a library to encode molecular interactions as fingerprints.Journal of cheminformatics, 13(1):72, 2021

  12. [12]

    Evobind: in silico directed evolution of peptide binders with alphafold.bioRxiv, pages 2022–07, 2022

    Patrick Bryant and Arne Elofsson. Evobind: in silico directed evolution of peptide binders with alphafold.bioRxiv, pages 2022–07, 2022

  13. [13]

    Generic protein–ligand interaction scoring by inte- grating physical prior knowledge and data augmentation modelling.Nature Machine Intelligence, 6(6):688–700, 2024

    Duanhua Cao, Geng Chen, Jiaxin Jiang, Jie Yu, Runze Zhang, Mingan Chen, Wei Zhang, Lifan Chen, Feisheng Zhong, Yingying Zhang, et al. Generic protein–ligand interaction scoring by inte- grating physical prior knowledge and data augmentation modelling.Nature Machine Intelligence, 6(6):688–700, 2024

  14. [14]

    Mozi: Governed autonomy for drug discovery llm agents.arXiv preprint arXiv:2603.03655, 2026

    He Cao, Siyu Liu, Fan Zhang, Zijing Liu, Hao Li, Bin Feng, Shengyuan Bai, Leqing Chen, Kai Xie, and Yu Li. Mozi: Governed autonomy for drug discovery llm agents.arXiv preprint arXiv:2603.03655, 2026

  15. [15]

    Chembl: The global bioactivity database for drug discovery.https: //www.ebi.ac.uk/chembl/, 2024

    ChEMBL Consortium. Chembl: The global bioactivity database for drug discovery.https: //www.ebi.ac.uk/chembl/, 2024

  16. [16]

    Diffdock: Diffusion steps, twists, and turns for molecular docking.arXiv preprint arXiv:2210.01776, 2022

    Gabriele Corso, Hannes Stärk, Bowen Jing, Regina Barzilay, and Tommi Jaakkola. Diffdock: Diffusion steps, twists, and turns for molecular docking.arXiv preprint arXiv:2210.01776, 2022

  17. [17]

    Robust deep learning–based protein sequence design using proteinmpnn.Science, 378(6615):49–56, 2022

    Justas Dauparas, Ivan Anishchenko, Nathaniel Bennett, Hua Bai, Robert J Ragotte, Lukas F Milles, Basile IM Wicky, Alexis Courbet, Rob J de Haas, Neville Bethel, et al. Robust deep learning–based protein sequence design using proteinmpnn.Science, 378(6615):49–56, 2022

  18. [18]

    Pymol: An open-source molecular graphics tool.CCP4 Newsl

    Warren L DeLano et al. Pymol: An open-source molecular graphics tool.CCP4 Newsl. protein crystallogr, 40(1):82–92, 2002

  19. [19]

    Leading ai-driven drug discovery platforms: 2025 landscape and global outlook

    Mahendiran Dharmasivam, Busra Kaya, Adedoyin Akinware, Mahan Gholam Azad, and Des R Richardson. Leading ai-driven drug discovery platforms: 2025 landscape and global outlook. Pharmacological Reviews, page 100102, 2025

  20. [20]

    Vina-gpu 2.0: further accelerating autodock vina and its derivatives with graphics processing units.Journal of chemical information and modeling, 63(7):1982–1998, 2023

    Ji Ding, Shidi Tang, Zheming Mei, Lingyue Wang, Qinqin Huang, Haifeng Hu, Ming Ling, and Jiansheng Wu. Vina-gpu 2.0: further accelerating autodock vina and its derivatives with graphics processing units.Journal of chemical information and modeling, 63(7):1982–1998, 2023. 28

  21. [21]

    Openmm 7: Rapid development of high performance algorithms for molecular dynamics.PLoS computational biology, 13(7):e1005659, 2017

    Peter Eastman, Jason Swails, John D Chodera, Robert T McGibbon, Yutong Zhao, Kyle A Beauchamp, Lee-Ping Wang, Andrew C Simmonett, Matthew P Harrigan, Chaya D Stern, et al. Openmm 7: Rapid development of high performance algorithms for molecular dynamics.PLoS computational biology, 13(7):e1005659, 2017

  22. [22]

    Autodock vina 1.2

    Jerome Eberhardt, Diogo Santos-Martins, Andreas F Tillack, and Stefano Forli. Autodock vina 1.2. 0: new docking methods, expanded force field, and python bindings.Journal of chemical information and modeling, 61(8):3891–3898, 2021

  23. [23]

    Internagent-1.5: A unified agentic framework for long-horizon autonomous scientific discovery.arXiv preprint arXiv:2602.08990, 2026

    Shiyang Feng, Runmin Ma, Xiangchao Yan, Yue Fan, Yusong Hu, Songtao Huang, Shuaiyu Zhang, Zongsheng Cao, Tianshuo Peng, Jiakang Yuan, et al. Internagent-1.5: A unified agentic framework for long-horizon autonomous scientific discovery.arXiv preprint arXiv:2602.08990, 2026

  24. [24]

    Glide: a new approach for rapid, accurate docking and scoring

    Richard A Friesner, Jay L Banks, Robert B Murphy, Thomas A Halgren, Jasna J Klicic, Daniel T Mainz, Matthew P Repasky, Eric H Knoll, Mee Shelley, Jason K Perry, et al. Glide: a new approach for rapid, accurate docking and scoring. 1. method and assessment of docking accuracy. Journal of medicinal chemistry, 47(7):1739–1749, 2004

  25. [25]

    Empowering biomedical discovery with ai agents.Cell, 187(22):6125–6151, 2024

    Shanghua Gao, Ada Fang, Yepeng Huang, Valentina Giunchiglia, Ayush Noori, Jonathan Richard Schwarz, Yasha Ektefaie, Jovana Kondic, and Marinka Zitnik. Empowering biomedical discovery with ai agents.Cell, 187(22):6125–6151, 2024

  26. [26]

    Txagent: an ai agent for therapeutic reasoning across a universe of tools.arXiv preprint arXiv:2503.10970, 2025

    Shanghua Gao, Richard Zhu, Zhenglun Kong, Ayush Noori, Xiaorui Su, Curtis Ginder, Theodoros Tsiligkaridis, and Marinka Zitnik. Txagent: an ai agent for therapeutic reasoning across a universe of tools.arXiv preprint arXiv:2503.10970, 2025

  27. [27]

    Democratising real-world drug discovery through agentic ai.Drug Discovery Today, page 104605, 2026

    Jiazhen He, Helen Lai, Lakshidaa Saigiridharan, Gian Marco Ghiandoni, Kinga Jenei, Umur Gokalp, Ajsa Nukovic, Ola Engkvist, Jon Paul Janet, and Samuel Genheden. Democratising real-world drug discovery through agentic ai.Drug Discovery Today, page 104605, 2026

  28. [28]

    Biomni: A general-purpose biomedical ai agent.biorxiv, 2025

    Kexin Huang, Serena Zhang, Hanchen Wang, Yuanhao Qu, Yingzhou Lu, Yusuf Roohani, Ryan Li, Lin Qiu, Gavin Li, Junze Zhang, et al. Biomni: A general-purpose biomedical ai agent.biorxiv, 2025

  29. [29]

    Illuminating protein space with a programmable generative model.Nature, 623(7989):1070–1078, 2023

    John B Ingraham, Max Baranov, Zak Costello, Karl W Barber, Wujie Wang, Ahmed Ismail, VincentFrappier, DanaMLord, ChristopherNg-Thow-Hing, ErikRVanVlack, etal. Illuminating protein space with a programmable generative model.Nature, 623(7989):1070–1078, 2023

  30. [30]

    Scp: Accelerating discovery with a global web of autonomous scientific agents.arXiv preprint arXiv:2512.24189, 2025

    Yankai Jiang, Wenjie Lou, Lilong Wang, Zhenyu Tang, Shiyang Feng, Jiaxuan Lu, Haoran Sun, Yaning Pan, Shuang Gu, Haoyang Su, et al. Scp: Accelerating discovery with a global web of autonomous scientific agents.arXiv preprint arXiv:2512.24189, 2025

  31. [31]

    Deepsite: protein-binding site predictor using 3d-convolutional neural networks.Bioinformatics, 33(19):3036–3042, 2017

    José Jiménez, Stefan Doerr, Gerard Martínez-Rosell, Alexander S Rose, and Gianni De Fabritiis. Deepsite: protein-binding site predictor using 3d-convolutional neural networks.Bioinformatics, 33(19):3036–3042, 2017

  32. [32]

    Highly accurate protein structure prediction with alphafold.nature, 596(7873):583–589, 2021

    JohnJumper, RichardEvans, AlexanderPritzel, TimGreen, MichaelFigurnov, OlafRonneberger, Kathryn Tunyasuvunakool, Russ Bates, Augustin Žídek, Anna Potapenko, et al. Highly accurate protein structure prediction with alphafold.nature, 596(7873):583–589, 2021

  33. [33]

    P2rank: machine learning based tool for rapid and accurate prediction of ligand binding sites from protein structure.Journal of cheminformatics, 10(1):39, 2018

    Radoslav Krivák and David Hoksza. P2rank: machine learning based tool for rapid and accurate prediction of ligand binding sites from protein structure.Journal of cheminformatics, 10(1):39, 2018

  34. [34]

    Fpocket: an open source platform for ligand pocket detection.BMC bioinformatics, 10(1):168, 2009

    Vincent Le Guilloux, Peter Schmidtke, and Pierre Tuffery. Fpocket: an open source platform for ligand pocket detection.BMC bioinformatics, 10(1):168, 2009

  35. [35]

    Scalable emulation of protein equilibrium ensembles with generative deep learning.Science, 389(6761):eadv9817, 2025

    Sarah Lewis, Tim Hempel, José Jiménez-Luna, Michael Gastegger, Yu Xie, Andrew YK Foong, Victor García Satorras, Osama Abdin, Bastiaan S Veeling, Iryna Zaporozhets, et al. Scalable emulation of protein equilibrium ensembles with generative deep learning.Science, 389(6761):eadv9817, 2025. 29

  36. [36]

    Beyond chemical qa: Evaluating llm’s chemical reasoning with modular chemical operations.arXiv preprint arXiv:2505.21318, 2025

    Hao Li, He Cao, Bin Feng, Yanjun Shao, Xiangru Tang, Zhiyuan Yan, Li Yuan, Yonghong Tian, and Yu Li. Beyond chemical qa: Evaluating llm’s chemical reasoning with modular chemical operations.arXiv preprint arXiv:2505.21318, 2025

  37. [37]

    Evolutionary-scale prediction of atomic-level protein structure with a language model.Science, 379(6637):1123–1130, 2023

    Zeming Lin, Halil Akin, Roshan Rao, Brian Hie, Zhongkai Zhu, Wenting Lu, Nikita Smetanin, Robert Verkuil, Ori Kabeli, Yaniv Shmueli, et al. Evolutionary-scale prediction of atomic-level protein structure with a language model.Science, 379(6637):1123–1130, 2023

  38. [38]

    Drugagent: Automating ai-aided drug discovery programming through llm multi-agent collaboration.arXiv preprint arXiv:2411.15692, 2024

    SizheLiu, YizhouLu, SiyuChen, XiyangHu, JieyuZhao, YingzhouLu, andYueZhao. Drugagent: Automating ai-aided drug discovery programming through llm multi-agent collaboration.arXiv preprint arXiv:2411.15692, 2024

  39. [39]

    Reinvent 4: Modern ai–driven generative molecule design.Journal of Cheminformatics, 16(1):20, 2024

    Hannes H Loeffler, Jiazhen He, Alessandro Tibo, Jon Paul Janet, Alexey Voronov, Lewis H Mervin, and Ola Engkvist. Reinvent 4: Modern ai–driven generative molecule design.Journal of Cheminformatics, 16(1):20, 2024

  40. [40]

    Openawsem with open3spn2: A fast, flexible, and accessible framework for large-scale coarse-grained biomolecular simulations.PLoS computational biology, 17(2):e1008308, 2021

    Wei Lu, Carlos Bueno, Nicholas P Schafer, Joshua Moller, Shikai Jin, Xun Chen, Mingchen Chen, Xinyu Gu, Aram Davtyan, Juan J de Pablo, et al. Openawsem with open3spn2: A fast, flexible, and accessible framework for large-scale coarse-grained biomolecular simulations.PLoS computational biology, 17(2):e1008308, 2021

  41. [41]

    Augmenting large language models with chemistry tools. Nature machine intelligence, 6(5):525–535, 2024

    Andres M. Bran, Sam Cox, Oliver Schilter, Carlo Baldassari, Andrew D White, and Philippe Schwaller. Augmenting large language models with chemistry tools. Nature machine intelligence, 6(5):525–535, 2024

  42. [42]

    Augmented language models: a survey

    Grégoire Mialon, Roberto Dessì, Maria Lomeli, Christoforos Nalmpantis, Ram Pasunuru, Roberta Raileanu, Baptiste Rozière, Timo Schick, Jane Dwivedi-Yu, Asli Celikyilmaz, et al. Augmented language models: a survey. arXiv preprint arXiv:2302.07842, 2023

  43. [43]

    Openclaw: an open-source framework for building tool-augmented llm agents. https://github.com/openclaw, 2025

    OpenClaw Contributors. Openclaw: an open-source framework for building tool-augmented llm agents. https://github.com/openclaw, 2025

  44. [44]

    Frogent: An end-to-end full-process drug design agent. arXiv preprint arXiv:2508.10760, 2025

    Qihua Pan, Dong Xu, Jenna Xinyi Yao, Lijia Ma, Zexuan Zhu, and Junkai Ji. Frogent: An end-to-end full-process drug design agent. arXiv preprint arXiv:2508.10760, 2025

  45. [45]

    Boltz-2: Towards accurate and efficient binding affinity prediction. BioRxiv, 2025

    Saro Passaro, Gabriele Corso, Jeremy Wohlwend, Mateo Reveiz, Stephan Thaler, Vignesh Ram Somnath, Noah Getz, Tally Portnoi, Julien Roy, Hannes Stark, et al. Boltz-2: Towards accurate and efficient binding affinity prediction. BioRxiv, 2025

  46. [46]

    Pubchem: Open chemistry database. https://pubchem.ncbi.nlm.nih.gov/, 2025

    PubChem Consortium. Pubchem: Open chemistry database. https://pubchem.ncbi.nlm.nih.gov/, 2025

  47. [47]

    Rcsb pdb: Research collaboratory for structural bioinformatics protein data bank. https://www.rcsb.org/, 2024

    RCSB PDB Consortium. Rcsb pdb: Research collaboratory for structural bioinformatics protein data bank. https://www.rcsb.org/, 2024

  48. [48]

    Rdkit: Open-source cheminformatics software. https://www.rdkit.org/, 2024

    RDKit Consortium. Rdkit: Open-source cheminformatics software. https://www.rdkit.org/, 2024

  49. [49]

    Fast procedure for reconstruction of full-atom protein models from reduced representations. Journal of computational chemistry, 29(9):1460–1465, 2008

    Piotr Rotkiewicz and Jeffrey Skolnick. Fast procedure for reconstruction of full-atom protein models from reduced representations. Journal of computational chemistry, 29(9):1460–1465, 2008

  50. [50]

    Computational approaches streamlining drug discovery. Nature, 616(7958):673–685, 2023

    Anastasiia V Sadybekov and Vsevolod Katritch. Computational approaches streamlining drug discovery. Nature, 616(7958):673–685, 2023

  51. [51]

    Plip: fully automated protein–ligand interaction profiler. Nucleic acids research, 43(W1):W443–W447, 2015

    Sebastian Salentin, Sven Schreiber, V Joachim Haupt, Melissa F Adasme, and Michael Schroeder. Plip: fully automated protein–ligand interaction profiler. Nucleic acids research, 43(W1):W443–W447, 2015

  52. [52]

    Toolformer: Language models can teach themselves to use tools. Advances in neural information processing systems, 36:68539–68551, 2023

    Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. Advances in neural information processing systems, 36:68539–68551, 2023.

  53. [53]

    Understanding and predicting druggability

    Peter Schmidtke and Xavier Barril. Understanding and predicting druggability. a high-throughput method for detection of drug binding sites. Journal of medicinal chemistry, 53(15):5858–5867, 2010

  54. [54]

    Why 90% of clinical drug development fails and how to improve it? Acta Pharmaceutica Sinica B, 12(7):3049–3062, 2022

    Duxin Sun, Wei Gao, Hongxiang Hu, and Simon Zhou. Why 90% of clinical drug development fails and how to improve it? Acta Pharmaceutica Sinica B, 12(7):3049–3062, 2022

  55. [55]

    Admet-ai: a machine learning admet platform for evaluation of large-scale chemical libraries. Bioinformatics, 40(7):btae416, 2024

    Kyle Swanson, Parker Walther, Jeremy Leitz, Souhrid Mukherjee, Joseph C Wu, Rabindra V Shivnaraine, and James Zou. Admet-ai: a machine learning admet platform for evaluation of large-scale chemical libraries. Bioinformatics, 40(7):btae416, 2024

  56. [56]

    Chai-1: Decoding the molecular interactions of life

    Chai Discovery team, Jacques Boitreaud, Jack Dent, Matthew McPartlon, Joshua Meier, Vinicius Reis, Alex Rogozhnikov, and Kevin Wu. Chai-1: Decoding the molecular interactions of life. BioRxiv, pages 2024–10, 2024

  57. [57]

    Gemini: A Family of Highly Capable Multimodal Models

    Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2024

  58. [58]

    Uniprot: The universal protein knowledgebase. https://www.uniprot.org/, 2025

    The UniProt Consortium. Uniprot: The universal protein knowledgebase. https://www.uniprot.org/, 2025

  59. [59]

    Benchmarking compound activity prediction for real-world drug discovery applications. Communications Chemistry, 7(1):127, 2024

    Tingzhong Tian, Shuya Li, Ziting Zhang, Lin Chen, Ziheng Zou, Dan Zhao, and Jianyang Zeng. Benchmarking compound activity prediction for real-world drug discovery applications. Communications Chemistry, 7(1):127, 2024

  60. [60]

    LLaMA: Open and Efficient Foundation Language Models

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023

  61. [61]

    gmx_mmpbsa: a new tool to perform end-state free energy calculations with gromacs. Journal of chemical theory and computation, 17(10):6281–6291, 2021

    Mario S Valdés-Tresanco, Mario E Valdés-Tresanco, Pedro A Valiente, and Ernesto Moreno. gmx_mmpbsa: a new tool to perform end-state free energy calculations with gromacs. Journal of chemical theory and computation, 17(10):6281–6291, 2021

  62. [62]

    Applications of machine learning in drug discovery and development. Nature reviews Drug discovery, 18(6):463–477, 2019

    Jessica Vamathevan, Dominic Clark, Paul Czodrowski, Ian Dunham, Edgardo Ferran, George Lee, Bin Li, Anant Madabhushi, Parantu Shah, Michaela Spitzer, et al. Applications of machine learning in drug discovery and development. Nature reviews Drug discovery, 18(6):463–477, 2019

  63. [63]

    Prompt-to-pill: Multi-agent drug discovery and clinical simulation pipeline. Bioinformatics Advances, 6(1):vbaf323, 2026

    Ivana Vichentijevikj, Kostadin Mishev, and Monika Simjanoska Misheva. Prompt-to-pill: Multi-agent drug discovery and clinical simulation pipeline. Bioinformatics Advances, 6(1):vbaf323, 2026

  64. [64]

    Structure-based protein assembly simulations including various binding sites and conformations. Journal of Chemical Information and Modeling, 64(8):3465–3476, 2024

    Luis J Walter, Patrick K Quoika, and Martin Zacharias. Structure-based protein assembly simulations including various binding sites and conformations. Journal of Chemical Information and Modeling, 64(8):3465–3476, 2024

  65. [65]

    Scientific discovery in the age of artificial intelligence. Nature, 620(7972):47–60, 2023

    Hanchen Wang, Tianfan Fu, Yuanqi Du, Wenhao Gao, Kexin Huang, Ziming Liu, Payal Chandak, Shengchao Liu, Peter Van Katwyk, Andreea Deac, et al. Scientific discovery in the age of artificial intelligence. Nature, 620(7972):47–60, 2023

  66. [66]

    Estimated research and development investment needed to bring a new medicine to market, 2009-2018. Jama, 323(9):844–853, 2020

    Olivier J Wouters, Martin McKee, and Jeroen Luyten. Estimated research and development investment needed to bring a new medicine to market, 2009-2018. Jama, 323(9):844–853, 2020

  67. [67]

    The hdock server for integrated protein–protein docking. Nature protocols, 15(5):1829–1852, 2020

    Yumeng Yan, Huanyu Tao, Jiahua He, and Sheng-You Huang. The hdock server for integrated protein–protein docking. Nature protocols, 15(5):1829–1852, 2020

  68. [68]

    React: Synergizing reasoning and acting in language models

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. In The eleventh international conference on learning representations, 2022

  69. [69]

    Efficient and accurate large library ligand docking with karmadock. Nature Computational Science, 3(9):789–804, 2023

    Xujun Zhang, Odin Zhang, Chao Shen, Wanglin Qu, Shicheng Chen, Hanqun Cao, Yu Kang, Zhe Wang, Ercheng Wang, Jintu Zhang, et al. Efficient and accurate large library ligand docking with karmadock. Nature Computational Science, 3(9):789–804, 2023.

  70. [70]

    Activity cliff prediction: Dataset and benchmark. arXiv preprint arXiv:2302.07541, 2023

    Ziqiao Zhang, Bangyi Zhao, Ailin Xie, Yatao Bian, and Shuigeng Zhou. Activity cliff prediction: Dataset and benchmark. arXiv preprint arXiv:2302.07541, 2023

  71. [71]

    Prediction of drug efficacy from transcriptional profiles with deep learning. Nature biotechnology, 39(11):1444–1452, 2021

    Jie Zhu, Jingxiang Wang, Xin Wang, Mingjing Gao, Bingbing Guo, Miaomiao Gao, Jiarui Liu, Yanqiu Yu, Liang Wang, Weikaixin Kong, et al. Prediction of drug efficacy from transcriptional profiles with deep learning. Nature biotechnology, 39(11):1444–1452, 2021.

    Data Availability. Both the MolBench dataset (CSV format) and associated evaluation code can be acc...

  72. [72]

    Here we derive this bound analytically

    Predicting the optimization ceiling from scaffold topology. A central finding of this study is that the triazolo-benzodiazepine scaffold imposes a hard upper bound on the achievable QED score. Here we derive this bound analytically. QED is defined as a weighted geometric mean of eight component desirability functions $d_i$ (ref. [9]): $\mathrm{QED} = \exp \sum_{i=1}^{8} w_i \ln d_i$ ...

  73. [73]

    Tanimoto budget exhaustion

    Tanimoto budget exhaustion as a convergence diagnostic. In our main Results we noted that the qualification rate (the fraction of generated molecules satisfying the Tanimoto ≥ 0.40 constraint) declined from 100% (R1–R2) to 57.6% (R5). We propose that this declining rate constitutes a generalizable convergence diagnostic that we term “Tanimoto budget exhaustion...

  74. [74]

    The structural alerts (ALERTS) desirability exhibited a discontinuous phase transition: 0.241 at R0 to 0.842 at R1, with no further change in R2–R5

    Phase transitions versus gradual improvement in property optimization. Not all QED components improved gradually. The structural alerts (ALERTS) desirability exhibited a discontinuous phase transition: 0.241 at R0 to 0.842 at R1, with no further change in R2–R5. This occurred because the starting molecule’s butyl ester triggered two Brenk structural alerts...

  75. [75]

    propose 3–5 modified molecules with chemical rationale

    The interpretability–efficiency trade-off in generative design. The evaluation question instructed the agent to “propose 3–5 modified molecules with chemical rationale.” Instead, the agent employed REINVENT4 batch generation to produce 23–54 candidates per round and selected the best by QED ranking. This substitution raises a fundamental question about A...

  76. [76]

    unmonitored endpoint alarm

    Systematic blind spot detection in multi-objective optimization. The agent’s failure to detect the AMES mutagenicity deterioration (+180%, from 0.165 to 0.462) despite tracking over 13 ADMET endpoints illustrates a general vulnerability of attention-based monitoring. The agent explicitly tracked CYP3A4, hERG and DILI at each round, all of which improved...

  77. [77]

    Cost-effectiveness and practical stopping rules. The diminishing returns pattern (Fig. 7H) has direct implications for computational resource allocation. Assuming roughly equal tool-call costs per round, and measuring against the R0-to-R4 improvement (+0.4216) since R4 is the recommended molecule, R1 delivers 83.0%, R1–R2 deliver 88.1%, R1–R3 deliver 94.0%, and R1–R4 ...

  78. [78]

    Multi-round iterative optimization as an emergent agent capability. The E2E-Q3 task required the AI agent to execute an iterative closed-loop optimization cycle (Strategize, Generate, Dock, Evaluate) across up to 15 rounds, with autonomous decision-making at each round boundary. This represents a fundamentally different challenge from single-step computat...

  79. [79]

    erlotinib

    Long-range planning, self-repair, and emergent medicinal chemistry knowledge. The agent autonomously authored four pipeline versions (v1–v4, totaling 163 KB of Python), progressively diagnosing and recovering from crashes: v1 failed due to NumPy/RDKit incompatibility, v2 succeeded through R1 but crashed on an f-string bug, v3 resumed from R2 with pre-pro...

  80. [80]

    ethynyl fixation

    Agent versus REINVENT: complementary collaboration rather than competition. The 3:3 tie in round winners and the non-significant pooled comparison (p = 0.104) mask a deeper complementarity. REINVENT excelled at creative molecular recombination: it serendipitously discovered methoxy shortening in R1 (not hypothesized by the agent), generated the F+OH+CH3 m...

Showing first 80 references.
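
Entry 72 describes QED as a weighted geometric mean of eight component desirability values. The sketch below shows only that aggregation step, and it is illustrative rather than the authors' code. The function name, the equal weights, and the placeholder value of 0.8 for seven of the components are all assumptions made here. Only the two ALERTS desirability values (0.241 and 0.842) come from entry 74.

```python
import math


def weighted_geometric_mean(desirabilities, weights):
    """Return exp(sum(w_i * ln d_i) / sum(w_i)) for strictly positive d_i."""
    if len(desirabilities) != len(weights):
        raise ValueError("need one weight per desirability value")
    if any(d <= 0.0 for d in desirabilities):
        raise ValueError("desirability values must be positive")
    total_weight = sum(weights)
    if total_weight <= 0.0:
        raise ValueError("weights must sum to a positive number")
    log_sum = sum(w * math.log(d) for d, w in zip(desirabilities, weights))
    return math.exp(log_sum / total_weight)


# ASSUMPTION: equal weights; the published QED weights are not given in the text.
EQUAL_WEIGHTS = [1.0] * 8
# ASSUMPTION: placeholder desirability of 0.8 for the seven non-ALERTS components.
OTHER_SEVEN = [0.8] * 7

# ALERTS desirability values quoted in entry 74: 0.241 at R0, 0.842 at R1.
score_r0 = weighted_geometric_mean(OTHER_SEVEN + [0.241], EQUAL_WEIGHTS)
score_r1 = weighted_geometric_mean(OTHER_SEVEN + [0.842], EQUAL_WEIGHTS)

print(f"illustrative score with ALERTS = 0.241: {score_r0:.3f}")
print(f"illustrative score with ALERTS = 0.842: {score_r1:.3f}")
```

Because the mean is geometric, a single low component pulls the aggregate down multiplicatively, which is consistent with the step change that entry 74 reports once the structural alerts were removed. The printed numbers depend entirely on the placeholder inputs and should not be read as results from the paper.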