arxiv: 2604.21937 · v1 · submitted 2026-04-02 · 💻 cs.AI · cs.MA

MolClaw: An Autonomous Agent with Hierarchical Skills for Drug Molecule Evaluation, Screening, and Optimization

Lisheng Zhang, Lilong Wang, Xiangyu Sun, Wei Tang, Haoyang Su, Yuehui Qian, Qikui Yang, Qingsong Li, Zhenyu Tang, Haoran Sun, Yingnan Han, Yankai Jiang, Wenjie Lou, Bowen Zhou, Xiaosong Wang, Lei Bai, Zhengwei Xie

Pith reviewed 2026-05-13 21:23 UTC · model grok-4.3

keywords: autonomous agents · drug discovery · hierarchical skills · molecular screening · AI agents · benchmarks · workflow orchestration · computational chemistry

The pith

A three-tier hierarchical skill architecture lets an AI agent reliably handle complex multi-step drug molecule workflows.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents MolClaw, an autonomous agent that integrates more than 30 specialized tools for drug molecule evaluation, screening, and optimization. Skills are organized into three tiers: basic tool operations, composed workflows that include quality checks and reflection, and a top-level discipline skill that supplies scientific principles for planning. This setup supports long sequences of tool calls that current agents handle poorly. The authors introduce MolBench, a benchmark with tasks ranging from 8 to over 50 sequential calls, and show that MolClaw reaches state-of-the-art results across metrics. Gains appear mainly on structured workflow tasks and disappear on simpler ad hoc problems, pointing to workflow orchestration as the main limit for AI in this domain.

Core claim

MolClaw unifies over 30 domain resources into 70 skills through a three-tier hierarchy that enables robust long-term interaction: tool-level skills standardize atomic operations, workflow-level skills assemble them into validated pipelines with built-in checks, and a discipline-level skill supplies governing scientific principles. On the introduced MolBench benchmark of molecular screening, optimization, and end-to-end discovery tasks, MolClaw attains state-of-the-art performance, with ablation results showing that improvements concentrate on tasks needing structured workflows and vanish on those solvable by ad hoc scripting.

What carries the argument

The three-tier hierarchical skill architecture that organizes atomic tool operations, validated multi-step pipelines, and field-wide scientific principles to support planning and verification.
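
The layering described above can be sketched in a few lines of Python. This is an illustrative toy, not the paper's implementation: the class names (`ToolSkill`, `WorkflowSkill`, `DisciplineSkill`), the two stand-in tools, and the molecular-weight values are all hypothetical, and no real chemistry library is called.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class ToolSkill:
    """Tier 1: wraps one atomic operation behind a uniform call signature."""
    name: str
    run: Callable[[dict], dict]


@dataclass
class WorkflowSkill:
    """Tier 2: an ordered pipeline of tool skills with a check after each step."""
    name: str
    steps: List[ToolSkill]
    check: Callable[[str, dict], bool]  # (step name, state) -> passed?

    def run(self, state: dict) -> dict:
        for step in self.steps:
            state = {**state, **step.run(state)}
            if not self.check(step.name, state):
                # A real agent would reflect and retry here; the sketch just stops.
                raise ValueError(f"quality check failed after {step.name}")
        return state


@dataclass
class DisciplineSkill:
    """Tier 3: field-wide principles consulted when planning and verifying."""
    principles: Dict[str, Callable[[dict], bool]] = field(default_factory=dict)

    def violations(self, state: dict) -> List[str]:
        return [name for name, ok in self.principles.items() if not ok(state)]


# Toy stand-ins for real tools (hypothetical values, for illustration only).
compute_weight = ToolSkill(
    "compute_weight", lambda s: {"mw": 46.07 if s["smiles"] == "CCO" else 999.0}
)
flag_small = ToolSkill("flag_small", lambda s: {"is_small": s["mw"] < 500})

screen = WorkflowSkill(
    name="toy_screen",
    steps=[compute_weight, flag_small],
    check=lambda step, s: "mw" in s and s["mw"] > 0,
)
discipline = DisciplineSkill({"weight_is_plausible": lambda s: 0 < s["mw"] < 2000})

result = screen.run({"smiles": "CCO"})
print(result, discipline.violations(result))
```

The point of the split is that each tier can be removed independently: dropping `check` leaves a flat tool chain, and dropping `DisciplineSkill` leaves pipelines with no field-level verification.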

If this is right

Agents gain the ability to maintain quality across dozens of sequential tool calls in drug discovery.
Performance improvements are tied directly to tasks that require validated workflow composition.
Workflow orchestration competence becomes the central capability needed to advance AI-driven drug discovery.
End-to-end discovery challenges spanning screening through optimization become feasible for autonomous agents.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same layered skill design could transfer to other scientific fields that combine many specialized tools, such as materials design or synthetic biology.
Embedding skills at runtime may reduce dependence on ever-longer prompts for complex domains.
Testing the architecture on physical lab robots would show whether the digital workflow gains carry over to real experimental loops.
If workflow orchestration is the bottleneck, future agent work should prioritize pipeline validation over raw tool count.

Load-bearing premise

The three-tier hierarchical skill structure is the main reason for better performance rather than prompt engineering, tool selection, or benchmark tuning.

What would settle it

Measure whether removing the workflow-level or discipline-level skills causes MolClaw's advantage to disappear specifically on the 8-to-50-step tasks in MolBench while leaving simple tasks unchanged.
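
One way to phrase that test is as an interaction: the gap between the full agent and an ablated one should be large on long tasks and near zero on short ones. The sketch below assumes per-task records with hypothetical field names (`variant`, `n_calls`, `success`) and a cutoff of 8 calls taken from the benchmark's stated lower bound. The sample records are made up to exercise the function and are not MolBench results.

```python
from statistics import mean

LONG_TASK_MIN_CALLS = 8  # assumption: "long" means at least 8 sequential tool calls


def mean_success(records, variant, long_tasks):
    """Mean success of one agent variant on long (True) or short (False) tasks."""
    scores = [
        r["success"]
        for r in records
        if r["variant"] == variant
        and (r["n_calls"] >= LONG_TASK_MIN_CALLS) == long_tasks
    ]
    return mean(scores)


def hierarchy_effect(records, full="full", ablated="tool_only"):
    """Gap (full minus ablated) on long and short tasks, plus their difference."""
    gap_long = mean_success(records, full, True) - mean_success(records, ablated, True)
    gap_short = mean_success(records, full, False) - mean_success(records, ablated, False)
    return {"gap_long": gap_long, "gap_short": gap_short, "interaction": gap_long - gap_short}


# Made-up placeholder records, only to show the call shape.
records = [
    {"variant": "full", "n_calls": 30, "success": 1},
    {"variant": "full", "n_calls": 12, "success": 1},
    {"variant": "tool_only", "n_calls": 30, "success": 0},
    {"variant": "tool_only", "n_calls": 12, "success": 1},
    {"variant": "full", "n_calls": 3, "success": 1},
    {"variant": "tool_only", "n_calls": 3, "success": 1},
]
effect = hierarchy_effect(records)
print(effect)
```

A positive `interaction` with `gap_short` close to zero is the pattern the paper's claim predicts. A similar gap on both task types would point to a general prompting advantage instead.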

Figures

Figures reproduced from arXiv: 2604.21937 by Bowen Zhou, Haoran Sun, Haoyang Su, Lei Bai, Lilong Wang, Lisheng Zhang, Qikui Yang, Qingsong Li, Wei Tang, Wenjie Lou, Xiangyu Sun, Xiaosong Wang, Yankai Jiang, Yingnan Han, Yuehui Qian, Zhengwei Xie, Zhenyu Tang.

**Figure 2.** Agent execution traces for the three MolBench-E2E tasks. (A) E2E-Q1: coarse-grained conformational sampling. Five tool-level failures (red) were resolved via skill-governed recovery actions (orange), yielding 20 verified all-atom structures. (B) E2E-Q2: QED-driven iterative optimization. One tool fallback (F1), two constraint-driven rejections (F2–F3), and five strategy adaptations (D1–D5) were autonomous…

**Figure 3.** MolClaw achieves state-of-the-art performance across all MolBench evaluation dimensions. (A) Binding affinity comparison accuracy. MolClaw-CC achieves 81.1%. (B) Docking screening hit count. MolClaw-CC attains 0.80. (C) Molecule editing accuracy. MolClaw-CC reaches 100.0%. (D) Optimization success rate. (E) Property filtering accuracy. (F) Property filtering F1 score. (G) Agent systems grouped comparison a…

**Figure 4.** Statistical validation confirms the significance and reliability of MolClaw’s performance advantages. (A) Normalized performance heatmap across seven metrics for 13 methods. MolClaw variants highlighted by red borders. (B–E) Wilson score 95% CI forest plots for binding affinity accuracy (B), molecule editing accuracy (C), optimization success rate (D), and property filtering accuracy (E). (F–I) Category-l…

**Figure 5.** Ablation studies and in-depth statistical analyses reveal the mechanistic basis of MolClaw’s superiority. (A–C) Ablation on Claude Code and OpenClaw platforms: accuracy metrics (A), docking hit count (B), optimization delta (C). Largest skill-driven gain: binding affinity +29.7 pp (P = 0.013, h = 0.64). (D) Rank trajectory across four tasks for top six methods. (E) Average rank (Friedman χ² = 35.35, P = 2…

**Figure 6.** Coarse-grained conformational sampling of the EGFR kinase domain by OpenAWSEM and GoCa. (A) Superposition of 10 PULCHRA-reconstructed all-atom conformations from the OpenAWSEM ensemble, aligned to the 1M17 crystal structure. (B) Corresponding superposition for the GoCa ensemble. (C) Cα-RMSD to native structure: GoCa 4.54 ± 0.93 Å versus AWSEM 7.78 ± 1.53 Å (P = 7.69 × 10⁻⁴). (D) Radius of gyration: GoCa…

**Figure 7.** QED-driven iterative optimization of a triazolo-benzodiazepine scaffold by the AI agent. (A) Multi-dimensional property trajectory across five optimization rounds: QED score (target ≥ 0.70), MW, ALogP, Tanimoto similarity (constraint ≥ 0.40), TPSA and rotatable bonds. (B) QED desirability decomposition by component and round (R0–R5). (C) QED–Tanimoto trade-off for all 182 molecules; red stars, selected bes…

**Figure 8.** Comprehensive evaluation of AI-agent-driven iterative lead optimization of Erlotinib targeting the EGFR kinase domain. (A) Optimization trajectory showing best QuickVina docking score per round; blue dashed line: Erlotinib baseline (−6.9 kcal/mol); red dashed line: −8.9 kcal/mol target. (B) Docking score distributions across Rounds 1–6 (box-and-strip plot, n = 54). (C) Tanimoto similarity heatmap between r…

**Figure 9.** Schrödinger-style 2D protein–ligand interaction diagrams and 3D pose overlay. (A) Erlotinib baseline (−6.9 kcal/mol): two H-bonds (Thr766, Asp831) and eight hydrophobic contacts. (B) R1 best (−7.4): methoxy shortening + meta-Br; new Met769 H-bond (2.99 Å). (C) R2 best (−8.0): Br→F substitution; Met769 maintained. (D) R3 best (−8.3): F + OH + CH3 on aniline. (E) R4 best (−8.9, target met): 2,6-diF-4-OH anil…

**Figure 10.** Statistical validation, source attribution, and interaction conservation analysis. (A) Docking scores of all 54 molecules by source: REINVENT4 (blue, n = 20) and agent-designed (red, n = 34). (B) Per-round mean scores ± s.e.m. tested against baseline (Wilcoxon); R1 n.s., R2–R3 ∗∗, R4–R6 ∗∗∗. (C) Early (R1–R3) vs. late (R4–R6) violin plot (p = 1.24 × 10⁻⁴). (D) Agent vs. REINVENT violin plot (p = 0.104, n…
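
The captions for Figures 4 and 5 mention Wilson score 95% confidence intervals and an effect size h. For readers who want to sanity-check such numbers, here is a minimal Python sketch of both quantities. It assumes that h denotes Cohen's h for two proportions, which the caption does not spell out, and that the +29.7 pp gain is measured against the 81.1% binding affinity accuracy quoted in Figure 3. The sample size n = 37 is hypothetical, chosen only because 30/37 is about 81.1%.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 is about 95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h effect size for the difference between two proportions."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))


# Assumption: the ablated accuracy is 81.1% minus the quoted 29.7 pp gain.
h = cohens_h(0.811, 0.811 - 0.297)
print(round(h, 2))

# Hypothetical sample size; the paper's actual task count is not given here.
low, high = wilson_interval(30, 37)
print(round(low, 3), round(high, 3))
```

Under those assumptions the computed h rounds to 0.64, which is consistent with the value in the Figure 5 caption. The Wilson interval is wide at this sample size, which is one reason the review asks for run counts and error bars.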

Original abstract

Computational drug discovery, particularly the complex workflows of drug molecule screening and optimization, requires orchestrating dozens of specialized tools in multi-step workflows, yet current AI agents struggle to maintain robust performance and consistently underperform in these high-complexity scenarios. Here we present MolClaw, an autonomous agent that leads drug molecule evaluation, screening, and optimization. It unifies over 30 specialized domain resources through a three-tier hierarchical skill architecture (70 skills in total) that facilitates agent long-term interaction at runtime: tool-level skills standardize atomic operations, workflow-level skills compose them into validated pipelines with quality check and reflection, and a discipline-level skill supplies scientific principles governing planning and verification across all scenarios in the field. Additionally, we introduce MolBench, a benchmark comprising molecular screening, optimization, and end-to-end discovery challenges spanning 8 to 50+ sequential tool calls. MolClaw achieves state-of-the-art performance across all metrics, and ablation studies confirm that gains concentrate on tasks that demand structured workflows while vanishing on those solvable with ad hoc scripting, establishing workflow orchestration competence as the primary capability bottleneck for AI-driven drug discovery.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

MolClaw applies a three-tier skill hierarchy to drug-discovery tool chains and adds a long-horizon benchmark, but the ablations do not cleanly isolate the hierarchy from prompt and baseline differences.

The paper's main move is to wrap an agent around more than thirty chemistry tools using three explicit layers: atomic tool skills, workflow skills that add checks and reflection, and a top discipline layer that encodes scientific rules. They also release MolBench, which forces agents through 8-to-50-step sequences on screening, optimization, and full discovery tasks. That combination is the concrete new piece; prior agent work has hierarchies, but this one is tuned to the length and domain constraints of real molecular pipelines.

The practical result is an agent that can keep going when flat tool-calling loops lose coherence, which matches the stated bottleneck in the field. The setup is straightforward to understand and the benchmark tasks are long enough to expose the problem they claim to solve.

The soft spot is exactly where the stress-test note points: the ablations show gains on structured tasks that disappear on simple ones, yet the paper does not describe whether the baseline agents received the same level of workflow templates, error-handling language, or discipline-level guidance. If the comparison agents were given weaker prompting overall, the measured gap could be explained by prompt quality rather than the explicit three-tier split. The abstract also gives no numbers, error bars, or statistical detail, so the size of any real improvement stays hard to judge from the summary alone.

This paper is aimed at people already building or evaluating agents for scientific tool use, especially in chemistry and early drug discovery. A reader who needs a worked example of multi-level orchestration or a test set with genuinely long tool sequences will get something usable from it.

I would send it to peer review. The benchmark is a clear addition and the architecture is a reasonable engineering response to a known failure mode; the main claims simply need tighter controls on the baselines before they can be taken as settled.

Referee Report

3 major / 2 minor

Summary. The paper introduces MolClaw, an autonomous agent for drug molecule evaluation, screening, and optimization that integrates over 30 specialized resources via a three-tier hierarchical skill architecture (tool-level skills for atomic operations, workflow-level skills for validated pipelines with reflection, and discipline-level skills for scientific principles). It also presents MolBench, a benchmark of molecular tasks requiring 8–50+ sequential tool calls. The central claims are that MolClaw achieves state-of-the-art performance across all metrics on MolBench and that ablation studies show performance gains are concentrated on tasks needing structured workflows while disappearing on ad-hoc-scriptable tasks, identifying workflow orchestration as the key bottleneck for AI in drug discovery.

Significance. If the empirical claims hold after proper controls, the work would provide concrete evidence that explicit hierarchical skill decomposition improves long-horizon reliability in scientific agent workflows, a result with direct implications for AI-assisted drug discovery pipelines. The introduction of MolBench as a standardized, multi-step benchmark would also be a useful community resource. However, the absence of quantitative metrics, baseline details, and statistical tests in the abstract, combined with the stress-test concern on ablation isolation, leaves the magnitude and attribution of the advance currently unverifiable.

major comments (3)

Abstract and §4 (Ablation Studies): The claim that 'gains concentrate on tasks that demand structured workflows while vanishing on those solvable with ad hoc scripting' is load-bearing for the central thesis, yet the manuscript provides no description of how the ad-hoc baseline agents were prompted, whether they received the same 30+ resources, or how reflection/error-handling templates were matched. Without these controls, the performance gap cannot be confidently attributed to the three-tier hierarchy rather than differences in prompting or tool access.
Abstract and §3 (Experimental Setup): No quantitative metrics, error bars, baseline names, or statistical tests are reported even in the abstract, despite the SOTA claim. The full methods section must supply exact performance numbers, number of runs, and comparison agents (including their architecture and prompting) before the SOTA assertion can be evaluated.
§2.2 (Hierarchical Skill Architecture): The discipline-level skill is described as supplying 'scientific principles governing planning and verification,' but the manuscript does not show how these principles are encoded or enforced at runtime versus being implicit in the workflow-level reflection steps. This distinction is necessary to isolate the contribution of the explicit hierarchy.

minor comments (2)

Figure 1 and §2.1: The diagram of the three-tier architecture would benefit from explicit arrows showing runtime control flow between levels and from a legend distinguishing static skill definitions from dynamic invocation.
§5 (Related Work): The comparison to prior agent frameworks (e.g., ReAct, Toolformer) should include a table contrasting skill hierarchy depth, number of tools, and benchmark task length to clarify novelty.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, providing clarifications based on the manuscript content and outlining specific revisions to improve transparency and rigor.

Referee: Abstract and §4 (Ablation Studies): The claim that 'gains concentrate on tasks that demand structured workflows while vanishing on those solvable with ad hoc scripting' is load-bearing for the central thesis, yet the manuscript provides no description of how the ad-hoc baseline agents were prompted, whether they received the same 30+ resources, or how reflection/error-handling templates were matched. Without these controls, the performance gap cannot be confidently attributed to the three-tier hierarchy rather than differences in prompting or tool access.

Authors: We agree that explicit controls are essential for attributing gains to the hierarchy. The ad-hoc baselines received identical access to the full set of over 30 resources and used the same tool-calling format; they differed only by omitting workflow-level composition and discipline-level principle injection, relying on standard ReAct-style prompting without structured reflection templates. To strengthen this, we will expand §4 with a new subsection containing example prompts for each baseline, a configuration comparison table, and confirmation that error-handling was matched at the tool level only. This revision will make the isolation of the hierarchy's contribution fully verifiable.

Revision: yes

Referee: Abstract and §3 (Experimental Setup): No quantitative metrics, error bars, baseline names, or statistical tests are reported even in the abstract, despite the SOTA claim. The full methods section must supply exact performance numbers, number of runs, and comparison agents (including their architecture and prompting) before the SOTA assertion can be evaluated.

Authors: We acknowledge that the abstract prioritizes high-level claims over numbers due to length limits, but the full §3 and §4 already contain performance tables, baseline names (ReAct, Reflexion, and standard LLM agents), and run counts. We will revise the abstract to report key metrics (e.g., success rates with standard deviations from 5 runs per task) and add explicit statistical tests (paired t-tests, p < 0.05) to §3. All baseline architectures and prompting details will be cross-referenced to the new ablation subsection for completeness.

Revision: yes

Referee: §2.2 (Hierarchical Skill Architecture): The discipline-level skill is described as supplying 'scientific principles governing planning and verification,' but the manuscript does not show how these principles are encoded or enforced at runtime versus being implicit in the workflow-level reflection steps. This distinction is necessary to isolate the contribution of the explicit hierarchy.

Authors: The discipline-level skill is implemented as an explicit, callable module containing a fixed set of encoded scientific principles (e.g., ADMET rules, synthetic accessibility heuristics) that are injected via a dedicated runtime function call before planning and verification phases. This operates independently of workflow-level reflection, which only processes execution feedback. We will add pseudocode and a runtime diagram to §2.2 illustrating the distinct call sequence and enforcement mechanism to clearly separate the two levels.

Revision: yes
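
The call sequence described in that last simulated response can be illustrated with a short Python sketch. Everything here is hypothetical: the function names, the two placeholder principles, and the `use_discipline` / `use_reflection` switches are assumptions made for illustration, since neither the paper's pseudocode nor its encoded principles appear in this review.

```python
from typing import Callable, List

# Hypothetical placeholders; the paper's encoded principles are not listed here.
PRINCIPLES = ["respect ADMET rules", "prefer synthetically accessible edits"]


def inject_principles(phase: str, context: List[str]) -> None:
    """Discipline tier: add the fixed principles to the context for one phase."""
    context.extend(f"[{phase}] {rule}" for rule in PRINCIPLES)


def run_task(execute: Callable[[], str], use_discipline: bool = True,
             use_reflection: bool = True):
    """Run one plan/execute/verify cycle and return (trace, context)."""
    trace, context = [], []
    if use_discipline:
        inject_principles("planning", context)
        trace.append("inject:planning")
    trace.append("plan")
    feedback = execute()
    trace.append("execute")
    if use_reflection and feedback.startswith("error"):
        # Workflow tier: reflection reacts to execution feedback only.
        trace.append("reflect")
        feedback = execute()
        trace.append("execute")
    if use_discipline:
        inject_principles("verification", context)
        trace.append("inject:verification")
    trace.append("verify")
    return trace, context


# Simulated tool outcomes: one failure, then success.
outcomes = iter(["error: docking timed out", "ok"])
trace, context = run_task(lambda: next(outcomes))
print(trace)
```

Setting both switches to `False` gives the flat baseline the referee asks about, so the same harness could in principle produce each ablation arm with identical tool access.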

Circularity Check

0 steps flagged

No circularity: new architecture and benchmark evaluated empirically

The paper introduces MolClaw's three-tier hierarchy and MolBench as original contributions, with SOTA claims and ablation results presented as direct empirical outcomes on the new benchmark rather than reductions to prior inputs. No equations, fitted parameters, self-definitional loops, or load-bearing self-citations appear in the derivation of the core claims. Ablation statements about gains vanishing on ad-hoc tasks are framed as experimental observations, not tautological by construction. The chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

Abstract-only review provides no explicit free parameters, axioms, or invented entities beyond the agent and benchmark themselves; full paper would be needed to audit any hidden hyperparameters or domain assumptions about tool reliability.

invented entities (1)

MolClaw hierarchical skill architecture (no independent evidence)
purpose: To enable long-term robust interaction across 30+ domain tools for drug discovery workflows
The three-tier system (tool, workflow, discipline) is presented as the core innovation; no independent evidence outside the paper is supplied in the abstract.

pith-pipeline@v0.9.0 · 5559 in / 1330 out tokens · 35758 ms · 2026-05-13T21:23:01.068824+00:00 · methodology

Reference graph

Works this paper leans on

84 extracted references · 84 canonical work pages · 4 internal anchors

[1]

Gromacs: High performance molecular simulations through multi-level paral- lelism from laptops to supercomputers.SoftwareX, 1:19–25, 2015

Mark James Abraham, Teemu Murtola, Roland Schulz, Szilárd Páll, Jeremy C Smith, Berk Hess, and Erik Lindahl. Gromacs: High performance molecular simulations through multi-level paral- lelism from laptops to supercomputers.SoftwareX, 1:19–25, 2015

work page 2015
[2]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Ale- man, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023. 27

work page internal anchor Pith review Pith/arXiv arXiv 2023
[3]

Alphafold db: Open repository of protein structure predictions

AlphaFold Database Consortium. Alphafold db: Open repository of protein structure predictions. https://alphafold.ebi.ac.uk/, 2024

work page 2024
[4]

The claude model family.https://www.anthropic.com/claude, 2024

Anthropic. The claude model family.https://www.anthropic.com/claude, 2024

work page 2024
[5]

Claude code: a command-line tool for agentic coding.https://docs.anthropic.c om/en/docs/claude-code, 2025

Anthropic. Claude code: a command-line tool for agentic coding.https://docs.anthropic.c om/en/docs/claude-code, 2025

work page 2025
[6]

Liddia: Language-based intelligent drug discovery agent

Reza Averly, Frazier N Baker, Ian A Watson, and Xia Ning. Liddia: Language-based intelligent drug discovery agent. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 12015–12039, 2025

work page 2025
[7]

Qwen Technical Report

Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report.arXiv preprint arXiv:2309.16609, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[8]

Artificial intelligence in drug discovery: what is realistic, what are illusions? part 1: Ways to make an impact, and why we are not there yet

Andreas Bender and Isidro Cortés-Ciriano. Artificial intelligence in drug discovery: what is realistic, what are illusions? part 1: Ways to make an impact, and why we are not there yet. Drug discovery today, 26(2):511–524, 2021

work page 2021
[9]

Quantifying the chemical beauty of drugs.Nature Chemistry, 4:90–98, 2012

G Richard Bickerton, Gaia V Paolini, Jérémy Besnard, Sorel Muresan, and Andrew L Hopkins. Quantifying the chemical beauty of drugs.Nature Chemistry, 4:90–98, 2012

work page 2012
[10]

Autonomous chemical research with large language models.Nature, 624(7992):570–578, 2023

Daniil A Boiko, Robert MacKnight, Ben Kline, and Gabe Gomes. Autonomous chemical research with large language models.Nature, 624(7992):570–578, 2023

work page 2023
[11]

Prolif: a library to encode molecular interactions as fingerprints.Journal of cheminformatics, 13(1):72, 2021

Cédric Bouysset and Sébastien Fiorucci. Prolif: a library to encode molecular interactions as fingerprints.Journal of cheminformatics, 13(1):72, 2021

work page 2021
[12]

Evobind: in silico directed evolution of peptide binders with alphafold.bioRxiv, pages 2022–07, 2022

Patrick Bryant and Arne Elofsson. Evobind: in silico directed evolution of peptide binders with alphafold.bioRxiv, pages 2022–07, 2022

work page 2022
[13]

Generic protein–ligand interaction scoring by inte- grating physical prior knowledge and data augmentation modelling.Nature Machine Intelligence, 6(6):688–700, 2024

Duanhua Cao, Geng Chen, Jiaxin Jiang, Jie Yu, Runze Zhang, Mingan Chen, Wei Zhang, Lifan Chen, Feisheng Zhong, Yingying Zhang, et al. Generic protein–ligand interaction scoring by inte- grating physical prior knowledge and data augmentation modelling.Nature Machine Intelligence, 6(6):688–700, 2024

work page 2024
[14]

Mozi: Governed autonomy for drug discovery llm agents.arXiv preprint arXiv:2603.03655, 2026

He Cao, Siyu Liu, Fan Zhang, Zijing Liu, Hao Li, Bin Feng, Shengyuan Bai, Leqing Chen, Kai Xie, and Yu Li. Mozi: Governed autonomy for drug discovery llm agents.arXiv preprint arXiv:2603.03655, 2026

work page arXiv 2026
[15]

Chembl: The global bioactivity database for drug discovery.https: //www.ebi.ac.uk/chembl/, 2024

ChEMBL Consortium. Chembl: The global bioactivity database for drug discovery.https: //www.ebi.ac.uk/chembl/, 2024

work page 2024
[16]

Diffdock: Diffusion steps, twists, and turns for molecular docking.arXiv preprint arXiv:2210.01776, 2022

Gabriele Corso, Hannes Stärk, Bowen Jing, Regina Barzilay, and Tommi Jaakkola. Diffdock: Diffusion steps, twists, and turns for molecular docking.arXiv preprint arXiv:2210.01776, 2022

work page arXiv 2022
[17]

Robust deep learning–based protein sequence design using proteinmpnn.Science, 378(6615):49–56, 2022

Justas Dauparas, Ivan Anishchenko, Nathaniel Bennett, Hua Bai, Robert J Ragotte, Lukas F Milles, Basile IM Wicky, Alexis Courbet, Rob J de Haas, Neville Bethel, et al. Robust deep learning–based protein sequence design using proteinmpnn.Science, 378(6615):49–56, 2022

work page 2022
[18]

Pymol: An open-source molecular graphics tool.CCP4 Newsl

Warren L DeLano et al. Pymol: An open-source molecular graphics tool.CCP4 Newsl. protein crystallogr, 40(1):82–92, 2002

work page 2002
[19]

Leading ai-driven drug discovery platforms: 2025 landscape and global outlook

Mahendiran Dharmasivam, Busra Kaya, Adedoyin Akinware, Mahan Gholam Azad, and Des R Richardson. Leading ai-driven drug discovery platforms: 2025 landscape and global outlook. Pharmacological Reviews, page 100102, 2025

work page 2025
[20]

Vina-gpu 2.0: further accelerating autodock vina and its derivatives with graphics processing units.Journal of chemical information and modeling, 63(7):1982–1998, 2023

Ji Ding, Shidi Tang, Zheming Mei, Lingyue Wang, Qinqin Huang, Haifeng Hu, Ming Ling, and Jiansheng Wu. Vina-gpu 2.0: further accelerating autodock vina and its derivatives with graphics processing units.Journal of chemical information and modeling, 63(7):1982–1998, 2023. 28

work page 1982
[21]

Openmm 7: Rapid development of high performance algorithms for molecular dynamics.PLoS computational biology, 13(7):e1005659, 2017

Peter Eastman, Jason Swails, John D Chodera, Robert T McGibbon, Yutong Zhao, Kyle A Beauchamp, Lee-Ping Wang, Andrew C Simmonett, Matthew P Harrigan, Chaya D Stern, et al. Openmm 7: Rapid development of high performance algorithms for molecular dynamics.PLoS computational biology, 13(7):e1005659, 2017

work page 2017
[22]

Autodock vina 1.2

Jerome Eberhardt, Diogo Santos-Martins, Andreas F Tillack, and Stefano Forli. Autodock vina 1.2. 0: new docking methods, expanded force field, and python bindings.Journal of chemical information and modeling, 61(8):3891–3898, 2021

work page 2021
[23]

Internagent-1.5: A unified agentic framework for long-horizon autonomous scientific discovery.arXiv preprint arXiv:2602.08990, 2026

Shiyang Feng, Runmin Ma, Xiangchao Yan, Yue Fan, Yusong Hu, Songtao Huang, Shuaiyu Zhang, Zongsheng Cao, Tianshuo Peng, Jiakang Yuan, et al. Internagent-1.5: A unified agentic framework for long-horizon autonomous scientific discovery.arXiv preprint arXiv:2602.08990, 2026

work page arXiv 2026
[24]

Glide: a new approach for rapid, accurate docking and scoring

Richard A Friesner, Jay L Banks, Robert B Murphy, Thomas A Halgren, Jasna J Klicic, Daniel T Mainz, Matthew P Repasky, Eric H Knoll, Mee Shelley, Jason K Perry, et al. Glide: a new approach for rapid, accurate docking and scoring. 1. method and assessment of docking accuracy. Journal of medicinal chemistry, 47(7):1739–1749, 2004

work page 2004
[25]

Empowering biomedical discovery with ai agents.Cell, 187(22):6125–6151, 2024

Shanghua Gao, Ada Fang, Yepeng Huang, Valentina Giunchiglia, Ayush Noori, Jonathan Richard Schwarz, Yasha Ektefaie, Jovana Kondic, and Marinka Zitnik. Empowering biomedical discovery with ai agents.Cell, 187(22):6125–6151, 2024

work page 2024
[26]

Txagent: an ai agent for therapeutic reasoning across a universe of tools.arXiv preprint arXiv:2503.10970, 2025

Shanghua Gao, Richard Zhu, Zhenglun Kong, Ayush Noori, Xiaorui Su, Curtis Ginder, Theodoros Tsiligkaridis, and Marinka Zitnik. Txagent: an ai agent for therapeutic reasoning across a universe of tools.arXiv preprint arXiv:2503.10970, 2025

work page arXiv 2025
[27]

Democratising real-world drug discovery through agentic ai.Drug Discovery Today, page 104605, 2026

Jiazhen He, Helen Lai, Lakshidaa Saigiridharan, Gian Marco Ghiandoni, Kinga Jenei, Umur Gokalp, Ajsa Nukovic, Ola Engkvist, Jon Paul Janet, and Samuel Genheden. Democratising real-world drug discovery through agentic ai.Drug Discovery Today, page 104605, 2026

work page 2026
[28]

Biomni: A general-purpose biomedical ai agent.biorxiv, 2025

Kexin Huang, Serena Zhang, Hanchen Wang, Yuanhao Qu, Yingzhou Lu, Yusuf Roohani, Ryan Li, Lin Qiu, Gavin Li, Junze Zhang, et al. Biomni: A general-purpose biomedical ai agent.biorxiv, 2025

work page 2025
[29]

Illuminating protein space with a programmable generative model.Nature, 623(7989):1070–1078, 2023

John B Ingraham, Max Baranov, Zak Costello, Karl W Barber, Wujie Wang, Ahmed Ismail, VincentFrappier, DanaMLord, ChristopherNg-Thow-Hing, ErikRVanVlack, etal. Illuminating protein space with a programmable generative model.Nature, 623(7989):1070–1078, 2023

work page 2023
[30]

Scp: Accelerating discovery with a global web of autonomous scientific agents.arXiv preprint arXiv:2512.24189, 2025

Yankai Jiang, Wenjie Lou, Lilong Wang, Zhenyu Tang, Shiyang Feng, Jiaxuan Lu, Haoran Sun, Yaning Pan, Shuang Gu, Haoyang Su, et al. Scp: Accelerating discovery with a global web of autonomous scientific agents.arXiv preprint arXiv:2512.24189, 2025

work page arXiv 2025
[31]

Deepsite: protein-binding site predictor using 3d-convolutional neural networks.Bioinformatics, 33(19):3036–3042, 2017

José Jiménez, Stefan Doerr, Gerard Martínez-Rosell, Alexander S Rose, and Gianni De Fabritiis. Deepsite: protein-binding site predictor using 3d-convolutional neural networks.Bioinformatics, 33(19):3036–3042, 2017

work page 2017
[32]

Highly accurate protein structure prediction with alphafold.nature, 596(7873):583–589, 2021

JohnJumper, RichardEvans, AlexanderPritzel, TimGreen, MichaelFigurnov, OlafRonneberger, Kathryn Tunyasuvunakool, Russ Bates, Augustin Žídek, Anna Potapenko, et al. Highly accurate protein structure prediction with alphafold.nature, 596(7873):583–589, 2021

work page 2021
[33]

P2rank: machine learning based tool for rapid and accurate prediction of ligand binding sites from protein structure.Journal of cheminformatics, 10(1):39, 2018

Radoslav Krivák and David Hoksza. P2rank: machine learning based tool for rapid and accurate prediction of ligand binding sites from protein structure.Journal of cheminformatics, 10(1):39, 2018

work page 2018
[34]

Fpocket: an open source platform for ligand pocket detection.BMC bioinformatics, 10(1):168, 2009

Vincent Le Guilloux, Peter Schmidtke, and Pierre Tuffery. Fpocket: an open source platform for ligand pocket detection.BMC bioinformatics, 10(1):168, 2009

[35] Sarah Lewis, Tim Hempel, José Jiménez-Luna, Michael Gastegger, Yu Xie, Andrew YK Foong, Victor García Satorras, Osama Abdin, Bastiaan S Veeling, Iryna Zaporozhets, et al. Scalable emulation of protein equilibrium ensembles with generative deep learning. Science, 389(6761):eadv9817, 2025

[36] Hao Li, He Cao, Bin Feng, Yanjun Shao, Xiangru Tang, Zhiyuan Yan, Li Yuan, Yonghong Tian, and Yu Li. Beyond chemical qa: Evaluating llm’s chemical reasoning with modular chemical operations. arXiv preprint arXiv:2505.21318, 2025

[37] Zeming Lin, Halil Akin, Roshan Rao, Brian Hie, Zhongkai Zhu, Wenting Lu, Nikita Smetanin, Robert Verkuil, Ori Kabeli, Yaniv Shmueli, et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science, 379(6637):1123–1130, 2023

[38] Sizhe Liu, Yizhou Lu, Siyu Chen, Xiyang Hu, Jieyu Zhao, Yingzhou Lu, and Yue Zhao. Drugagent: Automating ai-aided drug discovery programming through llm multi-agent collaboration. arXiv preprint arXiv:2411.15692, 2024

[39] Hannes H Loeffler, Jiazhen He, Alessandro Tibo, Jon Paul Janet, Alexey Voronov, Lewis H Mervin, and Ola Engkvist. Reinvent 4: Modern ai–driven generative molecule design. Journal of Cheminformatics, 16(1):20, 2024

[40] Wei Lu, Carlos Bueno, Nicholas P Schafer, Joshua Moller, Shikai Jin, Xun Chen, Mingchen Chen, Xinyu Gu, Aram Davtyan, Juan J de Pablo, et al. Openawsem with open3spn2: A fast, flexible, and accessible framework for large-scale coarse-grained biomolecular simulations. PLoS computational biology, 17(2):e1008308, 2021

[41] Andres M. Bran, Sam Cox, Oliver Schilter, Carlo Baldassari, Andrew D White, and Philippe Schwaller. Augmenting large language models with chemistry tools. Nature machine intelligence, 6(5):525–535, 2024

[42] Grégoire Mialon, Roberto Dessì, Maria Lomeli, Christoforos Nalmpantis, Ram Pasunuru, Roberta Raileanu, Baptiste Rozière, Timo Schick, Jane Dwivedi-Yu, Asli Celikyilmaz, et al. Augmented language models: a survey. arXiv preprint arXiv:2302.07842, 2023

[43] OpenClaw Contributors. Openclaw: an open-source framework for building tool-augmented llm agents. https://github.com/openclaw, 2025

[44] Qihua Pan, Dong Xu, Jenna Xinyi Yao, Lijia Ma, Zexuan Zhu, and Junkai Ji. Frogent: An end-to-end full-process drug design agent. arXiv preprint arXiv:2508.10760, 2025

[45] Saro Passaro, Gabriele Corso, Jeremy Wohlwend, Mateo Reveiz, Stephan Thaler, Vignesh Ram Somnath, Noah Getz, Tally Portnoi, Julien Roy, Hannes Stark, et al. Boltz-2: Towards accurate and efficient binding affinity prediction. BioRxiv, 2025

[46] PubChem Consortium. Pubchem: Open chemistry database. https://pubchem.ncbi.nlm.nih.gov/, 2025

[47] RCSB PDB Consortium. Rcsb pdb: Research collaboratory for structural bioinformatics protein data bank. https://www.rcsb.org/, 2024

[48] RDKit Consortium. Rdkit: Open-source cheminformatics software. https://www.rdkit.org/, 2024

[49] Piotr Rotkiewicz and Jeffrey Skolnick. Fast procedure for reconstruction of full-atom protein models from reduced representations. Journal of computational chemistry, 29(9):1460–1465, 2008

[50] Anastasiia V Sadybekov and Vsevolod Katritch. Computational approaches streamlining drug discovery. Nature, 616(7958):673–685, 2023

[51] Sebastian Salentin, Sven Schreiber, V Joachim Haupt, Melissa F Adasme, and Michael Schroeder. Plip: fully automated protein–ligand interaction profiler. Nucleic acids research, 43(W1):W443–W447, 2015

[52] Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. Advances in neural information processing systems, 36:68539–68551, 2023

[53] Peter Schmidtke and Xavier Barril. Understanding and predicting druggability. a high-throughput method for detection of drug binding sites. Journal of medicinal chemistry, 53(15):5858–5867, 2010

[54] Duxin Sun, Wei Gao, Hongxiang Hu, and Simon Zhou. Why 90% of clinical drug development fails and how to improve it? Acta Pharmaceutica Sinica B, 12(7):3049–3062, 2022

[55] Kyle Swanson, Parker Walther, Jeremy Leitz, Souhrid Mukherjee, Joseph C Wu, Rabindra V Shivnaraine, and James Zou. Admet-ai: a machine learning admet platform for evaluation of large-scale chemical libraries. Bioinformatics, 40(7):btae416, 2024

[56] Chai Discovery team, Jacques Boitreaud, Jack Dent, Matthew McPartlon, Joshua Meier, Vinicius Reis, Alex Rogozhonikov, and Kevin Wu. Chai-1: Decoding the molecular interactions of life. BioRxiv, pages 2024–10, 2024

[57] Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: A family of highly capable multimodal models. arxiv 2023. arXiv preprint arXiv:2312.11805, 2024

[58] The UniProt Consortium. Uniprot: The universal protein knowledgebase. https://www.uniprot.org/, 2025

[59] Tingzhong Tian, Shuya Li, Ziting Zhang, Lin Chen, Ziheng Zou, Dan Zhao, and Jianyang Zeng. Benchmarking compound activity prediction for real-world drug discovery applications. Communications Chemistry, 7(1):127, 2024

[60] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023

[61] Mario S Valdés-Tresanco, Mario E Valdés-Tresanco, Pedro A Valiente, and Ernesto Moreno. gmx_mmpbsa: a new tool to perform end-state free energy calculations with gromacs. Journal of chemical theory and computation, 17(10):6281–6291, 2021

[62] Jessica Vamathevan, Dominic Clark, Paul Czodrowski, Ian Dunham, Edgardo Ferran, George Lee, Bin Li, Anant Madabhushi, Parantu Shah, Michaela Spitzer, et al. Applications of machine learning in drug discovery and development. Nature reviews Drug discovery, 18(6):463–477, 2019

[63] Ivana Vichentijevikj, Kostadin Mishev, and Monika Simjanoska Misheva. Prompt-to-pill: Multi-agent drug discovery and clinical simulation pipeline. Bioinformatics Advances, 6(1):vbaf323, 2026

[64] Luis J Walter, Patrick K Quoika, and Martin Zacharias. Structure-based protein assembly simulations including various binding sites and conformations. Journal of Chemical Information and Modeling, 64(8):3465–3476, 2024

[65] Hanchen Wang, Tianfan Fu, Yuanqi Du, Wenhao Gao, Kexin Huang, Ziming Liu, Payal Chandak, Shengchao Liu, Peter Van Katwyk, Andreea Deac, et al. Scientific discovery in the age of artificial intelligence. Nature, 620(7972):47–60, 2023

[66] Olivier J Wouters, Martin McKee, and Jeroen Luyten. Estimated research and development investment needed to bring a new medicine to market, 2009-2018. Jama, 323(9):844–853, 2020

[67] Yumeng Yan, Huanyu Tao, Jiahua He, and Sheng-You Huang. The hdock server for integrated protein–protein docking. Nature protocols, 15(5):1829–1852, 2020

[68] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. In The eleventh international conference on learning representations, 2022

[69] Xujun Zhang, Odin Zhang, Chao Shen, Wanglin Qu, Shicheng Chen, Hanqun Cao, Yu Kang, Zhe Wang, Ercheng Wang, Jintu Zhang, et al. Efficient and accurate large library ligand docking with karmadock. Nature Computational Science, 3(9):789–804, 2023

[70] Ziqiao Zhang, Bangyi Zhao, Ailin Xie, Yatao Bian, and Shuigeng Zhou. Activity cliff prediction: Dataset and benchmark. arXiv preprint arXiv:2302.07541, 2023

[71] Jie Zhu, Jingxiang Wang, Xin Wang, Mingjing Gao, Bingbing Guo, Miaomiao Gao, Jiarui Liu, Yanqiu Yu, Liang Wang, Weikaixin Kong, et al. Prediction of drug efficacy from transcriptional profiles with deep learning. Nature biotechnology, 39(11):1444–1452, 2021

Data Availability. Both the MolBench dataset (CSV format) and associated evaluation code can be acc...

[72] Predicting the optimization ceiling from scaffold topology. A central finding of this study is that the triazolo-benzodiazepine scaffold imposes a hard upper bound on the achievable QED score. Here we derive this bound analytically. QED is defined as a weighted geometric mean of eight component desirability functions $d_i$ (ref. [9]): $\mathrm{QED} = \exp\big(\sum_{i=1}^{8} w_i \ln d_i$ ...
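The excerpt above breaks off partway through the formula, so the following is only a minimal sketch of a weighted geometric mean of desirabilities. It assumes the commonly used normalized form, in which the weighted log-sum is divided by the total weight; that normalization, the function names, and the example values are assumptions for illustration and are not taken from the source.

```python
import math


def weighted_geometric_mean(desirabilities, weights):
    """Return exp(sum(w_i * ln d_i) / sum(w_i)) for positive d_i."""
    if len(desirabilities) != len(weights):
        raise ValueError("need exactly one weight per desirability")
    if any(d <= 0 for d in desirabilities):
        raise ValueError("desirabilities must be strictly positive")
    total_weight = sum(weights)
    log_sum = sum(w * math.log(d) for d, w in zip(desirabilities, weights))
    return math.exp(log_sum / total_weight)


def ceiling_with_capped_component(cap, cap_weight, total_weight):
    """Upper bound on the score when one component cannot exceed `cap`
    and every other component sits at its maximum of 1.0."""
    return cap ** (cap_weight / total_weight)


# Hypothetical two-component example with equal weights.
example_score = weighted_geometric_mean([0.25, 1.0], [1.0, 1.0])
example_ceiling = ceiling_with_capped_component(0.25, 1.0, 2.0)
```

Under these assumptions, a component that a scaffold pins below 1.0 caps the overall score no matter how well the remaining components are optimized, which is the kind of bound the excerpt describes.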

[73] Tanimoto budget exhaustion as a convergence diagnostic. In our main Results we noted that the qualification rate (the fraction of generated molecules satisfying the Tanimoto ≥ 0.40 constraint) declined from 100% (R1–R2) to 57.6% (R5). We propose that this declining rate constitutes a generalizable convergence diagnostic that we term “Tanimoto budget exhaustion...
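The qualification rate described above can be sketched as follows. This is an illustrative sketch, not the authors' code: fingerprints are represented as plain Python sets of on-bit indices (an assumption; a cheminformatics toolkit would normally supply them), and the function names and example fingerprints are hypothetical. Only the 0.40 threshold comes from the text.

```python
def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of on-bits."""
    a, b = set(a), set(b)
    union = len(a | b)
    # Two empty fingerprints share nothing; treat their similarity as 0.
    return len(a & b) / union if union else 0.0


def qualification_rate(reference, candidates, threshold=0.40):
    """Fraction of candidates whose similarity to the reference is at
    least `threshold`."""
    if not candidates:
        return 0.0
    kept = sum(1 for c in candidates if tanimoto(reference, c) >= threshold)
    return kept / len(candidates)


# Hypothetical fingerprints: similarities are 1.0, 0.6, 1/3 and 0.0.
reference_fp = {1, 2, 3, 4}
round_fps = [{1, 2, 3, 4}, {1, 2, 3, 9}, {1, 2, 5, 6}, {7, 8}]
rate = qualification_rate(reference_fp, round_fps)
```

Computing this rate once per round and watching it fall gives the diagnostic signal the excerpt refers to.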

[74] Phase transitions versus gradual improvement in property optimization. Not all QED components improved gradually. The structural alerts (ALERTS) desirability exhibited a discontinuous phase transition: 0.241 at R0 to 0.842 at R1, with no further change in R2–R5. This occurred because the starting molecule’s butyl ester triggered two Brenk structural alerts...

[75] The interpretability–efficiency trade-off in generative design. The evaluation question instructed the agent to “propose 3–5 modified molecules with chemical rationale.” Instead, the agent employed REINVENT4 batch generation to produce 23–54 candidates per round and selected the best by QED ranking. This substitution raises a fundamental question about A...

[76] Systematic blind spot detection in multi-objective optimization. The agent’s failure to detect the AMES mutagenicity deterioration (+180%, from 0.165 to 0.462) despite tracking over 13 ADMET endpoints illustrates a general vulnerability of attention-based monitoring. The agent explicitly tracked CYP3A4, hERG and DILI at each round, all of which improved...
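One way to guard against this kind of blind spot is to compare every predicted endpoint between rounds rather than a hand-picked subset. The sketch below is illustrative only: it assumes each endpoint is a risk score where higher is worse, the 10% tolerance and the function names are hypothetical, and the hERG and DILI values are made-up placeholders. Only the AMES values (0.165 and 0.462) come from the text.

```python
def relative_changes(baseline, current):
    """Relative change per endpoint, (current - baseline) / baseline,
    for endpoints present in both rounds with a nonzero baseline."""
    return {
        name: (current[name] - baseline[name]) / baseline[name]
        for name in baseline
        if name in current and baseline[name] != 0
    }


def deteriorated(baseline, current, tolerance=0.10):
    """Endpoints whose score rose by more than `tolerance` (relative),
    assuming a higher score means a worse prediction."""
    changes = relative_changes(baseline, current)
    return sorted(name for name, delta in changes.items() if delta > tolerance)


# AMES values are from the text; hERG and DILI are hypothetical.
round_start = {"AMES": 0.165, "hERG": 0.50, "DILI": 0.60}
round_end = {"AMES": 0.462, "hERG": 0.40, "DILI": 0.45}
flagged = deteriorated(round_start, round_end)
```

With these inputs the AMES change works out to (0.462 - 0.165) / 0.165 = 1.8, matching the +180% quoted above, and AMES is the only endpoint flagged.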

[77] Cost-effectiveness and practical stopping rules. The diminishing returns pattern (Fig. 7H) has direct implications for computational resource allocation. Assuming roughly equal tool-call costs per round, and measuring against the R0-to-R4 improvement (+0.4216) since R4 is the recommended molecule, R1 delivers 83.0%, R1–R2 deliver 88.1%, R1–R3 deliver 94.0%, and R1–R4 ...
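The cumulative shares quoted above follow from dividing each round's gain over the starting score by the total gain at the reference round. The sketch below shows that arithmetic together with a simple stopping rule. The per-round scores are hypothetical (the source gives only the percentages and the total), and the function names and the 70% target are assumptions for illustration.

```python
def cumulative_gain_shares(scores, reference_round):
    """Share of the round-0-to-reference improvement achieved by the end
    of each round 1..reference_round. `scores[r]` is the best score
    after round r, with round 0 being the starting molecule."""
    base = scores[0]
    total_gain = scores[reference_round] - base
    if total_gain <= 0:
        raise ValueError("reference round must improve on round 0")
    return [(scores[r] - base) / total_gain
            for r in range(1, reference_round + 1)]


def first_round_reaching(shares, target):
    """1-based index of the first round whose cumulative share meets
    `target`, or None if no round does."""
    for round_number, share in enumerate(shares, start=1):
        if share >= target:
            return round_number
    return None


# Hypothetical best scores after rounds 0..3.
best_scores = [0.25, 0.5, 0.625, 0.75]
shares = cumulative_gain_shares(best_scores, reference_round=3)
stop_round = first_round_reaching(shares, target=0.70)
```

If every round costs roughly the same, as the excerpt assumes, the first round that meets a chosen share is a natural candidate for stopping early.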

[78] Multi-round iterative optimization as an emergent agent capability. The E2E-Q3 task required the AI agent to execute an iterative closed-loop optimization cycle (Strategize, Generate, Dock, Evaluate) across up to 15 rounds, with autonomous decision-making at each round boundary. This represents a fundamentally different challenge from single-step computat...

[79] Long-range planning, self-repair, and emergent medicinal chemistry knowledge. The agent autonomously authored four pipeline versions (v1–v4, totaling 163 KB of Python), progressively diagnosing and recovering from crashes: v1 failed due to NumPy/RDKit incompatibility, v2 succeeded through R1 but crashed on an f-string bug, v3 resumed from R2 with pre-pro...

[80] Agent versus REINVENT: complementary collaboration rather than competition. The 3:3 tie in round winners and the non-significant pooled comparison (p = 0.104) mask a deeper complementarity. REINVENT excelled at creative molecular recombination: it serendipitously discovered methoxy shortening in R1 (not hypothesized by the agent), generated the F+OH+CH3 m...


Showing first 80 references.