pith. machine review for the scientific record.

arxiv: 2605.12181 · v1 · submitted 2026-05-12 · 💻 cs.AI

Recognition: no theorem link

MolDeTox: Evaluating Language Model's Stepwise Fragment Editing for Molecular Detoxification

Pith reviewed 2026-05-13 05:19 UTC · model grok-4.3

classification 💻 cs.AI
keywords molecular detoxification, fragment editing, language models, molecular optimization, structural validity, benchmark evaluation, toxicity assessment, drug discovery

The pith

Fragment-level molecule editing in language models raises structural validity and quality for detoxification tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MolDeTox as a benchmark that breaks molecular detoxification into stepwise fragment-editing tasks for language models and vision-language models. It establishes that processing molecules at the fragment level rather than as complete structures leads to higher structural validity and better overall quality in the generated outputs. This matters because prior toxicity repair methods suffer from low data diversity, frequent invalid structures, and heavy reliance on proxy toxicity checks. The benchmark further supplies task-by-task breakdowns that make the detoxification steps more interpretable across different models and prompting styles.

Core claim

MolDeTox evaluates general-purpose LLMs and VLMs on stepwise fragment editing for molecular detoxification. The central claim is that understanding and generating molecules at the fragment level improves structural validity and enhances the quality of generated molecules while supplying an interpretable benchmark for the detoxification process.

What carries the argument

The MolDeTox benchmark, built around stepwise fragment-editing tasks that isolate toxicity removal while preserving other molecular properties.
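
A minimal sketch of what a single fragment-editing step might look like on a dot-separated SAFE string, using the hERG example quoted in the Figure 8 caption on this page. The helper name `swap_fragment` is hypothetical, and the sketch assumes an edit is a plain substitution of one fragment whose attachment-point digits are kept unchanged; the paper's actual pipeline may work differently.

```python
def swap_fragment(safe: str, old: str, new: str) -> str:
    """Replace one dot-separated fragment of a SAFE string (hypothetical helper)."""
    fragments = safe.split(".")
    if old not in fragments:
        raise ValueError(f"fragment {old!r} not found in {safe!r}")
    # Edit only the first match so repeated fragments are not all rewritten.
    fragments[fragments.index(old)] = new
    return ".".join(fragments)


# Strings copied from the Figure 8 caption (toxic molecule ID 191, hERG endpoint).
toxic_safe = "Nc1nc(=O)n4cc1F.[C@@H]14CS[C@H]3O1.C3O"
edited = swap_fragment(toxic_safe, old="Nc1nc(=O)n4cc1F", new="Nc1ccn4c(=O)n1")
print(edited)  # Nc1ccn4c(=O)n1.[C@@H]14CS[C@H]3O1.C3O
```

Because the untouched fragments are copied verbatim, only the replaced fragment can introduce a syntax error, which is one plausible reading of why fragment-level editing would help structural validity.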

If this is right

  • Fragment-level processing produces a higher fraction of chemically valid detoxified molecules than whole-structure approaches.
  • Task-level analysis isolates which detoxification steps models handle reliably and where they fail.
  • The same general-purpose models show measurable gains when the input format shifts to fragments.
  • The benchmark enables direct comparison of LLMs and VLMs under consistent toxicity-aware conditions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The fragment approach could transfer to other molecular optimization goals such as potency or solubility changes.
  • Wider use of stepwise fragment benchmarks might lower dependence on full experimental validation during early screening.
  • Future versions would benefit from pairing the proxy toxicity scores with at least a subset of real assay results to test alignment.
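
The alignment test suggested in the last bullet could be sketched as follows: compare binary proxy predictions with binary assay labels on the same molecules and report raw agreement plus Cohen's kappa, which corrects for chance agreement. The function name `agreement` and the label lists are illustrative assumptions (made-up toy data, not results from the paper).

```python
def agreement(proxy: list[int], assay: list[int]) -> tuple[float, float]:
    """Return (accuracy, Cohen's kappa) for two binary label lists (1 = toxic)."""
    if len(proxy) != len(assay) or not proxy:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(proxy)
    observed = sum(p == a for p, a in zip(proxy, assay)) / n
    # Chance agreement computed from each rater's marginal rate of the toxic class.
    p_pos, a_pos = sum(proxy) / n, sum(assay) / n
    expected = p_pos * a_pos + (1 - p_pos) * (1 - a_pos)
    if expected == 1.0:
        return observed, float("nan")  # kappa is undefined when both are constant
    return observed, (observed - expected) / (1 - expected)


# Toy labels for eight hypothetical molecules.
proxy_labels = [1, 1, 0, 0, 1, 0, 1, 0]
assay_labels = [1, 0, 0, 0, 1, 0, 1, 1]
accuracy, kappa = agreement(proxy_labels, assay_labels)
print(accuracy, kappa)  # 0.75 0.5
```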

Load-bearing premise

Proxy models for toxicity assessment are accurate enough and the benchmark's added data diversity and validity metrics fully overcome the shortcomings of earlier toxicity repair tests.

What would settle it

Running the same models on MolDeTox tasks with whole-molecule editing instead of fragment steps and finding no gain in structural validity or quality scores.

Figures

Figures reproduced from arXiv: 2605.12181 by Jaewoo Kang, Jiwoo Lee, Jueon Park, Wonjune Jang, Yein Park.

Figure 1: MolDeTox Overview: Performance comparison of LLMs and VLMs across three detoxification tasks, evaluated using accuracy (%)
…cardiotoxicity [10]. Nevertheless, there have been efforts that focus exclusively on toxicity, such as ToxiMol [13], which evaluates the molecule repair capabilities of multimodal large language models (MLLMs) across multiple toxicity endpoints. Despite its contributions, this benchma…
Figure 2: Pipeline of MolDeTox. The upper panel illustrates the …
Figure 3: Property distributions of toxic and non-toxic …
Figure 4: GPT-5.2 4-shot
Figure 5: Example of Pattern T1=1, T2=1, T3=1 (C111). In successful cases, final outcomes are typically accompanied by at least partially correct intermediate steps. These samples are concentrated in C111 (0.5053) and C101 (0.2368), showing that Task 3 success most often arises either from fully correct step-wise execution or from cases where correct toxic-fragment identification alone is sufficient to support th…
Figure 6: Structural alert overlap for the top-20 Ames Mutagenicity toxicity-associated fragments.
Figure 7: Structural alert overlap for the top-20 Skin Reaction toxicity-associated fragments.
Figure 8: Example of C000. Toxic Molecule ID: 191 - Endpoint: - hERG - SMILES: - Nc1nc(=O)n([C@@H]2CS[C@H](CO)O2)cc1F - SAFE: - Nc1nc(=O)n4cc1F.[C@@H]14CS[C@H]3O1.C3O Case100 Task1 Task2 Task3 Gold Answer Pred Answer Nc1nc(=O)n4cc1F Nc1ccn4c(=O)n1 Nc1nc(=O)n4cc1F Nc1ccn4c(=O)n1.[C@@H]14CS[C@H]3O1.C3O Nc1nc(=O)n4cc1Br Nc1nc(=O)n4cc1Cl.[C@@H]14CS[C@H]3O1.C3O
Figure 9: Example of C100
Figure 10: Example of C010. Toxic Molecule ID: 1005 - Endpoint: - CYP P450 3A4 Inhibition - SMILES: - C(C1=C(C)CCCC1(C)C)=CC(C)=CC=CC(C)=CC(=O)O - SAFE: - C=3C1=C(C)CCCC1(C)C.CC=2C=3.C=4C=5C.C=2C=4.C=5C(=O)O Case110 Task1 Task2 Task3 Gold Answer Pred Answer C=5C(=O)O C=5CO C=5CO C=5C(=O)O C=3C1=C(C)CCCC1(C)C. … .C=2C=4.C=5CO C=3C1=C(C)CCCC1(C)C. … .C=2C=4.C=5C(=O)OC
Figure 11: Example of C110.
Figure 12: Example of C001. Case101 Task1 Task2 Task3 Gold Answer Pred Answer N16CCS(=O)(=O)CC1 N16CCOCC1 [C@H]19CNCC[C… .N16CCOCC1. … .CO5.C7%12 N16CCS(=O)(=O)CC1 N16CCCS(=O)(=O)CC1 [C@H]19CNCC[C… .N16CCOCC1. … .CO5.C7%12 Toxic Molecule ID: 358 - Endpoint: - hERG - SMILES: - [C@H]1(C(N(C2CC2)…CC3)cc(CCCOC)c2)=O)C…cc(F)c(F)cc12 - SAFE: - [C@H]19CNCC[C@@]12O….N16CCS(=O)(=O)CC1. … .CO5.C7%12
Figure 13: Example of C101.
Figure 14: Example of C011.
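
The pattern codes in the captions above (C000 through C111) can be tallied with a few lines of code. This is a sketch under the assumption, suggested by the Figure 5 caption, that each digit is the correctness flag for Tasks 1, 2 and 3 in that order, and that the reported fractions are taken over samples where Task 3 succeeded. The function names and the toy results are hypothetical.

```python
from collections import Counter


def pattern_code(t1: bool, t2: bool, t3: bool) -> str:
    """Encode per-task correctness as a code such as 'C101'."""
    return f"C{int(t1)}{int(t2)}{int(t3)}"


def pattern_fractions(results: list[tuple[bool, bool, bool]]) -> dict[str, float]:
    """Fraction of samples falling into each pattern code."""
    counts = Counter(pattern_code(*r) for r in results)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {code: n / total for code, n in sorted(counts.items())}


def task3_success_fractions(results: list[tuple[bool, bool, bool]]) -> dict[str, float]:
    """Pattern fractions restricted to samples where Task 3 was correct."""
    return pattern_fractions([r for r in results if r[2]])


# Toy results: (Task 1 correct, Task 2 correct, Task 3 correct) per sample.
toy = [(True, True, True), (True, True, True), (True, False, True),
       (False, False, True), (True, True, False), (False, False, False)]
print(task3_success_fractions(toy))  # {'C001': 0.25, 'C101': 0.25, 'C111': 0.5}
```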
read the original abstract

Large Language Models (LLMs) and Vision Language Models (VLMs) have recently shown promising capabilities in various scientific domains. In particular, these advances have opened new opportunities in drug discovery, where the ability to understand and modify molecular structures is critical for optimizing drug properties such as efficacy and toxicity. However, existing models and benchmarks often overlook toxicity-related challenges, focusing primarily on general property optimization without adequately addressing safety concerns. In addition, even existing toxicity repair benchmarks suffer from limited data diversity, low structural validity of generated molecules, and heavy reliance on proxy models for toxicity assessment. To address these limitations, we propose MolDeTox, a novel benchmark for molecular detoxification, designed to enable fine-grained and reliable evaluation of toxicity-aware molecular optimization across stepwise tasks. We evaluate a wide range of general-purpose LLMs and VLMs under diverse settings, and demonstrate that understanding and generating molecules at the fragment level improves structural validity and enhances the quality of generated molecules. Moreover, through detailed task-level performance analysis, MolDeTox provides an interpretable benchmark that enables a deeper understanding of the detoxification process. Our dataset is available at: https://huggingface.co/datasets/MolDeTox/MolDeTox

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes MolDeTox, a new benchmark for evaluating LLMs and VLMs on stepwise fragment-level editing tasks for molecular detoxification. It critiques prior toxicity repair benchmarks for limited data diversity, low structural validity, and proxy-model reliance, then claims that fragment-level molecular understanding and generation improves structural validity and overall quality of detoxified molecules while enabling interpretable task-level analysis. A range of general-purpose models is evaluated and the dataset is released.

Significance. If the fragment-level mechanism demonstrably improves detoxification outcomes, the benchmark could provide a useful, more interpretable resource for AI-assisted drug design. The focus on addressing data diversity and validity gaps in existing toxicity benchmarks is constructive, and releasing the dataset supports reproducibility.

major comments (2)
  1. [§4 and §5] §4 (Evaluation Methodology) and §5 (Results): The central claim that fragment-level editing enhances detoxification quality depends on the reliability of the proxy toxicity models used to label success. No correlation studies, calibration against experimental toxicity data, or wet-lab confirmation are reported, despite the abstract explicitly critiquing prior benchmarks for the same proxy reliance. If proxies misclassify toxicity (e.g., due to scaffold biases), gains in structural validity and proxy scores do not establish actual detoxification improvement.
  2. [§3] §3 (Benchmark Construction): Dataset statistics, diversity metrics, and validity rates for the MolDeTox split are not compared quantitatively to prior toxicity repair benchmarks, leaving the claim that the new benchmark overcomes their limitations unsupported by direct evidence.
minor comments (2)
  1. [Abstract] Abstract: Performance improvements are asserted without any numerical results, error bars, or dataset size summaries; a single sentence summarizing key metrics would improve clarity.
  2. [Figures/Tables] Figure and table captions: Ensure all figures reporting model comparisons include the exact proxy model names and thresholds used for toxicity labeling.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the thoughtful and constructive comments on our manuscript. We address each major point below with honest responses and indicate where revisions will be made to strengthen the work.

read point-by-point responses
  1. Referee: [§4 and §5] §4 (Evaluation Methodology) and §5 (Results): The central claim that fragment-level editing enhances detoxification quality depends on the reliability of the proxy toxicity models used to label success. No correlation studies, calibration against experimental toxicity data, or wet-lab confirmation are reported, despite the abstract explicitly critiquing prior benchmarks for the same proxy reliance. If proxies misclassify toxicity (e.g., due to scaffold biases), gains in structural validity and proxy scores do not establish actual detoxification improvement.

    Authors: We agree that proxy-based toxicity assessment is a limitation of the current work, as it is for most computational molecular benchmarks. Our primary contributions focus on demonstrating improvements in structural validity (measured via independent cheminformatics checks such as RDKit sanitization) and on providing interpretable fragment-level task analysis, which do not rely on the toxicity proxies. The stepwise editing mechanism itself is shown to increase the fraction of valid outputs compared to direct generation baselines. We acknowledge the inconsistency with our critique of prior work and will revise §4 and §5 to include an expanded limitations subsection that explicitly discusses proxy reliability, cites known calibration studies from the literature where available, and clarifies that our validity gains are proxy-independent. We cannot add new wet-lab experiments, as this is a benchmark and evaluation paper rather than an experimental study. revision: partial

  2. Referee: [§3] §3 (Benchmark Construction): Dataset statistics, diversity metrics, and validity rates for the MolDeTox split are not compared quantitatively to prior toxicity repair benchmarks, leaving the claim that the new benchmark overcomes their limitations unsupported by direct evidence.

    Authors: We thank the referee for highlighting this gap. In the revised manuscript we will add a dedicated comparison subsection (or table) in §3 that quantitatively reports dataset statistics including number of molecules, average Tanimoto diversity, scaffold diversity, and pre- and post-filtering validity rates, directly juxtaposed against the prior toxicity repair benchmarks referenced in the introduction. This will provide the direct evidence requested and better substantiate our claims regarding improved diversity and validity. revision: yes
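
One of the statistics promised in the second response, average Tanimoto diversity, could be computed along these lines. The sketch assumes that "diversity" means the mean pairwise Tanimoto distance (1 minus similarity) over fingerprint bit sets; the rebuttal does not define it, and the fingerprints below are made-up sets of on-bit indices rather than real molecular fingerprints.

```python
from itertools import combinations


def tanimoto(a: frozenset[int], b: frozenset[int]) -> float:
    """Tanimoto (Jaccard) similarity of two sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def mean_pairwise_diversity(fingerprints: list[frozenset[int]]) -> float:
    """Mean of (1 - Tanimoto similarity) over all unordered pairs."""
    pairs = list(combinations(fingerprints, 2))
    if not pairs:
        raise ValueError("need at least two fingerprints")
    return sum(1.0 - tanimoto(a, b) for a, b in pairs) / len(pairs)


# Toy fingerprints for three hypothetical molecules.
toy_fps = [frozenset({1, 2, 3}), frozenset({2, 3, 4}), frozenset({7, 8})]
diversity = mean_pairwise_diversity(toy_fps)
print(round(diversity, 4))  # 0.8333
```

Reporting the same quantity for MolDeTox and for each prior benchmark, computed with one fingerprint type and one set of parameters, would make the comparison the referee asks for direct.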

standing simulated objections not resolved
  • Experimental wet-lab confirmation or new correlation studies against measured toxicity data, which lie outside the scope of this computational benchmark paper.

Circularity Check

0 steps flagged

No significant circularity in empirical benchmark evaluation

full rationale

The paper proposes the MolDeTox benchmark for stepwise fragment-level molecular detoxification using LLMs and VLMs, then reports empirical results showing improved structural validity and quality. No equations, derivations, fitted parameters, or predictions appear that reduce to inputs by construction. Claims rest on direct evaluation against the new dataset and proxy metrics rather than any self-referential logic or self-citation chain. Proxy toxicity model reliance is a methodological assumption about data quality, not a circular step. The work is self-contained as an empirical study with no load-bearing self-citations or ansatzes.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The benchmark relies on standard assumptions about LLM molecular understanding via string or image representations and the reliability of existing toxicity proxy models; no new free parameters, axioms beyond domain norms, or invented entities are introduced.

axioms (1)
  • domain assumption: LLMs and VLMs can meaningfully process and edit molecular structures represented as text (e.g., SMILES) or images for property optimization tasks
    Invoked when evaluating general-purpose models on fragment editing without additional molecular-specific training details.

pith-pipeline@v0.9.0 · 5522 in / 1296 out tokens · 54450 ms · 2026-05-13T05:19:21.650509+00:00 · methodology

Reference graph

Works this paper leans on

68 extracted references · 68 canonical work pages · 8 internal anchors

  1. [1]

    Training a scientific reasoning model for chemistry.arXiv preprint arXiv:2506.17238, 2025

    Siddharth M Narayanan, James D Braza, Ryan-Rhys Griffiths, Albert Bou, Geemi Wellawatte, Mayk Caldas Ramos, Ludovico Mitchener, Samuel G Rodriques, and Andrew D White. Training a scientific reasoning model for chemistry.arXiv preprint arXiv:2506.17238, 2025

  2. [2]

    ChemDFM-R: A Chemical Reasoning LLM Enhanced with Atomized Chemical Knowledge

    Zihan Zhao, Bo Chen, Ziping Wan, Lu Chen, Xuanze Lin, Shiyang Yu, Situo Zhang, Da Ma, Zichen Zhu, Danyang Zhang, et al. Chemdfm-r: A chemical reasoning llm enhanced with atomized chemical knowledge.arXiv preprint arXiv:2507.21990, 2025

  3. [3]

    Bioreason: Incentivizing multimodal biological reasoning within a dna-llm model.arXiv preprint arXiv:2505.23579, 2025

    Adibvafa Fallahpour, Andrew Magnuson, Purav Gupta, Shihao Ma, Jack Naimer, Arnav Shah, Haonan Duan, Omar Ibrahim, Hani Goodarzi, Chris J Maddison, et al. Bioreason: Incentivizing multimodal biological reasoning within a dna-llm model.arXiv preprint arXiv:2505.23579, 2025

  4. [4]

    Bioreason-pro: Advancing protein function prediction with multimodal biological reasoning.bioRxiv, pages 2026–03, 2026

    Adibvafa Fallahpour, Arman Seyed-Ahmadi, Parsa Idehpour, Omar Ibrahim, Purav Gupta, Jack Naimer, Kevin Zhu, Arnav Shah, Shihao Ma, Abhinav Adduri, et al. Bioreason-pro: Advancing protein function prediction with multimodal biological reasoning.bioRxiv, pages 2026–03, 2026

  5. [5]

    rbio1-training scientific reasoning llms with biological world models as soft verifiers.bioRxiv, pages 2025–08, 2025

    Ana-Maria Istrate, Fausto Milletari, Fabrizio Castrotorres, Jakub M Tomczak, Michaela Torkar, Donghui Li, and Theofanis Karaletsos. rbio1-training scientific reasoning llms with biological world models as soft verifiers.bioRxiv, pages 2025–08, 2025

  6. [6]

    Chemvlm: Exploring the power of multimodal large language models in chemistry area

    Junxian Li, Di Zhang, Xunzhi Wang, Zeying Hao, Jingdi Lei, Qian Tan, Cai Zhou, Wei Liu, Yaotian Yang, Xinrui Xiong, et al. Chemvlm: Exploring the power of multimodal large language models in chemistry area. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 415–423, 2025

  7. [7]

    Molvision: Molecular property prediction with vision language models.arXiv preprint arXiv:2507.03283, 2025

    Deepan Adak, Yogesh Singh Rawat, and Shruti Vyas. Molvision: Molecular property prediction with vision language models.arXiv preprint arXiv:2507.03283, 2025

  8. [8]

    Collaborative expert llms guided multi-objective molecular optimization.arXiv preprint arXiv:2503.03503, 2025

    Jiajun Yu, Yizhen Zheng, Huan Yee Koh, Shirui Pan, Tianyue Wang, and Haishuai Wang. Collaborative expert llms guided multi-objective molecular optimization.arXiv preprint arXiv:2503.03503, 2025

  9. [9]

    Lico: Large language models for in-context molecular optimization.arXiv preprint arXiv:2406.18851, 2024

    Tung Nguyen and Aditya Grover. Lico: Large language models for in-context molecular optimization.arXiv preprint arXiv:2406.18851, 2024

  10. [10]

    Drugassist: A large language model for molecule optimization.Briefings in Bioinformatics, 26(1):bbae693, 2025

    Geyan Ye, Xibao Cai, Houtim Lai, Xing Wang, Junhong Huang, Longyue Wang, Wei Liu, and Xiangxiang Zeng. Drugassist: A large language model for molecule optimization.Briefings in Bioinformatics, 26(1):bbae693, 2025

  11. [11]

    Mollangbench: A comprehensive benchmark for language-prompted molecular structure recognition, editing, and generation.arXiv preprint arXiv:2505.15054, 2025

    Feiyang Cai, Jiahui Bai, Tao Tang, Guijuan He, Joshua Luo, Tianyu Zhu, Srikanth Pilla, Gang Li, Ling Liu, and Feng Luo. Mollangbench: A comprehensive benchmark for language-prompted molecular structure recognition, editing, and generation.arXiv preprint arXiv:2505.15054, 2025

  12. [12]

    Beyond chemical qa: Evaluating llm’s chemical reasoning with modular chemical operations.arXiv preprint arXiv:2505.21318, 2025

    Hao Li, He Cao, Bin Feng, Yanjun Shao, Xiangru Tang, Zhiyuan Yan, Li Yuan, Yonghong Tian, and Yu Li. Beyond chemical qa: Evaluating llm’s chemical reasoning with modular chemical operations.arXiv preprint arXiv:2505.21318, 2025. 10

  13. [13]

    Breaking bad molecules: are mllms ready for structure-level molecular detoxification?arXiv preprint arXiv:2506.10912, 2025

    Fei Lin, Ziyang Gong, Cong Wang, Tengchao Zhang, Yonglin Tian, Yining Jiang, Ji Dai, Chao Guo, Xiaotong Yu, Xue Yang, et al. Breaking bad molecules: are mllms ready for structure-level molecular detoxification?arXiv preprint arXiv:2506.10912, 2025

  14. [14]

    Gotta be safe: a new framework for molecular design.Digital Discovery, 3(4):796–804, 2024

    Emmanuel Noutahi, Cristian Gabellini, Michael Craig, Jonathan SC Lim, and Prudencio Tossou. Gotta be safe: a new framework for molecular design.Digital Discovery, 3(4):796–804, 2024

  15. [15]

    Designing around structural alerts in drug discovery.Journal of medicinal chemistry, 63(12):6276–6302, 2019

    Amit S Kalgutkar. Designing around structural alerts in drug discovery.Journal of medicinal chemistry, 63(12):6276–6302, 2019

  16. [16]

    Mt-mol: Multi agent system with tool-based reasoning for molecular optimization.Artificial Intelligence Repository, 2025

    Hyomin Kim, Yunhui Jang, and Sungsoo Ahn. Mt-mol: Multi agent system with tool-based reasoning for molecular optimization.Artificial Intelligence Repository, 2025

  17. [17]

    Large language models as tools for molecular toxicity prediction: Ai insights into cardiotoxicity.Journal of Chemical Information and Modeling, 65(5):2268–2282, 2025

    Hengzheng Yang, Jian Xiu, Weiqi Yan, Kaifeng Liu, Huizi Cui, Zhibang Wang, Qizheng He, Yilin Gao, and Weiwei Han. Large language models as tools for molecular toxicity prediction: Ai insights into cardiotoxicity.Journal of Chemical Information and Modeling, 65(5):2268–2282, 2025

  18. [18]

    Application of large language models in drug-induced osteotoxicity prediction.Journal of Chemical Information and Modeling, 65(7):3370–3379, 2025

    Yi-Qi Chen, Tao Yu, Zheng-Qi Song, Chen-Yu Wang, Jiang-Tao Luo, Yong Xiao, Heng Qiu, Qing-Qing Wang, and Hai-Ming Jin. Application of large language models in drug-induced osteotoxicity prediction.Journal of Chemical Information and Modeling, 65(7):3370–3379, 2025

  19. [19]

    Cotox: Chain-of-thought-based molecular toxicity reasoning and prediction

    Jueon Park, Yein Park, Minju Song, Soyon Park, Donghyeon Lee, Seungheun Baek, and Jaewoo Kang. Cotox: Chain-of-thought-based molecular toxicity reasoning and prediction. In2025 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pages 4002–4007. IEEE, 2025

  20. [20]

    ToxReason: A Benchmark for Mechanistic Chemical Toxicity Reasoning via Adverse Outcome Pathway

    Jueon Park, Wonjune Jang, Chanhwi Kim, Yein Park, and Jaewoo Kang. Toxreason: A benchmark for mechanistic chemical toxicity reasoning via adverse outcome pathway.arXiv preprint arXiv:2604.06264, 2026

  21. [21]

    Quantifying the chemical beauty of drugs.Nature chemistry, 4(2):90–98, 2012

    G Richard Bickerton, Gaia V Paolini, Jérémy Besnard, Sorel Muresan, and Andrew L Hopkins. Quantifying the chemical beauty of drugs.Nature chemistry, 4(2):90–98, 2012

  22. [22]

    Txgemma: Efficient and agentic llms for therapeutics

    Eric Wang, Samuel Schmidgall, Paul F Jaeger, Fan Zhang, Rory Pilgrim, Yossi Matias, Joelle Barral, David Fleet, and Shekoofeh Azizi. Txgemma: Efficient and agentic llms for therapeutics. arXiv preprint arXiv:2504.06196, 2025

  23. [23]

    Drug- induced liver injury severity and toxicity (dilist): binary classification of 1279 drugs by human hepatotoxicity.Drug discovery today, 25(1):201–208, 2020

    Shraddha Thakkar, Ting Li, Zhichao Liu, Leihong Wu, Ruth Roberts, and Weida Tong. Drug- induced liver injury severity and toxicity (dilist): binary classification of 1279 drugs by human hepatotoxicity.Drug discovery today, 25(1):201–208, 2020

  24. [24]

    Dictrank: The largest reference list of 1318 human drugs ranked by risk of drug-induced cardiotoxicity using fda labeling.Drug Discovery Today, 28(11):103770, 2023

    Yanyan Qu, Ting Li, Zhichao Liu, Dongying Li, and Weida Tong. Dictrank: The largest reference list of 1318 human drugs ranked by risk of drug-induced cardiotoxicity using fda labeling.Drug Discovery Today, 28(11):103770, 2023

  25. [25]

    Generation of a drug-induced renal injury list to facilitate the development of new approach methodologies for nephrotoxicity.Drug discovery today, 29(4):103938, 2024

    Skylar Connor, Ting Li, Yanyan Qu, Ruth A Roberts, and Weida Tong. Generation of a drug-induced renal injury list to facilitate the development of new approach methodologies for nephrotoxicity.Drug discovery today, 29(4):103938, 2024

  26. [26]

    Admet eval- uation in drug discovery

    Shuangquan Wang, Huiyong Sun, Hui Liu, Dan Li, Youyong Li, and Tingjun Hou. Admet eval- uation in drug discovery. 16. predicting herg blockers by combining multiple pharmacophores and machine learning approaches.Molecular pharmaceutics, 13(8):2855–2866, 2016

  27. [27]

    Fang Du, Haibo Yu, Beiyan Zou, Joseph Babcock, Shunyou Long, and Min Li. hergcentral: a large database to store, retrieve, and analyze compound-human ether-a-go-go related gene channel interactions to facilitate cardiotoxicity assessment in drug development.Assay and drug development technologies, 9(6):580–588, 2011

  28. [28]

    Cardiotox net: a robust pre- dictor for herg channel blockade based on deep learning meta-feature ensembles.Journal of cheminformatics, 13(1):60, 2021

    Abdul Karim, Matthew Lee, Thomas Balle, and Abdul Sattar. Cardiotox net: a robust pre- dictor for herg channel blockade based on deep learning meta-feature ensembles.Journal of cheminformatics, 13(1):60, 2021. 11

  29. [29]

    In silico prediction of chemical ames mutagenicity.Journal of chemical information and modeling, 52(11):2840–2847, 2012

    Congying Xu, Feixiong Cheng, Lei Chen, Zheng Du, Weihua Li, Guixia Liu, Philip W Lee, and Yun Tang. In silico prediction of chemical ames mutagenicity.Journal of chemical information and modeling, 52(11):2840–2847, 2012

  30. [30]

    Predicting chemically-induced skin reactions

    Vinicius M Alves, Eugene Muratov, Denis Fourches, Judy Strickland, Nicole Kleinstreuer, Carolina H Andrade, and Alexander Tropsha. Predicting chemically-induced skin reactions. part i: Qsar models of skin sensitization and their application to identify potentially hazardous compounds.Toxicology and applied pharmacology, 284(2):262–272, 2015

  31. [31]

    Ruili Huang, Menghang Xia, Dac-Trung Nguyen, Tongan Zhao, Srilatha Sakamuru, Jinghua Zhao, Sampada A Shahane, Anna Rossoshek, and Anton Simeonov. Tox21challenge to build predictive models of nuclear receptor and stress response pathways as mediated by exposure to environmental chemicals and drugs.Frontiers in Environmental Science, 3:85, 2016

  32. [32]

    A data-driven approach to predicting successes and failures of clinical trials.Cell chemical biology, 23(10):1294–1301, 2016

    Kaitlyn M Gayvert, Neel S Madhukar, and Olivier Elemento. A data-driven approach to predicting successes and failures of clinical trials.Cell chemical biology, 23(10):1294–1301, 2016

  33. [33]

    Comprehensive characteriza- tion of cytochrome p450 isozyme selectivity across chemical libraries.Nature biotechnology, 27(11):1050–1055, 2009

    Henrike Veith, Noel Southall, Ruili Huang, Tim James, Darren Fayne, Natalia Artemenko, Min Shen, James Inglese, Christopher P Austin, David G Lloyd, et al. Comprehensive characteriza- tion of cytochrome p450 isozyme selectivity across chemical libraries.Nature biotechnology, 27(11):1050–1055, 2009

  34. [34]

    Targeted but troubling: Cyp450 inhibition by kinase and parp inhibitors and its clinical implications.Drugs and drug candidates, 4(2):24, 2025

    Martin Kondža, Josipa Buki´c, Ivan ´Cavar, and Biljana Tubi´c. Targeted but troubling: Cyp450 inhibition by kinase and parp inhibitors and its clinical implications.Drugs and drug candidates, 4(2):24, 2025

  35. [35]

    The sider database of drugs and side effects.Nucleic acids research, 44(D1):D1075–D1079, 2016

    Michael Kuhn, Ivica Letunic, Lars Juhl Jensen, and Peer Bork. The sider database of drugs and side effects.Nucleic acids research, 44(D1):D1075–D1079, 2016

  36. [36]

    Exposing the limitations of molecular machine learning with activity cliffs.Journal of chemical information and modeling, 62(23):5938–5951, 2022

    Derek Van Tilborg, Alisa Alenicheva, and Francesca Grisoni. Exposing the limitations of molecular machine learning with activity cliffs.Journal of chemical information and modeling, 62(23):5938–5951, 2022

  37. [37]

    Graphcliff: Short-long range gating for subtle differences but critical changes.arXiv preprint arXiv:2511.03170, 2025

    Hajung Kim, Jueon Park, Junseok Choe, Sheunheun Baek, Hyeon Hwang, and Jaewoo Kang. Graphcliff: Short-long range gating for subtle differences but critical changes.arXiv preprint arXiv:2511.03170, 2025

  38. [38]

    Matched molecular pair analysis in drug discovery: methods and recent applications.Journal of Medicinal Chemistry, 66(7):4361–4377, 2023

    Ziyi Yang, Shaohua Shi, Li Fu, Aiping Lu, Tingjun Hou, and Dongsheng Cao. Matched molecular pair analysis in drug discovery: methods and recent applications.Journal of Medicinal Chemistry, 66(7):4361–4377, 2023

  39. [39]

    An outliers detection and elimination framework in classification task of data mining.Decision Analytics Journal, 6:100164, 2023

    Ch Sanjeev Kumar Dash, Ajit Kumar Behera, Satchidananda Dehuri, and Ashish Ghosh. An outliers detection and elimination framework in classification task of data mining.Decision Analytics Journal, 6:100164, 2023

  40. [40]

    Experimental and computational approaches to estimate solubility and permeability in drug discovery and development settings.Advanced drug delivery reviews, 23(1-3):3–25, 1997

    Christopher A Lipinski, Franco Lombardo, Beryl W Dominy, and Paul J Feeney. Experimental and computational approaches to estimate solubility and permeability in drug discovery and development settings.Advanced drug delivery reviews, 23(1-3):3–25, 1997

  41. [41]

    Molecular properties that influence the oral bioavailability of drug candidates

    Daniel F Veber, Stephen R Johnson, Hung-Yuan Cheng, Brian R Smith, Keith W Ward, and Ken- neth D Kopple. Molecular properties that influence the oral bioavailability of drug candidates. Journal of medicinal chemistry, 45(12):2615–2623, 2002

  42. [42]

    GPT-4o System Card

    Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card.arXiv preprint arXiv:2410.21276, 2024

  43. [43]

    Update to gpt-5 system card: Gpt-5.2, December 2025

    OpenAI. Update to gpt-5 system card: Gpt-5.2, December 2025. URL https://cdn.openai. com/pdf/3a4153c8-c748-4b71-8e31-aecbde944f8d/oai_5_2_system-card.pdf

  44. [44]

    Gemini 3.1 flash lite model card, 2025

    Google DeepMind. Gemini 3.1 flash lite model card, 2025. URL https://storage.googleapis.com/deepmind-media/Model-Cards/ Gemini-3-1-Flash-Lite-Model-Card.pdf. 12

  45. [45]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

  46. [46]

    The Llama 3 Herd of Models

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024

  47. [47]

    Gemma 3 Technical Report

    Gemma Team, Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Rivière, et al. Gemma 3 technical report.arXiv preprint arXiv:2503.19786, 2025

  48. [48]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025

  49. [49]

    Qwen3-VL Technical Report

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-VL technical report. arXiv preprint arXiv:2511.21631, 2025

  50. [50]

    Visual Instruction Tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in neural information processing systems, 36:34892–34916, 2023

  51. [51]

    Molrag: unlocking the power of large language models for molecular property prediction

    Ziting Xian, Jiawei Gu, Lingbo Li, and Shangsong Liang. Molrag: unlocking the power of large language models for molecular property prediction. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15513–15531, 2025

  52. [52]

    Knowmol: Advancing molecular large language models with multi-level chemical knowledge

    Zaifei Yang, Hong Chang, Ruibing Hou, Shiguang Shan, and Xilin Chen. Knowmol: Advancing molecular large language models with multi-level chemical knowledge. arXiv preprint arXiv:2510.19484, 2025

  53. [53]

    Toxalerts: a web server of structural alerts for toxic chemicals and compounds with potential adverse reactions

    Iurii Sushko, Elena Salmina, Vladimir A Potemkin, Gennadiy Poda, and Igor V Tetko. Toxalerts: a web server of structural alerts for toxic chemicals and compounds with potential adverse reactions, 2012.

    A ToxicityCliff Construction Details. Table A: Statistics of ToxicityCliff, including the number of endpoints, molecules, toxicity cliff pairs, and result...

  54. [54]

    - Identify the fragment(s) that are specific to the toxic molecule and are most likely responsible for the toxicity signal

    Focus on toxicity-associated fragment identification.
    - Identify the fragment(s) that are specific to the toxic molecule and are most likely responsible for the toxicity signal.
    - If multiple fragments are required, return all of them.
    - Preserve the original SAFE fragment format exactly

  55. [55]

    - If there are multiple fragments, concatenate them as a dot-separated SAFE string

    Output format constraints:
    - Return the answer as the toxic-only SAFE fragment string.
    - If there are multiple fragments, concatenate them as a dot-separated SAFE string.
    - Do not paraphrase fragment content or convert it into natural language

  56. [56]


    Response format: { "answer": "..." }

    HARD CONSTRAINTS:
    - Output ONLY the JSON object.
    - Do not include explanations, markdown, or extra text.
    - Do not add any extra keys.
    - The value of "answer" must be the toxic-only SAFE fragment string exactly.

    Table H: Task 1 system prompt for toxic fragment identification.

    Task 2 SYSTEM PROMPT You are a molecular to...

  57. [57]

    - Generate the fragment(s) that can replace the toxic fragment(s) while reducing toxicity

    Focus on non-toxic fragment generation.
    - Generate the fragment(s) that can replace the toxic fragment(s) while reducing toxicity.
    - If multiple fragments are required, return all of them.
    - Preserve the original SAFE fragment format exactly

  58. [58]

    - If there are multiple fragments, concatenate them as a dot-separated SAFE string

    Output format constraints:
    - Return the answer as the non-toxic-only SAFE fragment string.
    - If there are multiple fragments, concatenate them as a dot-separated SAFE string.
    - Do not paraphrase fragment content or convert it into natural language

  59. [59]


    Response format: { "answer": "..." }

    HARD CONSTRAINTS:
    - Output ONLY the JSON object.
    - Do not include explanations, markdown, or extra text.
    - Do not add any extra keys.
    - The value of "answer" must be the non-toxic-only SAFE fragment string exactly.

    Table I: Task 2 system prompt for non-toxic fragment generation.

    Task 3 SMILES Generation SYSTEM PROMPT ...

  60. [60]

    - Generate a chemically plausible non-toxic molecule

    Focus on non-toxic molecule generation.
    - Generate a chemically plausible non-toxic molecule.
    - Reduce toxicity while preserving the original molecular characteristics as much as possible.
    - Return the final molecule, not intermediate fragments

  61. [61]

    - Do not return SAFE fragments

    Output format constraints:
    - Return the answer as a single non-toxic molecule SMILES string.
    - Do not return SAFE fragments.
    - Do not return multiple candidates

  62. [62]


    Response format: { "answer": "..." }

    HARD CONSTRAINTS:
    - Output ONLY the JSON object.
    - Do not include explanations, markdown, or extra text.
    - Do not add any extra keys.
    - The value of "answer" must be the final non-toxic molecule SMILES string.

    Table J: Task 3 system prompt for direct non-toxic molecule generation.

    Task 3 SAFE Generation SYSTEM PROMPT ...

  63. [63]

    - Generate the full SAFE representation of the resulting non-toxic molecule

    Focus on non-toxic SAFE generation.
    - Generate the full SAFE representation of the resulting non-toxic molecule.
    - Reduce toxicity while preserving the original molecular characteristics as much as possible.
    - Return the complete molecule-level SAFE representation, not only edited fragments

  64. [64]

    - If multiple fragments are present, concatenate them as a dot-separated SAFE string

    Output format constraints:
    - Return the answer as the full non-toxic SAFE string.
    - If multiple fragments are present, concatenate them as a dot-separated SAFE string.
    - Do not paraphrase fragment content or convert it into natural language

  65. [65]


    Response format: { "answer": "..." }

    HARD CONSTRAINTS:
    - Output ONLY the JSON object.
    - Do not include explanations, markdown, or extra text.
    - Do not add any extra keys.
    - The value of "answer" must be the final full non-toxic SAFE string for the whole molecule.

    Table K: Task 3 system prompt for full non-toxic SAFE generation.

    Task 3 Step-wise CoT SAFE...

  66. [66]

    - First identify the toxic fragment(s) most likely associated with toxicity

    Use step-wise reasoning.
    - First identify the toxic fragment(s) most likely associated with toxicity.
    - Then generate the corresponding non-toxic replacement fragment(s).
    - Finally generate the full non-toxic molecule as a SAFE string

  67. [67]

    - The JSON must include the final answer and all required intermediate fields

    Output format constraints:
    - Return a single JSON object.
    - The JSON must include the final answer and all required intermediate fields.
    - Do not omit any required fields

  68. [68]


    Response format: { "answer": "...", "step1_only_toxic_safe_fragments": "...", "step1_reasoning": "...", "step2_only_nontoxic_safe_fragments": "...", "step2_reasoning": "...", "step3_reasoning": "..." }

    HARD CONSTRAINTS:
    - Output ONLY the JSON object.
    - Do not include markdown or extra text outside the JSON.
    - The value of "answer" must be the final full n...
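
    The response formats quoted in the prompts above lend themselves to a mechanical check before any chemistry-level scoring. The following is a minimal sketch, not the benchmark's own evaluation code: the function names `parse_response` and `split_fragments` are hypothetical, and the field names are copied from the quoted response formats. Two points are assumptions rather than statements from the source: that the step-wise format rejects extra keys (stated only for the single-answer prompts), and that every field must be a non-empty string. The sketch checks only the JSON envelope and splits on dots. It does not verify that a fragment is valid SAFE or SMILES.

```python
import json

# Field names copied from the response formats quoted in the prompts.
SINGLE_ANSWER_KEYS = {"answer"}
STEPWISE_KEYS = {
    "answer",
    "step1_only_toxic_safe_fragments",
    "step1_reasoning",
    "step2_only_nontoxic_safe_fragments",
    "step2_reasoning",
    "step3_reasoning",
}


def parse_response(raw: str, required_keys: set) -> dict:
    """Parse a model reply and enforce the JSON-only, exact-keys constraints."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError as exc:
        # Markdown fences or surrounding prose make the reply fail here.
        raise ValueError(f"reply is not a bare JSON object: {exc}") from exc
    if not isinstance(obj, dict):
        raise ValueError("reply must be a JSON object")
    missing = required_keys - set(obj)
    extra = set(obj) - required_keys
    if missing or extra:
        raise ValueError(
            f"missing keys {sorted(missing)}, extra keys {sorted(extra)}"
        )
    for key, value in obj.items():
        # Assumption: every field is a non-empty string.
        if not isinstance(value, str) or not value.strip():
            raise ValueError(f"field {key!r} must be a non-empty string")
    return obj


def split_fragments(answer: str) -> list:
    """Split a dot-separated fragment string into its individual fragments."""
    return answer.split(".")


if __name__ == "__main__":
    # Placeholder strings, not real SAFE fragments.
    reply = '{"answer": "FRAG_A.FRAG_B"}'
    parsed = parse_response(reply, SINGLE_ANSWER_KEYS)
    print(split_fragments(parsed["answer"]))
```

    Passing `STEPWISE_KEYS` instead of `SINGLE_ANSWER_KEYS` applies the same check to the step-wise chain-of-thought format. Structural validity of the returned string would still need a separate chemistry toolkit check.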