arxiv: 2604.18316 · v2 · submitted 2026-04-20 · 🧬 q-bio.OT · cs.LG

Recognition: unknown

Predictive Modelling of Natural Medicinal Compounds for Alzheimer disease Using Machine Learning and Cheminformatics

Hafiza Syeda Yusra Tirmizi, Muhammad Faris, Rabail Khowaja, Saad Abdullah, Syed Ibad Hasnain

Pith reviewed 2026-05-10 02:59 UTC · model grok-4.3

keywords machine learning · natural compounds · Alzheimer disease · cheminformatics · Random Forest · molecular descriptors · neuroprotective activity

The pith

Machine learning models using molecular descriptors can predict neuroprotective activity of natural compounds for Alzheimer disease.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds classification models to flag natural compounds as likely active or inactive against Alzheimer disease based only on easy-to-compute chemical properties. It pulls labeled molecules from public databases, calculates descriptors such as molecular weight, lipophilicity, and polar surface area, trains several algorithms, and reports that ensemble methods reach the highest accuracy and ROC-AUC scores. Feature importance analysis points to lipophilicity, size, and polarity as the main drivers of predicted activity. A sympathetic reader would care because the method offers an inexpensive filter for large natural-product collections before any laboratory testing begins.

Core claim

The study shows that Random Forest and similar ensemble classifiers achieve the best accuracy and ROC-AUC when distinguishing active from inactive natural compounds using RDKit-derived physicochemical descriptors. Feature importance analysis identifies lipophilicity, molecular weight, and polarity as the most influential properties for the predicted neuroprotective effects.
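A rough sketch of how a feature ranking of this kind can be checked. The paper's ranking comes from Random Forest feature importance, and the exact variant is not stated here; the code below swaps in permutation importance, a model-agnostic alternative, and uses a hypothetical stand-in classifier (`toy_model`) and invented descriptor rows purely for illustration.

```python
import random


def accuracy(model, rows, labels):
    """Fraction of rows where the model's prediction matches the label."""
    return sum(1 for x, y in zip(rows, labels) if model(x) == y) / len(labels)


def permutation_importance(model, rows, labels, n_repeats=20, seed=0):
    """Mean drop in accuracy when a single feature column is shuffled."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    importances = []
    for j in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            column = [x[j] for x in rows]
            rng.shuffle(column)
            # Rebuild every row with only column j replaced by its shuffled value.
            shuffled = [x[:j] + [v] + x[j + 1:] for x, v in zip(rows, column)]
            drops.append(baseline - accuracy(model, shuffled, labels))
        importances.append(sum(drops) / n_repeats)
    return importances


def toy_model(x):
    """Hypothetical classifier: calls a compound active when LogP > 2.5."""
    return 1 if x[0] > 2.5 else 0


# Invented rows of [LogP, molecular weight]; not data from the paper.
rows = [[4.1, 310.0], [3.8, 420.0], [3.2, 280.0], [2.9, 350.0],
        [1.9, 300.0], [1.2, 410.0], [0.7, 290.0], [2.1, 360.0]]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

importances = permutation_importance(toy_model, rows, labels)
print(importances)  # LogP gets a positive score, molecular weight gets 0.0
```

Because the stand-in model ignores molecular weight, shuffling that column cannot change any prediction and its importance is exactly zero. A ranking produced this way reflects what the model relies on, not necessarily what drives biological activity.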

What carries the argument

Random Forest ensemble classifier trained on RDKit-computed molecular descriptors (molecular weight, LogP, TPSA, hydrogen-bond counts) from ChEMBL- and PubChem-labeled natural compounds.
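To make the input side concrete, here is a minimal sketch of a descriptor record and the feature vector a classifier would consume. The paper computes these values with RDKit; in this sketch they are typed in by hand, and the class name, field names, and example values are all hypothetical. The rule-of-five check (molecular weight ≤ 500, LogP ≤ 5, at most 5 hydrogen-bond donors, at most 10 acceptors) is the standard Lipinski formulation, included only because the Figure 2 caption refers to Lipinski's rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Descriptors:
    """Hand-entered stand-in for RDKit-computed descriptors."""
    mol_weight: float   # g/mol
    logp: float         # octanol-water partition coefficient (log scale)
    tpsa: float         # topological polar surface area, square angstroms
    h_donors: int
    h_acceptors: int


def feature_vector(d):
    """Flatten a record into the numeric row a classifier would be trained on."""
    return [d.mol_weight, d.logp, d.tpsa, float(d.h_donors), float(d.h_acceptors)]


def lipinski_violations(d):
    """Count how many of the four rule-of-five limits the compound exceeds."""
    return sum([
        d.mol_weight > 500,
        d.logp > 5,
        d.h_donors > 5,
        d.h_acceptors > 10,
    ])


# Illustrative values only; not taken from the paper's dataset.
small = Descriptors(mol_weight=368.4, logp=3.2, tpsa=93.1, h_donors=2, h_acceptors=6)
large = Descriptors(mol_weight=720.0, logp=6.1, tpsa=180.0, h_donors=6, h_acceptors=12)

print(feature_vector(small))
print(lipinski_violations(small), lipinski_violations(large))
```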

If this is right

Large natural-product libraries can be screened for predicted anti-dementia activity using only computed chemical properties.
Laboratory resources can be directed first toward compounds the model scores as high-probability actives.
Lipophilicity and polarity can serve as primary criteria when designing or selecting new neuroprotective candidates.
The same descriptor-based workflow can be repeated for other neurodegenerative conditions.
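The prioritization step in the list above can be sketched as a simple ranking over predicted probabilities. The compound names, probabilities, and the 0.5 cutoff are hypothetical; the paper does not specify how a shortlist would be drawn.

```python
def prioritize(scores, top_k, min_probability=0.5):
    """Return up to top_k compound names, highest predicted probability first.

    Ties are broken alphabetically so the ordering is deterministic.
    """
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [name for name, p in ranked if p >= min_probability][:top_k]


# Hypothetical model outputs: predicted probability of the "active" class.
predicted = {"cmpd_a": 0.91, "cmpd_b": 0.12, "cmpd_c": 0.67, "cmpd_d": 0.55}

shortlist = prioritize(predicted, top_k=2)
print(shortlist)  # ['cmpd_a', 'cmpd_c']
```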

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the trained model generalizes to compounds outside the original databases, it could cut the volume of initial high-throughput assays required.
Pairing the classifier with structure-based docking or similarity searches could further narrow the candidate list before synthesis or purchase.
Public databases become more valuable when used to bootstrap predictive filters rather than only as lookup tables.
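As an illustration of the similarity-search pairing suggested above, the sketch below scores a candidate against known actives with the Tanimoto coefficient. Fingerprints are represented as sets of on-bit indices; the bit values and compound names are invented, and in practice the fingerprints would come from a cheminformatics toolkit rather than being written by hand.

```python
def tanimoto(a, b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    if not a and not b:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / len(a | b)


def nearest_known_active(candidate, actives):
    """Return (name, similarity) of the most similar known active."""
    return max(
        ((name, tanimoto(candidate, fp)) for name, fp in actives.items()),
        key=lambda pair: pair[1],
    )


# Hypothetical fingerprints; not derived from real molecules.
known_actives = {
    "active_1": frozenset({1, 2, 3, 8}),
    "active_2": frozenset({5, 6, 7}),
}
candidate = frozenset({2, 3, 8, 9})

print(nearest_known_active(candidate, known_actives))  # ('active_1', 0.6)
```

A candidate that the classifier scores highly and that also sits close to a known active in fingerprint space would be a natural first pick; one that scores highly but resembles nothing in the training set is where the model is extrapolating.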

Load-bearing premise

Labels for active and inactive compounds taken from public databases correctly represent true anti-dementia biological activity without major errors or bias.

What would settle it

Laboratory testing of a sample of compounds the Random Forest model ranks as highly active, using a standard cell-based or animal model of Alzheimer disease, shows no neuroprotective effect above controls.

Figures

Figures reproduced from arXiv: 2604.18316 by Hafiza Syeda Yusra Tirmizi, Muhammad Faris, Rabail Khowaja, Saad Abdullah, Syed Ibad Hasnain.

**Figure 2.** Comparison of Key Molecular Descriptors Between Active and Inactive Compounds (Lipinski’s Rule…

**Figure 3.** Correlation Matrix of Molecular Descriptors for Natural Compounds

**Figure 4.** Scatter plot showing the distribution of active (anti-Alzheimer) and inactive compounds in chemical space

**Figure 5.** ML Model Performance comparison

**Figure 6.** ROC Comparison of all ML Models

**Figure 7.** Confusion Matrix Comparison

**Figure 8.** Feature Importance using Random Forest. Body text captured with the caption: "… Network demonstrated similar performance but with slightly higher misclassification, suggesting less stability compared to the other models. The ROC curve analysis further supports these findings, where Random Forest, Logistic Regression, and MLP Neural Network achieved the highest AUC values (∼0.837), indicating good discrimination between active and inactive compoun…"

**Figure 9.** Predicted Anti-AD Activity Probability. Body text captured with the caption: "… high confidence, suggesting that it may not exhibit significant anti-dementia activity within the context of the trained model. These results highlight the model’s ability to distinguish between likely active and inactive compounds based on learned molecular features. Overall, the prediction results demonstrate the practical applicability of the developed model for sc…"

Original abstract

Alzheimer disease (AD) is a neurodegenerative disease that lacks specific treatment options. Natural drugs have displayed neuroprotective effects; however, their high-throughput discovery is challenging because of the expense of experimental testing. The study proposed a machine learning approach to identify the anti-dementia activity of natural compounds based on molecular descriptors obtained from cheminformatics. The study used a set of active and inactive compounds obtained from public databases like ChEMBL and PubChem. Various molecular descriptors, including molecular weight, lipophilicity (LogP), topological polar surface area (TPSA), and hydrogen bonding descriptors, were calculated with RDKit. Data preprocessing and feature selection were applied, followed by the development of several classification models (Random Forest, XGBoost, Support Vector Machines, Logistic Regression) and their evaluation based on accuracy, precision, recall, F1-score and ROC-AUC. The outcome suggests that ensemble techniques, such as Random Forest, delivered the best predictive accuracy and ROC-AUC values. This study also highlights that critical physicochemical descriptors, in particular lipophilicity, molecular weight and polarity, are important in driving neuroprotective activity, as identified by feature importance analysis. The integrated machine learning approach shows the potential of combining natural product research and machine learning in early drug discovery for dementia. It provides a means of rapidly exploring large datasets and selecting candidates for experimental confirmation, thus minimising costs and time in the development of drugs for neurodegenerative diseases.
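For readers who want the evaluation metrics named in the abstract spelled out, the following is a minimal sketch that computes them from scratch on invented labels and scores. Nothing here reproduces the paper's numbers; the 0.5 decision threshold and the toy data are assumptions. ROC-AUC is computed through its rank interpretation: the probability that a randomly chosen active receives a higher score than a randomly chosen inactive, with ties counted as one half.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels where 1 means active."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn


def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for the active class."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def roc_auc(y_true, scores):
    """ROC-AUC via pairwise comparison of active and inactive scores."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one active and one inactive")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Invented example: 3 actives, 5 inactives, with model scores.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2, 0.2, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

metrics = classification_metrics(y_true, y_pred)
auc = roc_auc(y_true, scores)
print(metrics["accuracy"], auc)  # 0.75 and 14/15 (about 0.933)
```

Note that accuracy, precision, recall and F1 all depend on the chosen threshold, while ROC-AUC depends only on how the scores rank the compounds.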

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Standard ML pipeline on database-labeled natural compounds for AD activity prediction, but no performance numbers or label validation details are given.

This paper collects natural compounds labeled active or inactive for Alzheimer disease from ChEMBL and PubChem, computes standard RDKit descriptors such as molecular weight, LogP, TPSA and hydrogen-bond counts, then trains Random Forest, XGBoost, SVM and logistic regression models to predict those labels. It reports that Random Forest gives the best results and that lipophilicity, size and polarity rank highest in feature importance. That is the core of the work.

The approach follows a common cheminformatics template that has already been applied to other targets, so the contribution is mainly the domain-specific run rather than new methods or theory. The authors do at least compare several classifiers and include feature selection, which keeps the pipeline from being completely trivial.

The soft spots are more serious. The abstract supplies zero quantitative results: no compound counts, no accuracy or AUC values, no cross-validation scheme, no baseline comparisons. Without those numbers it is impossible to judge whether Random Forest actually outperforms the others or whether the feature rankings reflect anything beyond general drug-like properties. The stress-test note on label noise is on target: database entries often derive from single-target assays with varying thresholds, and compounds without any measurement are treated as inactive. Models trained on that data can easily pick up spurious correlations instead of true neuroprotective signals. The paper does not appear to filter assay types or provide an independent validation set, so the claimed superiority and mechanistic insights rest on unverified assumptions.

This is the sort of paper that might interest someone already running similar screens for natural products in neurodegeneration and looking for one more data point. It will not move the field or give a reader new tools. I would not cite it in my own work. It could still go to peer review if the authors add the missing dataset statistics, exact performance figures, and a clear discussion of how they handled label quality and potential assay heterogeneity; otherwise the claims stay too thin to evaluate.
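The label-quality concern can be made concrete with a small sketch of one way to derive labels from potency data while keeping unmeasured compounds out of the inactive class. The 10 μM cutoff is an assumption borrowed from the example value floated in the simulated rebuttal further down this page; the paper itself states no threshold, and the compound names and IC50 values are invented.

```python
from typing import Optional

# Assumed cutoff in micromolar; the paper does not report one.
ACTIVE_IC50_UM = 10.0


def label_from_ic50(ic50_um: Optional[float],
                    cutoff_um: float = ACTIVE_IC50_UM) -> Optional[int]:
    """Map an IC50 to a label: 1 = active, 0 = measured inactive.

    A missing measurement returns None so the compound stays unlabeled
    instead of being silently counted as inactive.
    """
    if ic50_um is None:
        return None
    if ic50_um <= 0:
        raise ValueError("IC50 must be positive")
    return 1 if ic50_um <= cutoff_um else 0


# Invented records: (compound name, IC50 in micromolar or None if untested).
records = [("cmpd_a", 0.8), ("cmpd_b", 45.0), ("cmpd_c", None)]

labeled = {name: label_from_ic50(value) for name, value in records}
training_set = {name: y for name, y in labeled.items() if y is not None}
print(labeled)
print(training_set)
```

Whether the paper handled untested compounds this way is not stated; the sketch only shows the distinction the objection turns on.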

Referee Report

2 major / 1 minor

Summary. The manuscript outlines a standard cheminformatics-ML pipeline to classify natural compounds for anti-dementia (neuroprotective) activity. Active and inactive labels are taken from ChEMBL and PubChem; RDKit descriptors (molecular weight, LogP, TPSA, hydrogen-bond counts) are computed; data are preprocessed and features selected; four classifiers (Random Forest, XGBoost, SVM, Logistic Regression) are trained and ranked by accuracy, precision, recall, F1-score and ROC-AUC. The central claims are that Random Forest yields the highest performance and that lipophilicity, molecular weight and polarity are the most important drivers of activity according to feature-importance analysis.

Significance. If the performance numbers and feature rankings survive rigorous validation, the work supplies a practical, low-cost virtual screen for prioritizing natural-product libraries in Alzheimer’s drug discovery. The use of public databases and an ensemble of standard models is a modest but reproducible contribution that could reduce the experimental burden on natural-product screening. Credit is due for the explicit comparison of multiple classifiers and for highlighting physicochemical trends that align with known CNS drug-likeness rules.

major comments (2)

[Abstract and Methods (Data Acquisition)] The binary active/inactive labels are taken directly from ChEMBL and PubChem without any description of the underlying assays, IC50 thresholds, or phenotypic versus target-based criteria. Because the central claim (that Random Forest is superior and that LogP/MW/TPSA drive neuroprotective activity) rests entirely on these labels, the absence of assay-type filtering or confirmation of true negatives is load-bearing. Noisy or assay-specific labels would inflate both accuracy and the reported feature importances.
[Results] The abstract asserts that Random Forest delivered the best accuracy and ROC-AUC, yet supplies neither dataset size, active/inactive ratio, cross-validation scheme, nor numerical performance values. Without these quantities it is impossible to judge whether the claimed superiority is statistically meaningful or merely reflects an imbalanced or over-fitted training set.
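Why the active/inactive ratio matters can be shown with a short sketch. The 10% active rate below is hypothetical, since the paper's class balance is not reported here. On such a set, a model that labels every compound inactive already scores 90% accuracy, while balanced accuracy (the mean of per-class recall) exposes it as no better than chance.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most common class."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)


def balanced_accuracy(y_true, y_pred):
    """Mean recall over the classes present in y_true."""
    recalls = []
    for cls in sorted(set(y_true)):
        members = [i for i, t in enumerate(y_true) if t == cls]
        hits = sum(1 for i in members if y_pred[i] == cls)
        recalls.append(hits / len(members))
    return sum(recalls) / len(recalls)


# Hypothetical dataset: 10 actives and 90 inactives.
y_true = [1] * 10 + [0] * 90
always_inactive = [0] * 100

baseline = majority_baseline_accuracy(y_true)
balanced = balanced_accuracy(y_true, always_inactive)
print(baseline, balanced)  # 0.9 0.5
```

Any reported accuracy is only informative relative to this baseline, which is why the comment asks for the class ratio alongside the metrics.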

minor comments (1)

[Abstract] Inclusion of at least the final dataset size, the best ROC-AUC value, and the top-three feature-importance ranks would allow readers to assess the claims immediately.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback, which has helped us identify areas where the manuscript can be strengthened for clarity and scientific rigor. We address each major comment point by point below and will incorporate the necessary revisions into the next version of the manuscript.

Referee: [Abstract and Methods (Data Acquisition)] The binary active/inactive labels are taken directly from ChEMBL and PubChem without any description of the underlying assays, IC50 thresholds, or phenotypic versus target-based criteria. Because the central claim (that Random Forest is superior and that LogP/MW/TPSA drive neuroprotective activity) rests entirely on these labels, the absence of assay-type filtering or confirmation of true negatives is load-bearing. Noisy or assay-specific labels would inflate both accuracy and the reported feature importances.

Authors: We agree that explicit details on label derivation are essential for validating the central claims. The original manuscript summarized the sources but did not elaborate on assay specifics. In the revised version, we will expand the Data Acquisition section to specify the ChEMBL and PubChem query criteria, including IC50 thresholds for active compounds (e.g., IC50 ≤ 10 μM where available), assay types (target-based vs. phenotypic), and any steps taken to confirm true negatives or filter noisy data. This addition will directly address concerns about label reliability and support the reported feature importances.

Revision: yes

Referee: [Results] The abstract asserts that Random Forest delivered the best accuracy and ROC-AUC, yet supplies neither dataset size, active/inactive ratio, cross-validation scheme, nor numerical performance values. Without these quantities it is impossible to judge whether the claimed superiority is statistically meaningful or merely reflects an imbalanced or over-fitted training set.

Authors: We acknowledge that the absence of these quantitative details limits the ability to assess model performance rigorously. The current manuscript provides only qualitative statements about Random Forest superiority. In the revision, we will add to the Results section (and update the abstract if space permits) the exact dataset size, active/inactive class ratio, cross-validation procedure (e.g., stratified 5-fold CV with hyperparameter tuning), and all numerical metrics (accuracy, ROC-AUC, precision, recall, F1) for each classifier. We will also include a brief discussion of class imbalance handling and any statistical comparisons to confirm the significance of Random Forest's performance.

Revision: yes
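The stratified 5-fold scheme mentioned in the rebuttal as an example can be sketched without any library. This is an illustration of the general technique, not the authors' procedure; the function names, the class sizes, and the seed are assumptions. Each class is shuffled and dealt round-robin across folds so that every fold keeps roughly the overall active/inactive ratio.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=5, seed=0):
    """Split indices into k folds, dealing each class round-robin."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    folds = [[] for _ in range(k)]
    for cls in sorted(by_class):
        indices = by_class[cls]
        rng.shuffle(indices)
        for position, i in enumerate(indices):
            folds[position % k].append(i)
    return folds


def train_test_splits(labels, k=5, seed=0):
    """Yield (train_indices, test_indices), holding out one fold at a time."""
    folds = stratified_folds(labels, k, seed)
    for held_out in range(k):
        test = sorted(folds[held_out])
        train = sorted(i for f, fold in enumerate(folds)
                       if f != held_out for i in fold)
        yield train, test


# Hypothetical dataset: 10 actives and 40 inactives.
example_labels = [1] * 10 + [0] * 40
for train_idx, test_idx in train_test_splits(example_labels, k=5):
    actives_in_test = sum(example_labels[i] for i in test_idx)
    print(len(train_idx), len(test_idx), actives_in_test)  # 40 10 2 each time
```

Reporting the mean and spread of each metric across the five held-out folds would give the statistical footing the referee asks for.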

Circularity Check

0 steps flagged

No circularity: standard ML training on external database labels

The paper applies off-the-shelf classifiers (Random Forest, XGBoost, etc.) to binary activity labels sourced from ChEMBL and PubChem together with RDKit-computed physicochemical descriptors. Performance metrics and feature-importance rankings are obtained by standard train/test splits and cross-validation on those external labels; no internal equation, normalization, or self-citation is used to define the target variable or to force the reported accuracy/ROC-AUC values. The derivation chain is therefore self-contained against independent data sources and does not reduce to any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claim rests on the reliability of database-derived activity labels and the sufficiency of a small set of 2D RDKit descriptors; no new physical entities are postulated.

free parameters (2)

Model hyperparameters
Hyperparameters for Random Forest, XGBoost, SVM and logistic regression are tuned during training but not reported.
Feature selection criteria
Thresholds or methods used to retain or discard molecular descriptors after calculation are not specified.

axioms (2)

Domain assumption: Activity labels from ChEMBL and PubChem accurately reflect true neuroprotective activity without significant noise or bias.
The paper treats these public-database annotations as ground truth for training and evaluation.
Domain assumption: The four classes of RDKit descriptors (MW, LogP, TPSA, H-bond counts) capture the physicochemical features that determine activity.
No justification is given for why other descriptors (e.g., 3D shape, pharmacophore features) were omitted.

pith-pipeline@v0.9.0 · 5576 in / 1794 out tokens · 49360 ms · 2026-05-10T02:59:09.281336+00:00 · methodology
