What Molecular Structure Cannot Tell Us: A Taxonomy of Explainability Gaps in GNN-Based Drug Toxicity Prediction
Pith reviewed 2026-06-29 19:32 UTC · model grok-4.3
The pith
Molecular graphs capture only 5 of 11 known adverse effects of aspirin.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
An analysis that pairs a Message Passing Neural Network trained on Tox21 with GNNExplainer atom attributions links only five of eleven documented adverse effects of acetylsalicylic acid to its molecular graph. The study introduces a Gap Taxonomy with four categories: principally non-encodable effects (GAP-1), data gaps from Missing Not At Random (MNAR) mechanisms (GAP-2), assay panel mismatches (GAP-3), and representation errors (GAP-4). The MNAR component is quantified by a ChEMBL query that returns zero bioactivity entries across 42 documented assays, and an attention pooling test localizes the representation error to the message passing layers rather than the aggregation step.
What carries the argument
The Gap Taxonomy (GAP-1 to GAP-4) that partitions structural information limits into non-encodable effects, MNAR data gaps, assay mismatches, and message-passing representation errors.
If this is right
- Only the subset of adverse effects that are structurally inferable can be addressed by any graph-based toxicity model.
- The taxonomy applies to every structure-based prediction method, not only GNNs.
- Complete drug safety assessment requires data sources beyond molecular graphs for the remaining gap categories.
- Representation error in MPNNs arises inside the message passing layers rather than during final aggregation.
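The logic behind the last point can be illustrated with a toy sketch. This is not the paper's MPNN: the mean-aggregation update, the three-atom chain, the attention query, and the function names `message_passing`, `sum_readout`, and `attention_readout` are all assumptions made for illustration. Both readouts consume the same atom states, so once message passing has blurred the atoms together, swapping sum pooling for attention pooling cannot restore the lost distinction.

```python
import math

def message_passing(features, adjacency, steps):
    """Toy update: each atom becomes the mean of itself and its bonded neighbours."""
    h = [list(v) for v in features]
    for _ in range(steps):
        updated = []
        for i, row in enumerate(adjacency):
            group = [h[i]] + [h[j] for j, bonded in enumerate(row) if bonded]
            updated.append([sum(col) / len(group) for col in zip(*group)])
        h = updated
    return h

def sum_readout(h):
    """Graph vector as the plain sum of atom states."""
    return [sum(col) for col in zip(*h)]

def attention_readout(h, query):
    """Graph vector as a softmax-weighted sum of atom states."""
    scores = [sum(a * q for a, q in zip(v, query)) for v in h]
    top = max(scores)  # subtract the maximum for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]
    pooled = [sum(w * v[k] for w, v in zip(weights, h)) for k in range(len(query))]
    return pooled, weights

# Hypothetical three-atom chain 0-1-2 with two made-up features per atom.
features = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
adjacency = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
query = [5.0, -5.0]  # made-up attention query

shallow = message_passing(features, adjacency, steps=1)
deep = message_passing(features, adjacency, steps=50)

_, shallow_weights = attention_readout(shallow, query)
_, deep_weights = attention_readout(deep, query)
print("attention weights after 1 step:  ", [round(w, 3) for w in shallow_weights])
print("attention weights after 50 steps:", [round(w, 3) for w in deep_weights])
```

After many rounds of this toy update the atom states become nearly identical, so the attention weights collapse toward uniform regardless of the query. A readout swap that leaves an error unchanged therefore points upstream, which is the kind of inference the attention pooling experiment draws.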
Where Pith is reading between the lines
- The same gap categories are likely to appear for other well-characterized drugs.
- Safety models could be extended to output both a prediction and the gap category that applies to each endpoint.
- Regulatory frameworks may need explicit provisions for effects that lie outside structural inference.
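One way the second idea might look in code is sketched below. The `GapCategory` enum mirrors the four categories named in the paper, but the `EndpointReport` record, its field names, and the example endpoints are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class GapCategory(Enum):
    GAP_1 = "principally non-encodable effect"
    GAP_2 = "MNAR data gap"
    GAP_3 = "assay panel mismatch"
    GAP_4 = "representation error"

@dataclass(frozen=True)
class EndpointReport:
    """Hypothetical per-endpoint output: a prediction plus the gap that limits it."""
    endpoint: str
    probability: Optional[float]  # None when no structure-based prediction is offered
    gap: Optional[GapCategory]    # None when the endpoint is treated as structurally inferable

    def is_interpretable(self) -> bool:
        """A score is only meant to be read at face value when no gap applies."""
        return self.gap is None and self.probability is not None

# Placeholder endpoints and scores, invented for illustration only.
reports = [
    EndpointReport("endpoint_a", probability=0.82, gap=None),
    EndpointReport("endpoint_b", probability=0.10, gap=GapCategory.GAP_3),
    EndpointReport("endpoint_c", probability=None, gap=GapCategory.GAP_1),
]

for report in reports:
    note = "inferable" if report.gap is None else report.gap.value
    print(f"{report.endpoint}: p={report.probability} ({note})")
```

Carrying the gap label next to the score keeps a low predicted probability from being read as evidence of safety when the endpoint lies outside what the structure can encode.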
Load-bearing premise
The set of eleven known adverse effects for aspirin is treated as complete and representative, and GNNExplainer attributions are assumed to correctly separate structurally inferable effects from the rest.
What would settle it
Re-running the attribution analysis with a different explainer, or against a larger adverse-effect list, and checking whether more than five of the eleven effects are then attributed to structural features.
Original abstract
Not all clinically relevant adverse effects are structurally inferable from molecular graphs - regardless of model quality or architectural complexity. This study introduces an operational taxonomy of the structural information limits that prevent structure-based toxicity prediction, independent of the learning algorithm employed. Graph Neural Networks (GNNs) have emerged as a natural approach for molecular toxicity prediction, operating directly on atomic connectivity without the information loss inherent to fixed-length fingerprints. However, the fraction of a drug's known pharmacological profile that is actually inferable from molecular structure remains systematically underexplored. A systematic case study using acetylsalicylic acid (ASA, Aspirin) - one of the most comprehensively characterized drugs in pharmacology - serves as model compound. A Message Passing Neural Network (MPNN) is trained on the Tox21 benchmark and GNNExplainer is applied to characterize atom-level attribution. Results indicate that molecular structure explains approximately 45% (5/11) of known ASA adverse effects. A four-category Gap Taxonomy (GAP-1 through GAP-4) is introduced distinguishing between principally non-encodable effects, data gaps arising from Missing Not At Random (MNAR) mechanisms, assay panel mismatches, and representation errors. The MNAR gap is empirically quantified via a systematic ChEMBL query (42 documented assays, 0 retrievable bioactivity entries). An attention pooling experiment localizes the representation error to the MPNN message passing layers rather than the aggregation step. The Gap Taxonomy has direct implications for drug safety signal detection and regulatory frameworks including Good Pharmacovigilance Practice (GVP) guidelines and New Approach Methodologies (NAMs). Structural limits identified are confirmed in a companion DDI ablation study.
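The MNAR idea in the abstract can be made concrete with a small synthetic simulation. It does not reproduce the ChEMBL query: the "assay results" are random numbers, and the recording probabilities and the helper `observed_mean` are assumptions chosen only to show why data that are Missing Not At Random bias whatever is left to learn from.

```python
import random

def observed_mean(values, keep_probability, rng):
    """Mean of the values that survive a recording step.

    keep_probability(value) returns the chance that a value is recorded at all.
    """
    kept = [v for v in values if rng.random() < keep_probability(v)]
    return sum(kept) / len(kept)

rng = random.Random(0)
true_values = [rng.gauss(0.0, 1.0) for _ in range(10_000)]  # synthetic results
true_mean = sum(true_values) / len(true_values)

# Missing completely at random: recording does not depend on the value.
mcar_mean = observed_mean(true_values, lambda v: 0.5, rng)

# Missing not at random: recording depends on the otherwise unobserved value.
mnar_mean = observed_mean(true_values, lambda v: 0.9 if v > 0 else 0.1, rng)

print(f"true mean: {true_mean:+.3f}")
print(f"MCAR mean: {mcar_mean:+.3f}")
print(f"MNAR mean: {mnar_mean:+.3f}")
```

Under the value-independent scheme the observed mean stays close to the true mean, while under the value-dependent scheme it shifts markedly. The recorded data alone give no sign of that shift.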
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that molecular structure is fundamentally limited in explaining drug adverse effects, independent of model architecture. Using acetylsalicylic acid (ASA) as a case study, an MPNN trained on Tox21 is paired with GNNExplainer to conclude that structure accounts for only ~45% (5/11) of known ASA adverse effects. A four-category Gap Taxonomy (GAP-1: principally non-encodable; GAP-2: MNAR data gaps; GAP-3: assay panel mismatches; GAP-4: representation errors) is introduced, with the MNAR component quantified by a ChEMBL query (42 assays, zero bioactivity entries) and representation error localized to message-passing layers via an attention-pooling ablation. The taxonomy is presented as algorithm-independent and relevant to GVP guidelines and NAMs, with confirmation claimed from a companion DDI study.
Significance. If the taxonomy and 45% quantification prove robust across models, the work would usefully bound expectations for structure-based toxicity models and inform when negative predictions should not be over-interpreted. Credit is due for the systematic ChEMBL query that concretely demonstrates an MNAR gap and for the attention-pooling experiment that isolates the representation issue to the message-passing stage rather than aggregation. These elements provide a reproducible template for similar gap analyses even if the specific 5/11 split requires further grounding.
major comments (2)
- [Abstract and Results] The central claim that the Gap Taxonomy is 'independent of the learning algorithm employed' is undercut by the empirical 5/11 quantification, which rests exclusively on GNNExplainer attributions from one MPNN trained on Tox21. Absence of atom attributions does not demonstrate that an effect is principally non-encodable from structure; it may reflect incomplete coverage of relevant mechanisms in Tox21, insufficient propagation in message passing, or known incompleteness of GNNExplainer. No cross-model or cross-explainer validation is reported to support algorithm independence.
- [Results (ASA case study)] The classification of 6 of the 11 adverse effects as non-inferable depends on post-hoc assignment of GNNExplainer outputs to taxonomy categories without reported inter-rater reliability, sensitivity analysis on explainer hyperparameters, or comparison against a null model. This makes the 45% figure and the operational definitions of GAP-1 through GAP-4 load-bearing yet fragile; the taxonomy categories appear tailored to the observed attributions rather than independently validated.
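As a sketch of what the missing inter-rater reliability statistic could look like, Cohen's kappa compares the agreement observed between two raters with the agreement expected by chance. The labels below are invented for illustration and do not reflect any assignment made in the paper.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty item list")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: probability that both raters pick the same label independently.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)

# Hypothetical labels for four adverse effects (not taken from the paper).
rater_a = ["inferable", "inferable", "GAP-1", "GAP-2"]
rater_b = ["inferable", "GAP-1", "GAP-1", "GAP-2"]

kappa = cohens_kappa(rater_a, rater_b)
print(f"kappa = {kappa:.3f}")
```

With only eleven items, any such statistic would carry wide uncertainty, so it would complement rather than replace the explicit assignment criteria.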
minor comments (1)
- [Abstract] The companion DDI ablation study is invoked for confirmation but receives no methodological detail or citation; if present in the full manuscript, ensure it is explicitly cross-referenced so readers can assess whether it addresses the model-dependence concern.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and for highlighting the value of the ChEMBL query and attention-pooling ablation. We address the two major comments point by point below, proposing targeted revisions to strengthen the manuscript while preserving its core contribution.
- Referee: [Abstract and Results] The central claim that the Gap Taxonomy is 'independent of the learning algorithm employed' is undercut by the empirical 5/11 quantification, which rests exclusively on GNNExplainer attributions from one MPNN trained on Tox21. Absence of atom attributions does not demonstrate that an effect is principally non-encodable from structure; it may reflect incomplete coverage of relevant mechanisms in Tox21, insufficient propagation in message passing, or known incompleteness of GNNExplainer. No cross-model or cross-explainer validation is reported to support algorithm independence.
Authors: We agree that the specific 5/11 mapping in the ASA case study depends on attributions from a single MPNN and GNNExplainer, and that non-attribution cannot alone prove a gap is principally non-encodable. The taxonomy categories themselves are defined from pharmacological mechanisms, data availability patterns, and assay characteristics rather than from any particular model's outputs. However, the empirical illustration of the taxonomy does rely on this one pipeline. We will revise the abstract, introduction, and discussion to state that the taxonomy is proposed as a general conceptual framework whose categories are intended to be model-agnostic, while clarifying that the 5/11 quantification is an illustrative case study whose precise split may vary with model or explainer choice. A new limitations paragraph will explicitly note the absence of cross-model validation and recommend it as future work. revision: partial
- Referee: [Results (ASA case study)] The classification of 6 of the 11 adverse effects as non-inferable depends on post-hoc assignment of GNNExplainer outputs to taxonomy categories without reported inter-rater reliability, sensitivity analysis on explainer hyperparameters, or comparison against a null model. This makes the 45% figure and the operational definitions of GAP-1 through GAP-4 load-bearing yet fragile; the taxonomy categories appear tailored to the observed attributions rather than independently validated.
Authors: The assignments were performed by the authors using explicit criteria linking each adverse effect's known mechanism to the presence or absence of relevant atom attributions and to the four gap definitions. We acknowledge that no inter-rater reliability statistic, hyperparameter sensitivity sweep, or null-model comparison was reported. We will add a sensitivity analysis varying GNNExplainer hyperparameters (learning rate, epochs, and mask size) and report how the 5/11 count changes. We will also include a brief comparison against a random-attribution baseline. Because the taxonomy was developed from first principles before the case study was run, we do not believe the categories were post-hoc tailored, but we will make the development sequence clearer in the methods. revision: yes
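The promised sensitivity sweep and random-attribution baseline could be organized along the following lines. This is a sketch under stated assumptions, not the authors' procedure: the attribution scores, the atom indices, and the hyperparameter values are made up, and `permutation_p_value` is a hypothetical helper that asks how often randomly shuffled attributions put at least as much weight on a chosen substructure as the observed ones.

```python
import itertools
import random

def substructure_score(attributions, atom_indices):
    """Mean attribution over the atoms of a substructure of interest."""
    return sum(attributions[i] for i in atom_indices) / len(atom_indices)

def permutation_p_value(attributions, atom_indices, n_permutations=2000, seed=0):
    """Share of shuffled attributions scoring at least as high as the observed ones."""
    rng = random.Random(seed)
    observed = substructure_score(attributions, atom_indices)
    shuffled = list(attributions)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if substructure_score(shuffled, atom_indices) >= observed:
            hits += 1
    return (hits + 1) / (n_permutations + 1)  # add-one correction avoids p = 0

# Made-up attribution scores for 13 heavy atoms; indices 0-2 are a hypothetical substructure.
focused = [0.9, 0.8, 0.85] + [0.1] * 10
uniform = [0.5] * 13
substructure = [0, 1, 2]

p_focused = permutation_p_value(focused, substructure)
p_uniform = permutation_p_value(uniform, substructure)
print(f"focused attributions: p = {p_focused:.4f}")
print(f"uniform attributions: p = {p_uniform:.4f}")

# Hypothetical grid over the three hyperparameters named in the rebuttal.
learning_rates = [0.001, 0.01, 0.1]
epoch_counts = [100, 200]
mask_sizes = [3, 5]
grid = list(itertools.product(learning_rates, epoch_counts, mask_sizes))
print(f"{len(grid)} explainer configurations to re-run")
```

Repeating the effect-to-category assignment for each configuration in the grid, and reporting how often the 5/11 count survives, would show whether the headline figure is stable or an artifact of one explainer setting.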
Circularity Check
No significant circularity; empirical results rest on external data and model application
The paper's 5/11 quantification and Gap Taxonomy derive from applying GNNExplainer attributions to an MPNN trained on the independent Tox21 benchmark, combined with a direct ChEMBL database query for MNAR effects and an attention pooling ablation. These steps use external benchmarks and observed model outputs rather than defining the taxonomy or results in terms of themselves. No self-citations, fitted parameters renamed as predictions, or definitional loops appear in the derivation chain. The taxonomy organizes the empirical gaps without forcing the 45% figure by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The 11 documented adverse effects for ASA constitute a complete and unbiased reference set for measuring structural explainability.
invented entities (1)
- GAP-1 to GAP-4 taxonomy: no independent evidence
Forward citations
Cited by 1 Pith paper
- From Detection to Mechanism: Cross-Attention Graph Neural Networks Enable Drug-Drug Interaction Type Prediction - An Ablation Study with Acetylsalicylic Acid Validation
  Cross-attention GNNs raise multi-class F1-macro for DDI type prediction by 0.186 over concatenation baselines on a 38k-pair benchmark, with 10/10 held-out ASA validation success versus 0/10 for a ternary model.