What Molecular Structure Cannot Tell Us: A Taxonomy of Explainability Gaps in GNN-Based Drug Toxicity Prediction

Juergen Dietrich

arxiv: 2605.26183 · v2 · pith:DDMNN3J6new · submitted 2026-05-25 · 🧬 q-bio.QM · cs.LG

Pith reviewed 2026-06-29 19:32 UTC · model grok-4.3

classification 🧬 q-bio.QM cs.LG

keywords explainability gaps · GNN · drug toxicity prediction · molecular graphs · adverse effects · aspirin · gap taxonomy · MNAR

The pith

Molecular graphs capture only 5 of 11 known adverse effects of aspirin.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that molecular structure alone cannot explain all clinically observed adverse drug effects, regardless of how sophisticated the prediction model is. Through a detailed case study of aspirin, it finds that only about 45 percent of the drug's known side effects can be traced back to features in its molecular graph. To organize these limits, the work defines a four-category taxonomy of gaps that separate what structure can encode from what is seen in practice. This distinction matters because it clarifies the boundary between what graph-based models can and cannot deliver for toxicity prediction.

Core claim

A Message Passing Neural Network trained on Tox21 data, followed by GNNExplainer atom attributions, attributes only five of eleven documented adverse effects of acetylsalicylic acid to its molecular graph. The study introduces a Gap Taxonomy with four categories: principally non-encodable effects (GAP-1), data gaps from Missing Not At Random mechanisms (GAP-2), assay panel mismatches (GAP-3), and representation errors (GAP-4). The MNAR component is quantified by a ChEMBL query that returns zero bioactivity entries across 42 assays, and an attention pooling test localizes representation error to the message passing layers.

What carries the argument

The Gap Taxonomy (GAP-1 to GAP-4) that partitions structural information limits into non-encodable effects, MNAR data gaps, assay mismatches, and message-passing representation errors.
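
The four categories and the headline fraction can be written down as a small data structure. The sketch below is illustrative only: the effect labels and their gap assignments are placeholders, since this page does not list which of the eleven effects falls into which category. Only the category names and the 5 versus 6 split come from the paper; `Gap` and `structural_coverage` are hypothetical names.

```python
from enum import Enum
from typing import Dict, Optional


class Gap(Enum):
    """The four gap categories named in the paper."""
    GAP_1 = "principally non-encodable effect"
    GAP_2 = "MNAR data gap"
    GAP_3 = "assay panel mismatch"
    GAP_4 = "representation error"


def structural_coverage(assignments: Dict[str, Optional[Gap]]) -> float:
    """Fraction of effects with no gap, i.e. attributed to the molecular graph."""
    if not assignments:
        raise ValueError("no adverse effects supplied")
    explained = sum(1 for gap in assignments.values() if gap is None)
    return explained / len(assignments)


# Placeholder labels and assignments (hypothetical): 5 effects without a gap,
# 6 effects spread arbitrarily over the four categories.
example: Dict[str, Optional[Gap]] = {f"effect_{i}": None for i in range(1, 6)}
example.update({
    "effect_6": Gap.GAP_1,
    "effect_7": Gap.GAP_1,
    "effect_8": Gap.GAP_2,
    "effect_9": Gap.GAP_2,
    "effect_10": Gap.GAP_3,
    "effect_11": Gap.GAP_4,
})

coverage = structural_coverage(example)
print(f"{coverage:.1%}")  # prints 45.5%
```

Keeping the gap label next to each effect, rather than a bare yes/no flag, makes it possible to report why an effect is out of reach as well as how many are.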

If this is right

Only the subset of adverse effects that are structurally inferable can be addressed by any graph-based toxicity model.
The taxonomy applies to every structure-based prediction method, not only GNNs.
Complete drug safety assessment requires data sources beyond molecular graphs for the remaining gap categories.
Representation error in MPNNs arises inside the message passing layers rather than during final aggregation.
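
The last consequence depends on separating the readout from the layers before it. A minimal sketch of the two readouts in plain Python, assuming per-atom embeddings have already been computed: mean pooling weights every atom equally, while attention pooling uses softmax weights derived from a score. The function names, the scoring vector, and the toy numbers are assumptions for illustration, not the paper's implementation.

```python
import math
from typing import List

Vector = List[float]


def mean_pool(node_embeddings: List[Vector]) -> Vector:
    """Readout that averages atom embeddings with equal weight."""
    n = len(node_embeddings)
    dim = len(node_embeddings[0])
    return [sum(h[d] for h in node_embeddings) / n for d in range(dim)]


def attention_weights(node_embeddings: List[Vector], score_vector: Vector) -> List[float]:
    """Softmax over per-atom scores; each score is a dot product with score_vector."""
    scores = [sum(a * b for a, b in zip(h, score_vector)) for h in node_embeddings]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(node_embeddings: List[Vector], score_vector: Vector) -> Vector:
    """Readout that sums atom embeddings weighted by attention."""
    weights = attention_weights(node_embeddings, score_vector)
    dim = len(node_embeddings[0])
    return [sum(w * h[d] for w, h in zip(weights, node_embeddings)) for d in range(dim)]


# Toy embeddings for three atoms (hypothetical values).
atoms = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

uniform = attention_pool(atoms, [0.0, 0.0])  # zero scores give equal weights
focused = attention_pool(atoms, [5.0, 5.0])  # favours the third atom
```

With a zero score vector the attention readout reduces to the mean, so any difference between the two readouts comes from the weights alone. If swapping the readout leaves the missing attributions unchanged, the information was presumably lost upstream in the atom embeddings, which is the reading the paper gives.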

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same gap categories are likely to appear for other well-characterized drugs.
Safety models could be extended to output both a prediction and the gap category that applies to each endpoint.
Regulatory frameworks may need explicit provisions for effects that lie outside structural inference.

Load-bearing premise

The set of eleven known adverse effects for aspirin is treated as complete and representative, and GNNExplainer attributions are assumed to correctly separate structurally inferable effects from the rest.

What would settle it

Re-running the attribution analysis with a different explainer, or against a larger adverse-effect list, and finding that more than five of the original eleven effects are attributed to structural features.

Figures

Figures reproduced from arXiv: 2605.26183 by Juergen Dietrich.

Original abstract

Not all clinically relevant adverse effects are structurally inferable from molecular graphs - regardless of model quality or architectural complexity. This study introduces an operational taxonomy of the structural information limits that prevent structure-based toxicity prediction, independent of the learning algorithm employed. Graph Neural Networks (GNNs) have emerged as a natural approach for molecular toxicity prediction, operating directly on atomic connectivity without the information loss inherent to fixed-length fingerprints. However, the fraction of a drug's known pharmacological profile that is actually inferable from molecular structure remains systematically underexplored. A systematic case study using acetylsalicylic acid (ASA, Aspirin) - one of the most comprehensively characterized drugs in pharmacology - serves as model compound. A Message Passing Neural Network (MPNN) is trained on the Tox21 benchmark and GNNExplainer is applied to characterize atom-level attribution. Results indicate that molecular structure explains approximately 45% (5/11) of known ASA adverse effects. A four-category Gap Taxonomy (GAP-1 through GAP-4) is introduced distinguishing between principally non-encodable effects, data gaps arising from Missing Not At Random (MNAR) mechanisms, assay panel mismatches, and representation errors. The MNAR gap is empirically quantified via a systematic ChEMBL query (42 documented assays, 0 retrievable bioactivity entries). An attention pooling experiment localizes the representation error to the MPNN message passing layers rather than the aggregation step. The Gap Taxonomy has direct implications for drug safety signal detection and regulatory frameworks including Good Pharmacovigilance Practice (GVP) guidelines and New Approach Methodologies (NAMs). Structural limits identified are confirmed in a companion DDI ablation study.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

The paper gives a clear four-category taxonomy for why GNNs miss some toxicity effects, but the 5/11 number for aspirin rests on GNNExplainer attributions from one Tox21 model and does not prove those effects are non-inferable from structure.

The paper introduces a taxonomy of four gaps that keep molecular graphs from explaining all known adverse effects, illustrated on aspirin where they report structure accounts for 5 of 11 effects.

It does a decent job laying out the categories: principally non-encodable effects, missing-not-at-random data gaps, assay mismatches, and representation errors inside the model. The ChEMBL query showing zero bioactivity entries across 42 assays gives a concrete example of the data gap. The attention pooling check is a reasonable step to locate the representation problem in the message-passing layers rather than the readout.
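
The data-gap check amounts to comparing the assays documented for a compound with the bioactivity entries that can actually be retrieved for them. A minimal sketch of that bookkeeping, with no network access: the assay identifiers are placeholders and `retrieved` stands in for whatever a real ChEMBL query would return, so only the counts 42 and 0 are taken from the paper.

```python
from typing import Dict, Iterable, List, Tuple


def coverage_summary(
    documented_assays: Iterable[str],
    retrieved_entries: Dict[str, List[dict]],
) -> Tuple[int, int, List[str]]:
    """Return (assays documented, bioactivity entries retrieved, assays with no entry)."""
    assays = list(documented_assays)
    n_entries = sum(len(retrieved_entries.get(a, [])) for a in assays)
    empty = [a for a in assays if not retrieved_entries.get(a)]
    return len(assays), n_entries, empty


# Placeholder identifiers (hypothetical); a real run would use ChEMBL assay IDs.
documented = [f"ASSAY_{i:02d}" for i in range(1, 43)]
retrieved: Dict[str, List[dict]] = {}  # the query returned nothing

n_assays, n_entries, empty = coverage_summary(documented, retrieved)
print(n_assays, n_entries, len(empty))  # prints: 42 0 42
```

A count of zero says the entries are absent, not why. Calling the gap MNAR additionally assumes that the absence depends on the unobserved values themselves, which this tally cannot show.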

The central 5/11 split is the soft spot. It comes from applying GNNExplainer to an MPNN trained on Tox21 and counting effects that receive no relevant atom attributions. Six effects get none, so they are treated as non-inferable. This does not follow. Tox21 endpoints may not cover the relevant mechanisms, message passing can fail to propagate certain substructures, and GNNExplainer is known to be incomplete and sensitive to its settings. The paper states the gaps are independent of the learning algorithm, yet the result is tied to this specific model and explainer.

The list of 11 aspirin effects is also taken as given without discussion of completeness or selection.

This is for computational toxicologists and regulators working on New Approach Methodologies who want a structured way to talk about what structure-based models will always miss. A reader focused on conceptual framing rather than a finished method will get value from the categories.

It deserves peer review so the taxonomy can be checked on more compounds and with other explainers.

Referee Report

2 major / 1 minor

Summary. The paper claims that molecular structure is fundamentally limited in explaining drug adverse effects, independent of model architecture. Using acetylsalicylic acid (ASA) as a case study, an MPNN trained on Tox21 is paired with GNNExplainer to conclude that structure accounts for only ~45% (5/11) of known ASA adverse effects. A four-category Gap Taxonomy (GAP-1: principally non-encodable; GAP-2: MNAR data gaps; GAP-3: assay panel mismatches; GAP-4: representation errors) is introduced, with the MNAR component quantified by a ChEMBL query (42 assays, zero bioactivity entries) and representation error localized to message-passing layers via an attention-pooling ablation. The taxonomy is presented as algorithm-independent and relevant to GVP guidelines and NAMs, with confirmation claimed from a companion DDI study.

Significance. If the taxonomy and 45% quantification prove robust across models, the work would usefully bound expectations for structure-based toxicity models and inform when negative predictions should not be over-interpreted. Credit is due for the systematic ChEMBL query that concretely demonstrates an MNAR gap and for the attention-pooling experiment that isolates the representation issue to the message-passing stage rather than aggregation. These elements provide a reproducible template for similar gap analyses even if the specific 5/11 split requires further grounding.

major comments (2)

[Abstract and Results] The central claim that the Gap Taxonomy is 'independent of the learning algorithm employed' is undercut by the empirical 5/11 quantification, which rests exclusively on GNNExplainer attributions from one MPNN trained on Tox21. Absence of atom attributions does not demonstrate that an effect is principally non-encodable from structure; it may reflect incomplete coverage of relevant mechanisms in Tox21, insufficient propagation in message passing, or known incompleteness of GNNExplainer. No cross-model or cross-explainer validation is reported to support algorithm independence.
[Results (ASA case study)] The classification of 6 of the 11 adverse effects as non-inferable depends on post-hoc assignment of GNNExplainer outputs to taxonomy categories without reported inter-rater reliability, sensitivity analysis on explainer hyperparameters, or comparison against a null model. This makes the 45% figure and the operational definitions of GAP-1 through GAP-4 load-bearing yet fragile; the taxonomy categories appear tailored to the observed attributions rather than independently validated.
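
One of the alternative explanations in the first comment, insufficient propagation, has a simple structural reading: after k rounds of message passing, an atom's embedding can only depend on atoms at most k bonds away. The sketch below computes that receptive field on a toy chain graph. The graph, the function name `receptive_field`, and the round counts are assumptions for illustration and say nothing about the depth of the paper's MPNN.

```python
from typing import Dict, List, Set


def receptive_field(adjacency: Dict[int, List[int]], atom: int, rounds: int) -> Set[int]:
    """Atoms whose features can reach `atom` after `rounds` message passing steps."""
    reached = {atom}
    frontier = {atom}
    for _ in range(rounds):
        # Each round pulls in the not-yet-seen neighbours of the current frontier.
        frontier = {nbr for node in frontier for nbr in adjacency[node]} - reached
        if not frontier:
            break
        reached |= frontier
    return reached


# Toy chain of six atoms, 0-1-2-3-4-5 (hypothetical, not aspirin).
chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}

two_rounds = receptive_field(chain, atom=0, rounds=2)
five_rounds = receptive_field(chain, atom=0, rounds=5)
print(sorted(two_rounds))   # prints: [0, 1, 2]
print(sorted(five_rounds))  # prints: [0, 1, 2, 3, 4, 5]
```

An effect that depends on two substructures farther apart than the network's depth could go unattributed for that reason alone, whether or not the structure encodes it.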

minor comments (1)

[Abstract] The companion DDI ablation study is invoked for confirmation but receives no methodological detail or citation; if present in the full manuscript, ensure it is explicitly cross-referenced so readers can assess whether it addresses the model-dependence concern.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for highlighting the value of the ChEMBL query and attention-pooling ablation. We address the two major comments point by point below, proposing targeted revisions to strengthen the manuscript while preserving its core contribution.

Referee: [Abstract and Results] The central claim that the Gap Taxonomy is 'independent of the learning algorithm employed' is undercut by the empirical 5/11 quantification, which rests exclusively on GNNExplainer attributions from one MPNN trained on Tox21. Absence of atom attributions does not demonstrate that an effect is principally non-encodable from structure; it may reflect incomplete coverage of relevant mechanisms in Tox21, insufficient propagation in message passing, or known incompleteness of GNNExplainer. No cross-model or cross-explainer validation is reported to support algorithm independence.

Authors: We agree that the specific 5/11 mapping in the ASA case study depends on attributions from a single MPNN and GNNExplainer, and that non-attribution cannot alone prove a gap is principally non-encodable. The taxonomy categories themselves are defined from pharmacological mechanisms, data availability patterns, and assay characteristics rather than from any particular model's outputs. However, the empirical illustration of the taxonomy does rely on this one pipeline. We will revise the abstract, introduction, and discussion to state that the taxonomy is proposed as a general conceptual framework whose categories are intended to be model-agnostic, while clarifying that the 5/11 quantification is an illustrative case study whose precise split may vary with model or explainer choice. A new limitations paragraph will explicitly note the absence of cross-model validation and recommend it as future work.
revision: partial
Referee: [Results (ASA case study)] The classification of 6 of the 11 adverse effects as non-inferable depends on post-hoc assignment of GNNExplainer outputs to taxonomy categories without reported inter-rater reliability, sensitivity analysis on explainer hyperparameters, or comparison against a null model. This makes the 45% figure and the operational definitions of GAP-1 through GAP-4 load-bearing yet fragile; the taxonomy categories appear tailored to the observed attributions rather than independently validated.

Authors: The assignments were performed by the authors using explicit criteria linking each adverse effect's known mechanism to the presence or absence of relevant atom attributions and to the four gap definitions. We acknowledge that no inter-rater reliability statistic, hyperparameter sensitivity sweep, or null-model comparison was reported. We will add a sensitivity analysis varying GNNExplainer hyperparameters (learning rate, epochs, and mask size) and report how the 5/11 count changes. We will also include a brief comparison against a random-attribution baseline. Because the taxonomy was developed from first principles before the case study was run, we do not believe the categories were post-hoc tailored, but we will make the development sequence clearer in the methods.
revision: yes
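
The sensitivity analysis and random-attribution baseline promised in the second response could take roughly the following shape. This is a sketch under stated assumptions, not the authors' protocol: `count_attributed`, the threshold rule, the toy mask, and the uniform random baseline are all hypothetical. The threshold sweep is a stand-in for the explainer hyperparameter sweep, and a real run would replace the toy mask with GNNExplainer output for each setting.

```python
import random
from typing import Dict, List

# Each effect maps to the atom indices believed relevant to its mechanism
# (hypothetical toy data, not the paper's effect list).
RELEVANT_ATOMS: Dict[str, List[int]] = {
    "effect_a": [0, 1],
    "effect_b": [2],
    "effect_c": [3, 4],
}


def count_attributed(mask: List[float], threshold: float) -> int:
    """Number of effects with at least one relevant atom scored at or above threshold."""
    return sum(
        any(mask[i] >= threshold for i in atoms)
        for atoms in RELEVANT_ATOMS.values()
    )


def random_baseline(n_atoms: int, threshold: float, trials: int, seed: int = 0) -> float:
    """Mean count obtained from uniform random masks, as a null reference."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        mask = [rng.random() for _ in range(n_atoms)]
        total += count_attributed(mask, threshold)
    return total / trials


# Toy attribution mask over five atoms (hypothetical values).
mask = [0.9, 0.1, 0.4, 0.05, 0.2]

# Sweep the threshold to see how the attributed count moves.
sweep = {t: count_attributed(mask, t) for t in (0.1, 0.3, 0.5)}
baseline = random_baseline(n_atoms=5, threshold=0.5, trials=1000)
print(sweep)  # prints: {0.1: 3, 0.3: 2, 0.5: 1}
```

In the toy data the count falls from 3 to 1 as the threshold rises, which is the kind of movement a sensitivity table would need to report. If the observed count sits inside the spread of the random baseline, the split carries little evidential weight.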

Circularity Check

0 steps flagged

No significant circularity; empirical results rest on external data and model application

The paper's 5/11 quantification and Gap Taxonomy derive from applying GNNExplainer to an MPNN trained on the independent Tox21 benchmark, combined with a direct ChEMBL database query for the MNAR gap and an attention pooling ablation. These steps use external benchmarks and observed model outputs rather than defining the taxonomy or results in terms of themselves. The one self-citation, the companion DDI ablation study, is invoked only as confirmation; no self-citations, fitted parameters renamed as predictions, or definitional loops appear in the derivation chain. The taxonomy organizes the empirical gaps without forcing the 45% figure by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the assumption that the 11 adverse effects form a complete reference set and that attribution methods correctly isolate structural information; the taxonomy categories are introduced without external validation.

axioms (1)

domain assumption: The 11 documented adverse effects for ASA constitute a complete and unbiased reference set for measuring structural explainability.
Directly used to compute the 5/11 fraction in the abstract.

invented entities (1)

GAP-1 to GAP-4 taxonomy · no independent evidence
purpose: Classify structural information limits in toxicity prediction
New categories defined by the authors to organize observed gaps; no independent evidence provided.

pith-pipeline@v0.9.1-grok · 5838 in / 1213 out tokens · 45577 ms · 2026-06-29T19:32:17.911529+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

From Detection to Mechanism: Cross-Attention Graph Neural Networks Enable Drug-Drug Interaction Type Prediction - An Ablation Study with Acetylsalicylic Acid Validation
cs.LG · 2026-05 · unverdicted · novelty 4.0

Cross-attention GNNs raise multi-class F1-macro for DDI type prediction by 0.186 over concatenation baselines on a 38k-pair benchmark, with 10/10 held-out ASA validation success versus 0/10 for a ternary model.

Reference graph

Works this paper leans on

15 extracted references · 2 canonical work pages · cited by 1 Pith paper · 2 internal anchors

[1] Michael M Bronstein, Joan Bruna, Yann LeCun, Arthur Szlam, and Pierre Vandergheynst. Geometric deep learning: Going beyond Euclidean data. IEEE Signal Processing Magazine, 34(4):18–42, 2017.
[2] David K Duvenaud, Dougal Maclaurin, Jorge Iparraguirre, et al. Convolutional networks on graphs for learning molecular fingerprints. In Advances in Neural Information Processing Systems, volume 28, 2015.
[3] Justin Gilmer, Samuel S Schoenholz, Patrick F Riley, Oriol Vinyals, and George E Dahl. Neural message passing for quantum chemistry. In Proceedings of the 34th International Conference on Machine Learning, pages 1263–1272, 2017.
[4] Andreas Mayr, Günter Klambauer, Thomas Unterthiner, and Sepp Hochreiter. DeepTox: Toxicity prediction using deep learning. Frontiers in Environmental Science, 3:80, 2016.
[5] Kevin Yang, Kyle Swanson, Wengong Jin, et al. Analyzing learned molecular representations for property prediction. Journal of Chemical Information and Modeling, 59(8):3370–3388, 2019.
[6] Zhitao Ying, Dylan Bourgeois, Jiaxuan You, Marinka Zitnik, and Jure Leskovec. GNNExplainer: Generating explanations for graph neural networks. In Advances in Neural Information Processing Systems, volume 32, 2019.
[7] Donald B Rubin. Inference and missing data. Biometrika, 63(3):581–592, 1976.
[8] Robert P Sheridan. Time-split cross-validation as a method for estimating the goodness of prospective prediction. Journal of Chemical Information and Modeling, 53(4):783–790, 2013.
[9] Jan Wenzel, Hans Matter, and Frank Schmidt. Predictive multitask deep neural network models for ADME-Tox properties. Journal of Chemical Information and Modeling, 59(3):1253–1268, 2019.
[10] Yujia Li, Daniel Tarlow, Marc Brockschmidt, and Richard Zemel. Gated graph sequence neural networks. arXiv preprint arXiv:1511.05493, 2015.
[11] David Mendez, Anna Gaulton, A Patricia Bento, et al. ChEMBL: towards direct deposition of bioassay data. Nucleic Acids Research, 47(D1):D930–D940, 2019.
[12] Greg Landrum. RDKit: Open-source cheminformatics. rdkit.org, 2006.
[13] David S Wishart, Yannick D Feunang, An Chi Guo, et al. DrugBank 5.0: A major update to the DrugBank database for 2018. Nucleic Acids Research, 46(D1):D1074–D1082, 2018.
[14] J. Dietrich. From detection to mechanism: Cross-attention graph neural networks enable drug-drug interaction type prediction - an ablation study with acetylsalicylic acid validation. arXiv preprint arXiv:2605.27861, 2026.
[15] Martijn Oldenhof, Adam Arany, Yves Moreau, and Jaak Simm. Optimizing drug-target interaction prediction with federated learning. Nature Communications, 14:4064, 2023.
