Recognition: unknown
ISAAC: Auditing Causal Reasoning in Deep Models for Drug-Target Interaction
Pith reviewed 2026-05-10 15:39 UTC · model grok-4.3
The pith
ISAAC reveals that deep learning models for drug-target prediction can differ substantially in causal reasoning even when their accuracy is nearly the same.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ISAAC is a post-hoc framework that evaluates prior-relative structural sensitivity by probing frozen models through matched mechanistic and spurious input-level interventions, independently of predictive accuracy. Applied to three sequence-based DTI architectures on the Davis benchmark, it reveals approximately 25 percent relative differences in reasoning scores across models with comparable AUROC (within around 3 percent), stable across training and intervention seeds and two distinct perturbation operators. These discrepancies, undetectable under conventional accuracy metrics, motivate the use of post-hoc structural auditing as a complement to standard performance evaluation in the DTI domain.
What carries the argument
ISAAC, the Intervention-based Structural Auditing Approach for Causal Reasoning, a post-hoc framework that probes frozen models with matched mechanistic and spurious input-level interventions to measure structural sensitivity independent of accuracy.
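Read operationally, the audit reduces to a simple loop over matched interventions. Below is a minimal sketch, assuming a frozen model exposed as a `model.predict(drug, protein)` probability and caller-supplied intervention operators; the paper publishes no implementation, so every name here is hypothetical.

```python
# Minimal sketch of an ISAAC-style audit loop. The model interface and the
# intervention operators are assumptions; the paper releases no code.
import numpy as np

def structural_sensitivity(model, pairs, intervene, n_trials=50, seed=0):
    """Mean absolute prediction shift when `intervene` edits the protein."""
    rng = np.random.default_rng(seed)
    shifts = []
    for drug, protein in pairs:
        base = model.predict(drug, protein)      # frozen model, no training
        for _ in range(n_trials):
            edited = intervene(protein, rng)     # one input-level intervention
            shifts.append(abs(model.predict(drug, edited) - base))
    return float(np.mean(shifts))

def audit(model, pairs, mechanistic_op, spurious_op):
    """Contrast sensitivity to matched mechanistic vs. spurious edits."""
    return {
        "mechanistic": structural_sensitivity(model, pairs, mechanistic_op),
        "spurious": structural_sensitivity(model, pairs, spurious_op),
    }
```

A model that reasons mechanistically should register a much larger shift under the mechanistic operator than under its matched spurious control.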
If this is right
- Models with nearly identical AUROC can exhibit meaningfully different reliance on mechanistic versus spurious features.
- Conventional accuracy metrics alone are insufficient to certify the scientific validity of molecular prediction models.
- Post-hoc auditing for structural sensitivity provides a practical complement to benchmark performance in scientific machine learning.
- The observed stability of reasoning scores across seeds and operators supports treating ISAAC scores as reproducible model properties.
- Sequence-based DTI architectures are not interchangeable even when their predictive accuracy matches.
Where Pith is reading between the lines
- If higher ISAAC reasoning scores correlate with better performance on unseen compound classes, the framework could serve as a selection criterion during model development.
- Adapting the same intervention-matching logic to graph-based or 3D-structure DTI models would test whether the 25 percent gap generalizes beyond sequence inputs.
- In drug discovery pipelines, models with lower ISAAC scores might be flagged for additional mechanistic validation before use in virtual screening.
- Extending ISAAC-style audits to related tasks such as protein-ligand binding affinity or toxicity prediction could expose similar hidden reasoning differences.
Load-bearing premise
The chosen input-level interventions can be reliably labeled as mechanistic versus spurious in a matched way that isolates causal reasoning in sequence-based DTI models.
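One concrete way to instantiate that matching, assuming binding-site residue indices are available from an external annotation (the rebuttal points to binding-site annotations; the point-mutation operator and the edit budget `k` below are illustrative assumptions, not the paper's construction):

```python
# Matched mechanistic vs. spurious edits: same operator, same edit count,
# differing only in whether the edited positions sit in the binding site.
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def point_mutate(seq, positions, rng):
    """Substitute each chosen residue with a different random amino acid."""
    seq = list(seq)
    for i in positions:
        alternatives = [a for a in AMINO_ACIDS if a != seq[i]]
        seq[i] = rng.choice(alternatives)
    return "".join(seq)

def matched_interventions(seq, binding_sites, rng, k=5):
    """Return a (mechanistic, spurious) pair of edited sequences."""
    sites = rng.choice(binding_sites, size=k, replace=False)
    off_sites = [i for i in range(len(seq)) if i not in set(binding_sites)]
    controls = rng.choice(off_sites, size=k, replace=False)
    return point_mutate(seq, sites, rng), point_mutate(seq, controls, rng)

# Example; the sequence and site indices are hypothetical, not real annotations.
rng = np.random.default_rng(0)
mech, spur = matched_interventions("MKVLAARGTWLLPLLALLAGSAQA",
                                   [3, 7, 11, 15, 19], rng)
```

The premise holds only if "matched" controls like these differ from mechanistic edits in location alone, not in any other statistic the model can exploit.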
What would settle it
Repeating the full ISAAC evaluation on the same three models and finding either that reasoning scores show no relative differences beyond a few percent, or that results vary strongly with the choice of perturbation operator, would falsify the reported discrepancies and their claimed stability.
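A sketch of that replication check, with placeholder scores standing in for real audit outputs; the `relative_gap` definition is one reasonable reading of "relative difference", not the paper's stated formula:

```python
import numpy as np

def relative_gap(scores):
    """Max-min spread over the mean (an assumed definition; the paper does
    not state how 'relative difference' is computed)."""
    s = np.asarray(scores, dtype=float)
    return float((s.max() - s.min()) / s.mean())

# scores[operator][seed] -> per-model reasoning scores (placeholder values)
runs = {
    "operator_A": [[0.62, 0.51, 0.48], [0.60, 0.52, 0.47]],
    "operator_B": [[0.59, 0.50, 0.46], [0.61, 0.49, 0.48]],
}
gaps = {op: [relative_gap(seed_scores) for seed_scores in seeds]
        for op, seeds in runs.items()}
print(gaps)  # stable ~25% gaps corroborate; collapse or operator disagreement falsifies
```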
Original abstract
Deep learning models for drug–target interaction (DTI) prediction often achieve strong benchmark performance without necessarily relying on mechanistically meaningful molecular features, a limitation that standard accuracy-based evaluation cannot detect. We introduce ISAAC (Intervention-based Structural Auditing Approach for Causal Reasoning), a post-hoc framework that evaluates prior-relative structural sensitivity by probing frozen models through matched mechanistic and spurious input-level interventions, independently of predictive accuracy. Applied to three sequence-based DTI architectures on the Davis benchmark, ISAAC reveals approximately 25% relative differences in reasoning scores across models with comparable AUROC (within around 3%), stable across training and intervention seeds and two distinct perturbation operators. These discrepancies, undetectable under conventional accuracy metrics, motivate the use of post-hoc structural auditing as a complement to standard performance evaluation in scientific machine learning for molecular modeling.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces ISAAC, a post-hoc auditing framework that probes frozen sequence-based DTI models with matched mechanistic and spurious input-level interventions to compute a reasoning score measuring prior-relative structural sensitivity, independent of predictive accuracy. On the Davis benchmark, it reports that three architectures with comparable AUROC (within ~3%) exhibit ~25% relative differences in reasoning scores, with stability across training/intervention seeds and two perturbation operators.
Significance. If the intervention labeling and score computation are shown to isolate causal reasoning without confounding by sequence properties, the work would usefully demonstrate that accuracy metrics alone miss important mechanistic differences in molecular ML models. The empirical gap between AUROC parity and reasoning-score divergence, plus seed stability, would be a concrete contribution to post-hoc auditing in scientific ML.
major comments (3)
- [Methods] Methods section on intervention construction and labeling: the partitioning of perturbations into mechanistic versus spurious classes lacks any quantitative validation (e.g., inter-rater agreement, ablation removing the label step, or checks against correlated features such as sequence length or hydrophobicity). This labeling step is load-bearing for the central claim that observed reasoning-score gaps reflect causal reasoning rather than sensitivity to other input statistics.
- [Results / Methods] Results and Methods on reasoning score: no explicit equation, pseudocode, or derivation is supplied for how the reasoning score is computed from the intervention outcomes (e.g., how prior-relative structural sensitivity is aggregated or normalized). Without this, it is impossible to verify the claimed independence from accuracy or to reproduce the reported 25% relative differences.
- [Experiments] Experiments on Davis benchmark: the headline result (25% reasoning-score gap with ~3% AUROC parity) is presented without an ablation that holds the label assignment fixed while varying only the model architecture, leaving open whether the gap arises from the models' causal sensitivity or from differential sensitivity to the particular perturbation operators chosen.
minor comments (2)
- [Abstract] Abstract and introduction: the phrase 'prior-relative structural sensitivity' is used without a concise definition or reference to the precise quantity being measured.
- [Figures / Tables] Figure captions and tables: stability across seeds is asserted but the number of seeds and exact variance values are not reported in the main text or captions.
Simulated Author's Rebuttal
Thank you for the opportunity to respond to the referee's report. We address each major comment point by point below, indicating revisions where appropriate to improve the clarity and rigor of the manuscript.
Point-by-point responses
- Referee: [Methods] Methods section on intervention construction and labeling: the partitioning of perturbations into mechanistic versus spurious classes lacks any quantitative validation (e.g., inter-rater agreement, ablation removing the label step, or checks against correlated features such as sequence length or hydrophobicity). This labeling step is load-bearing for the central claim that observed reasoning-score gaps reflect causal reasoning rather than sensitivity to other input statistics.
  Authors: The intervention labeling is grounded in domain expertise from molecular biology: mechanistic perturbations target residues known from prior literature to influence drug binding affinity, while spurious ones do not. We recognize the value of additional validation and will include in the revision (1) correlation checks with sequence features such as length and hydrophobicity to rule out confounding, and (2) an ablation that bypasses the labeling to assess its impact. Inter-rater agreement can be added if we consult additional experts, though the current labeling follows a deterministic rule based on binding-site annotations. (revision: partial)
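For concreteness, the correlation check promised in (1) could take roughly the following form, with per-protein prediction shifts from the audit as input; the feature set and the Kyte-Doolittle proxy are illustrative choices, not the authors' stated protocol:

```python
# Confound check: do prediction shifts track generic sequence statistics
# (length, hydropathy) rather than the mechanistic/spurious labels?
import numpy as np
from scipy.stats import pearsonr

# Kyte-Doolittle hydropathy scale
KD = {"A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5,
      "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9,
      "M": 1.9, "F": 2.8, "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9,
      "Y": -1.3, "V": 4.2}

def mean_hydropathy(seq):
    """Average Kyte-Doolittle value over recognized residues."""
    vals = [KD[a] for a in seq if a in KD]
    return float(np.mean(vals)) if vals else 0.0

def confound_report(shifts, sequences):
    """Correlate per-protein prediction shifts with sequence statistics;
    strong correlations would suggest the audit tracks confounds, not labels."""
    lengths = [len(s) for s in sequences]
    hydros = [mean_hydropathy(s) for s in sequences]
    return {"length": pearsonr(shifts, lengths),
            "hydropathy": pearsonr(shifts, hydros)}
```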
- Referee: [Results / Methods] Results and Methods on reasoning score: no explicit equation, pseudocode, or derivation is supplied for how the reasoning score is computed from the intervention outcomes (e.g., how prior-relative structural sensitivity is aggregated or normalized). Without this, it is impossible to verify the claimed independence from accuracy or to reproduce the reported 25% relative differences.
  Authors: We apologize for the omission of the explicit formulation. The reasoning score is computed as RS = (P_mech - P_spur) / P_prior, where P denotes the model's predicted interaction probability under each condition, aggregated over multiple interventions and normalized to ensure independence from baseline accuracy. We will add the full equation, pseudocode for the computation, and a derivation showing why this isolates structural sensitivity in the revised Methods section. (revision: yes)
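A direct transcription of that formula, under the assumption that aggregation is a simple mean over intervention outcomes (the rebuttal leaves the aggregation unspecified):

```python
import numpy as np

def reasoning_score(p_mech, p_spur, p_prior):
    """RS = (P_mech - P_spur) / P_prior, with mean aggregation over the
    intervention outcomes (the aggregation choice is assumed, not stated)."""
    return float((np.mean(p_mech) - np.mean(p_spur)) / np.mean(p_prior))
```

Dividing by the baseline probability makes the score scale-free in the model's operating point, which is presumably what "normalized to ensure independence from baseline accuracy" refers to.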
- Referee: [Experiments] Experiments on Davis benchmark: the headline result (25% reasoning-score gap with ~3% AUROC parity) is presented without an ablation that holds the label assignment fixed while varying only the model architecture, leaving open whether the gap arises from the models' causal sensitivity or from differential sensitivity to the particular perturbation operators chosen.
  Authors: The experimental design applies the identical set of labeled interventions to all three model architectures on the Davis benchmark, thereby holding the label assignment fixed while varying only the model. This is already the case in the reported results. To make this explicit, we will add a sentence in the Experiments section clarifying that the intervention set is shared across models. The reported stability across two perturbation operators further supports that the gaps are not due to operator choice. (revision: partial)
Circularity Check
No significant circularity detected in ISAAC derivation chain.
Full rationale
The ISAAC framework is presented as a post-hoc auditing method applied to already-trained, frozen models. It computes reasoning scores from sensitivity to explicitly defined input-level interventions (mechanistic vs. spurious) that are independent of the models' predictive accuracy or training loss. No equations, fitted parameters, or self-referential definitions appear that would make the reported 25% relative score differences a restatement of the input data or labels by construction. The abstract explicitly states independence from accuracy metrics and stability across seeds, with no load-bearing self-citations or ansatzes invoked to force the result. The central empirical claim therefore remains a non-tautological observation rather than a renaming or re-derivation of its own inputs.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Input-level perturbations can be partitioned into matched mechanistic and spurious sets that isolate causal reasoning in frozen sequence-based DTI models.
invented entities (1)
- Reasoning score: no independent evidence.
Reference graph
Works this paper leans on
- [1] Geirhos, R., Jacobsen, J.-H., Michaelis, C., Zemel, R. S., Brendel, W., Bethge, M., and Wichmann, F. Shortcut learning in deep neural networks. Nature Machine Intelligence, 2:665–673, 2020.
- [2] Ilyas, A., Santurkar, S., Tsipras, D., Engstrom, L., Tran, B., and Madry, A. Adversarial examples are not bugs, they are features. In NeurIPS, 2019.
- [3] Kirichenko, P., Izmailov, P., and Wilson, A. G. Why normalizing flows fail to detect out-of-distribution data. In NeurIPS, 2020.
- [4] Arjovsky, M., Bottou, L., Gulrajani, I., and Lopez-Paz, D. Invariant risk minimization. arXiv preprint arXiv:1907.02893, 2019.
- [5] Lundberg, S. M. and Lee, S.-I. A unified approach to interpreting model predictions. In NeurIPS, pp. 4768–4777, 2017.
- [6] Peters, J., Bühlmann, P., and Meinshausen, N. Causal inference by using invariant prediction. JRSS B, 78(5):947–1012, 2016.
- [7] Hendrycks, D. and Dietterich, T. Benchmarking neural network robustness to common corruptions and perturbations. arXiv preprint arXiv:1903.12261, 2019.
- [8] Taori, R., Dave, A., Shankar, V., Carlini, N., Recht, B., and Schmidt, L. Measuring robustness to natural distribution shifts. In NeurIPS, 2020.
- [9] Pearl, J. Causality. Cambridge University Press, 2nd edition, 2009.
- [10] Schölkopf, B., Locatello, F., Bauer, S., et al. Toward causal representation learning. Proceedings of the IEEE, 109:612–634, 2021.
- [11] Locatello, F., Poole, B., Rätsch, G., Schölkopf, B., Bachem, O., and Tschannen, M. Weakly-supervised disentanglement without compromises. In ICML, 2020.
- [12] Simonyan, K., Vedaldi, A., and Zisserman, A. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034, 2014.
- [13] Sundararajan, M., Taly, A., and Yan, Q. Axiomatic attribution for deep networks. In ICML, pp. 3319–3328, 2017.
- [14] Ribeiro, M. T., Singh, S., and Guestrin, C. Why should I trust you? In KDD, pp. 1135–1144, 2016.
- [15] Doshi-Velez, F. and Kim, B. Towards a rigorous science of interpretable machine learning. arXiv preprint arXiv:1702.08608, 2017.
- [16] Adebayo, J., Gilmer, J., Muelly, M., Goodfellow, I., Hardt, M., and Kim, B. Sanity checks for saliency maps. In NeurIPS, 2018.
- [17] Xu, X., et al. RE-IMAGINE: Symbolic benchmark synthesis for reasoning evaluation. In ICML, 2025.
- [18] Zhang, M., et al. LLMScan: Causal scan for LLM misbehavior detection. In ICML, 2025.
- [19] Öztürk, H., Özgür, A., and Özkirimli, E. DeepDTA: Deep drug–target binding affinity prediction. Bioinformatics, 34:i821–i829, 2018.
- [20] Gao, K. Y., et al. Interpretable drug target prediction. In IJCAI, pp. 3371–3377, 2018.
- [21] Li, X., et al. Deep learning for drug-drug interaction prediction. Quantitative Biology, 12:30–52, 2024.
- [22] Huang, K., Xiao, C., Glass, L., and Sun, J. MolTrans: Molecular interaction transformer. Bioinformatics, 37:830–836, 2020.
- [23] Nguyen, T., et al. GraphDTA. Bioinformatics, 37:1140–1147, 2020.
- [24] Lee, I., Keum, J., and Nam, H. DeepConv-DTI. PLOS Computational Biology, 15, 2019.
- [25] Lin, G., et al. TAPB. Nature Communications, 16, 2025.
- [26] Chen, L., et al. TransformerCPI. Bioinformatics, 36:4406–4414, 2020.
- [27] Bai, P., et al. DrugBAN. Nature Machine Intelligence, 5:126–136, 2022.
- [28] Kooistra, A. J., et al. KLIFS database. Nucleic Acids Research, 44:D365–D371, 2015.
- [29] Yu, B. and Kumbier, K. Veridical data science. PNAS, 117:3920–3929, 2020.
- [30] Geiger, A., et al. Causal abstractions of neural networks. In NeurIPS, 2021.
- [31] Vig, J., et al. Investigating gender bias using causal mediation. In NeurIPS, 2020.
- [32] Ivanovs, M., Kadikis, R., and Ozols, K. Perturbation-based methods for explaining neural networks. Pattern Recognition Letters, 150:228–234, 2021.