On the Robustness of Interpretability Methods
We argue that robustness of explanations (i.e., that similar inputs should give rise to similar explanations) is a key desideratum for interpretability. We introduce metrics to quantify robustness and demonstrate that current methods do not perform well according to these metrics. Finally, we propose ways that robustness can be enforced on existing interpretability approaches.
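The abstract does not spell out its metrics, so the following is only an illustrative sketch of how "similar inputs should give rise to similar explanations" might be turned into a number. It assumes a local Lipschitz-style ratio: the largest change in the explanation divided by the change in the input, over random points sampled near x. The explanation function `gradient_times_input`, the sampling scheme, and the parameters `radius` and `n_samples` are all hypothetical choices made for this example.

```python
import math
import random


def gradient_times_input(weights, x):
    """Toy explanation for a linear model: attribution_i = w_i * x_i (assumed for illustration)."""
    return [w * xi for w, xi in zip(weights, x)]


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def local_robustness(explain, x, radius=0.1, n_samples=200, seed=0):
    """Estimate max ||e(x) - e(x')|| / ||x - x'|| over random x' near x.

    Random sampling only gives a lower bound on the true maximum.
    Lower values mean the explanation changes less for nearby inputs.
    """
    rng = random.Random(seed)
    base = explain(x)
    worst = 0.0
    for _ in range(n_samples):
        # Perturb each coordinate independently within [-radius, radius].
        neighbor = [xi + rng.uniform(-radius, radius) for xi in x]
        dx = l2_distance(x, neighbor)
        if dx == 0.0:
            continue  # skip the degenerate case of an identical sample
        worst = max(worst, l2_distance(base, explain(neighbor)) / dx)
    return worst


weights = [2.0, -1.0, 0.5]
x = [1.0, 0.0, 3.0]
score = local_robustness(lambda v: gradient_times_input(weights, v), x)
print(f"estimated local robustness ratio: {score:.3f}")
```

For this toy linear explanation the ratio is bounded above by the largest absolute weight, so the estimate cannot exceed 2.0; an explanation that ignores its input would score 0.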
Forward citations
Cited by 7 Pith papers
-
Do Fair Models Reason Fairly? Counterfactual Explanation Consistency for Procedural Fairness in Credit Decisions
Outcome-fair credit models often exhibit hidden procedural bias through inconsistent reasoning across groups, which the CEC framework mitigates by enforcing consistent feature attributions via counterfactuals.
-
Feature Attribution Stability Suite: How Stable Are Post-Hoc Attributions?
FASS benchmark shows post-hoc attributions remain unstable under geometric perturbations even after filtering for unchanged predictions, with Grad-CAM exhibiting the highest stability across ImageNet, COCO, and CIFAR-10.
-
Interpretability Can Be Actionable
Interpretability research should be judged by actionability—the degree to which its insights support concrete decisions and interventions—rather than explanatory power alone.
-
Beyond the Wrapper: Identifying Artifact Reliance in Static Malware Classifiers using TRUSTEE
Static malware classifiers learn packing artifacts and dataset composition biases rather than malicious semantics, as diagnosed by TRUSTEE interpretability across controlled dataset variations.
-
A renormalization-group inspired lattice-based framework for piecewise generalized linear models
RG-inspired lattice models for piecewise GLMs provide explicit interpretable partitions and a replica-analysis-derived scaling law for regularization that allows increasing complexity without expected rise in generali...
-
NEURON: A Neuro-symbolic System for Grounded Clinical Explainability
NEURON raises AUC from 0.74-0.77 to 0.84-0.88 on MIMIC-IV heart-failure mortality prediction while lifting human-aligned explanation scores from 0.50 to 0.85 by grounding SHAP values in SNOMED CT and patient notes via...
-
Hessian-Enhanced Token Attribution (HETA): Interpreting Autoregressive LLMs
HETA is a new attribution framework for decoder-only LLMs that combines semantic transition vectors, Hessian-based sensitivity scores, and KL divergence to produce more faithful and human-aligned token attributions th...