Towards A Rigorous Science of Interpretable Machine Learning

Finale Doshi-Velez, Been Kim · 2017 · stat.ML · arXiv 1702.08608

Canonical reference. 71% of citing Pith papers cite this work as background.

74 Pith papers citing it

Background 71% of classified citations

abstract

As machine learning systems become ubiquitous, there has been a surge of interest in interpretable machine learning: systems that provide explanation for their outputs. These explanations are often used to qualitatively assess other criteria such as safety or non-discrimination. However, despite the interest in interpretability, there is very little consensus on what interpretable machine learning is and how it should be measured. In this position paper, we first define interpretability and describe when interpretability is needed (and when it is not). Next, we suggest a taxonomy for rigorous evaluation and expose open questions towards a more rigorous science of interpretable machine learning.

citation-role summary

background: 13 · method: 1

citation-polarity summary

background: 10 · support: 3 · use method: 1

representative citing papers

Embodied Explainability and Ontological Obstacles: Why We Struggle to Explain the Answers of Large Language Models (LLMs)

cs.HC · 2026-06-22 · unverdicted · novelty 7.0

An argument paper reframes LLM explainability as an embodied, situated practice based on Dourish and enactivist cognition, identifying ontological obstacles in internal explanations and advocating affordance-based designs.

Agentic-imodels: Evolving agentic interpretability tools via autoresearch

cs.AI · 2026-05-05 · unverdicted · novelty 7.0

Agentic-imodels evolves scikit-learn regressors via an autoresearch loop to jointly boost predictive performance and LLM-simulatability, improving downstream agentic data science tasks by up to 73% on the BLADE benchmark.

ISAAC: Auditing Causal Reasoning in Deep Models for Drug-Target Interaction

cs.LG · 2026-05-03 · unverdicted · novelty 7.0

ISAAC auditing applied to three DTI models on the Davis benchmark finds 25% relative differences in causal reasoning scores despite nearly identical AUROC values.

In-Context Symbolic Regression for Robustness-Improved Kolmogorov-Arnold Networks

cs.LG · 2026-03-16 · unverdicted · novelty 7.0

In-context symbolic regression methods improve robustness of symbolic formula recovery from KANs, cutting median OFAT test MSE by up to 99.8 percent across hyperparameter sweeps.

Results-Actionability Gap: Understanding How Practitioners Evaluate LLM Products in the Wild

cs.SE · 2026-01-25 · conditional · novelty 7.0

Qualitative study of 19 practitioners reveals ten LLM product evaluation practices and introduces the results-actionability gap as a key barrier to turning findings into improvements.

Extremal Contours: Gradient-driven contours for compact visual attribution

cs.CV · 2025-11-03 · unverdicted · novelty 7.0

A training-free method using Fourier-parameterized star-convex contours optimized via gradients to generate compact, faithful visual attributions for image classifiers on benchmarks like ImageNet.

Temporal Counterfactual Explanations of Behaviour Tree Decisions

cs.RO · 2025-09-09 · unverdicted · novelty 7.0

A method automatically constructs a causal model from behavior tree structure and domain knowledge to generate real-time causal counterfactual explanations for robot decisions.

Mechanistic Interpretability with Sparse Autoencoder Neural Operators

cs.LG · 2025-09-03 · unverdicted · novelty 7.0

SAE-NOs extend sparse autoencoders to function spaces via Fourier neural operators with concept and domain sparsity, learning localized patterns more efficiently and generalizing across discretizations on vision data.

MIMIC: Multimodal Inversion for Model Interpretation and Conceptualization

cs.CV · 2025-08-11 · unverdicted · novelty 7.0

MIMIC is a new inversion framework that recovers visual concepts from VLM internal states using joint inversion, feature alignment, and three regularizers.

Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting

cs.CL · 2023-05-07 · accept · novelty 7.0

Chain-of-thought explanations in LLMs are frequently unfaithful: models systematically omit mention of biasing prompt features that change their answers and instead produce rationalizations for those biased outputs.

Investigating Concept Alignment Using Implausible Category Members

cs.AI · 2026-05-20 · unverdicted · novelty 6.0

AI models misalign with humans on concept boundaries when probed with implausible category members, such as classifying words as vehicles or vegetables as fruit.

Interpretable Computer Vision for Defect Detection in X-ray Tomography of Aerospace SiC/SiC Composites

cs.CV · 2026-05-19 · unverdicted · novelty 6.0

p-ResNet-50 adds a prototype layer with anchor- and medoid-based regularizations to ResNet-50, achieving ROC-AUC 0.994 and accuracy 0.957 on ~12k XCT patches while supplying case-based explanations aligned to expert categories.

Bridging the Disciplinary Gap in Explainable AI: From Abstract Desiderata to Concrete Tasks

cs.CY · 2026-05-19 · unverdicted · novelty 6.0

The authors introduce a taxonomy with target, functional role, and mode of justification axes plus a framework that decomposes abstract XAI desiderata into concrete benchmarkable tasks via identified dependency structures.

Entropy-Based Characterisation of the Polarised Regime in Latent Variable Models

cs.LG · 2026-05-15 · unverdicted · novelty 6.0

An entropy criterion on mean representations characterises the polarised regime in VAEs and related models, with theoretical links to KL minimisation and empirical tests across several architectures.

Correcting Influence: Unboxing LLM Outputs with Orthogonal Latent Spaces

cs.LG · 2026-05-12 · unverdicted · novelty 6.0

A latent mediation framework with sparse autoencoders enables non-additive token-level influence attribution in LLMs by learning orthogonal features and back-propagating attributions.

Interpretability Can Be Actionable

cs.LG · 2026-05-11 · conditional · novelty 6.0

Interpretability research should be judged by actionability, the degree to which its insights support concrete decisions and interventions, rather than by explanatory power alone.

The Open-Box Fallacy: Why AI Deployment Needs a Calibrated Verification Regime

cs.AI · 2026-05-11 · unverdicted · novelty 6.0

AI deployment in high-stakes areas requires domain-scoped calibrated verification with monitoring and revocation, using a proposed six-component Verification Coverage standard instead of mechanistic interpretability.

ShifaMind: A Multiplicative Concept Bottleneck for Interpretable ICD-10 Coding

cs.LG · 2026-05-08 · unverdicted · novelty 6.0

ShifaMind achieves competitive performance with the LAAT baseline on MIMIC-IV top-50 ICD-10 coding while outperforming vanilla concept bottleneck models and providing concept-mediated explanations.

Evaluation Cards for XAI Metrics

cs.CV · 2026-05-06 · unverdicted · novelty 6.0

The authors introduce the XAI Evaluation Card template to standardize how XAI evaluation metrics are defined, validated, and reported.

NEURON: A Neuro-symbolic System for Grounded Clinical Explainability

cs.AI · 2026-05-02 · unverdicted · novelty 6.0

NEURON raises AUC from 0.74-0.77 to 0.84-0.88 on MIMIC-IV heart-failure mortality prediction while lifting human-aligned explanation scores from 0.50 to 0.85 by grounding SHAP values in SNOMED CT and patient notes via RAG-LLM.

Rethinking XAI Evaluation: A Human-Centered Audit of Shapley Benchmarks in High-Stakes Settings

cs.LG · 2026-04-24 · unverdicted · novelty 6.0

In high-stakes settings, Shapley explanations increase analyst confidence but do not improve decision accuracy, and standard metrics fail to predict human utility.

Design Guidelines for Game-Based Refresher Training of Community Health Workers in Low-Resource Contexts

cs.HC · 2026-04-06 · unverdicted · novelty 6.0

A four-year mixed-methods study of game-based systems for Indian CHWs yields eight design guidelines for sustained engagement, learning transfer, and contextual appropriateness in low-resource health training.

X-SYS: A Reference Architecture for Interactive Explanation Systems

cs.AI · 2026-02-13 · unverdicted · novelty 6.0

X-SYS is a reference architecture for interactive explanation systems organized around STAR quality attributes and five service components, demonstrated via SemanticLens for vision-language models.

Faster Verified Explanations for Neural Networks

cs.LG · 2025-11-28 · unverdicted · novelty 6.0

FaVeX accelerates verified explanations for neural networks via dynamic batch-sequential processing and query reuse while introducing verifier-optimal robust explanations that incorporate verifier incompleteness.

citing papers explorer

Showing 24 of 74 citing papers.

Interpretable Question Answering on Knowledge Bases and Text

cs.CL · 2019-06-26 · unverdicted · none · ref 7 · internal anchor

Compares LIME, input perturbation and attention for explaining QA on KB+text; proposes automatic evaluation paradigm and finds input perturbation superior in both automatic and human studies.

Artificial Adaptive Intelligence: The Missing Stage Between Narrow and General Intelligence

cs.AI · 2026-05-16 · unverdicted · none · ref 6 · internal anchor

Proposes Artificial Adaptive Intelligence as the regime between narrow and general AI, defined by elimination of human-specified hyperparameters, and introduces an adaptivity index plus parametric minimality principle grounded in minimum description length.

Explanation-Aware Learning for Enhanced Interpretability in Biomedical Imaging

cs.CV · 2026-05-11 · unverdicted · none · ref 10 · internal anchor

Adding explanation supervision to training improves spatial alignment of saliency maps with clinical annotations on chest X-rays while keeping predictive accuracy comparable.

LLMs Should Not Yet Be Credited with Decision Explanation

cs.AI · 2026-05-01 · unverdicted · none · ref 30 · internal anchor

LLMs support decision prediction and rationale generation but lack evidence for genuine decision explanation, requiring stricter standards to avoid over-crediting.

From Trust to Appropriate Reliance: Measurement Constructs in Human-AI Decision-Making

cs.HC · 2026-04-26 · unverdicted · none · ref 24 · internal anchor

A literature review shows that constructs for appropriate reliance on AI are fragmented, presents three views on the topic, and calls for consensus on objective metrics to enable better comparisons across studies.

Out of Context: Reliability in Multimodal Anomaly Detection Requires Contextual Inference

cs.LG · 2026-04-14 · unverdicted · none · ref 21 · internal anchor

Multimodal anomaly detection must be reframed as cross-modal contextual inference that separates context from observations to define abnormality conditionally.

Explainable Human Activity Recognition: A Unified Review of Concepts and Mechanisms

cs.LG · 2026-04-10 · unverdicted · none · ref 8 · internal anchor

The paper delivers a mechanism-centric taxonomy and unified perspective on explainable human activity recognition methods across sensing modalities.

Event-Centric World Modeling with Memory-Augmented Retrieval for Embodied Decision-Making

cs.LG · 2026-04-08 · unverdicted · none · ref 6 · internal anchor

An event-centric framework encodes environments as semantic events and retrieves weighted prior maneuvers from a knowledge bank to enable interpretable, physics-aware decision-making for UAVs.

Explainability Methods for Hardware Trojan Detection: A Systematic Comparison

cs.LG · 2026-01-26 · unverdicted · none · ref 20 · internal anchor

Compares domain-aware, case-based, and feature attribution explainability methods for gate-level hardware Trojan detection on the Trust-Hub benchmark dataset.

DenoGrad: A Gradient-Based Framework for Data Refinement in Tabular and Time-Series Learning

cs.AI · 2025-11-13 · unverdicted · none · ref 28 · internal anchor

DenoGrad refines noisy tabular and time-series data by optimizing inputs via gradients from a fixed model, yielding better downstream predictions on ten real-world datasets while preserving data statistics.

Use of What-if Scenarios to Help Explain Artificial Intelligence Models for Neonatal Health

cs.LG · 2024-10-12 · unverdicted · none · ref 15 · internal anchor

AIMEN trains an ensemble of neural networks on CTGAN-augmented data to predict adverse labor outcomes at 0.784 F1 and produces sparse counterfactual explanations identifying changes in two to three attributes.

Evaluating Physician-AI Interaction for Cancer Management: Paving the Path towards Precision Oncology

cs.HC · 2024-04-23 · unverdicted · none · ref 36 · internal anchor

A within-subjects study with 32 physicians using a web-based CDSS for 12 synthetic multiple myeloma scenarios found over-reliance on ML outputs when discordant with RCT evidence and poor retention of model validation details.

A Blueprint for AI-Driven Software Quality: Integrating LLMs with Established Standards

cs.SE · 2025-05-19 · unverdicted · none · ref 252 · internal anchor

Survey mapping LLM applications in software quality assurance to established standards including ISO/IEC 12207, ISO 25010, CMMI, and TMM, with case studies, challenges, and future directions.

The Role of Cooperation in Responsible AI Development

cs.CY · 2019-07-10 · unverdicted · none · ref 11 · internal anchor

Competitive pressures in AI development create collective action problems that may require industry cooperation, with key factors and strategies identified to enable responsible outcomes.

The Mass, Fake News, and Cognition Security

cs.CY · 2019-07-09 · unverdicted · none · ref 150 · internal anchor

The paper defines Cognition Security (CogSec) as a multidisciplinary field studying cognitive impacts of fake news and outlines research challenges, techniques, and future directions.

Unexplainability and Incomprehensibility of Artificial Intelligence

cs.CY · 2019-06-20 · unverdicted · none · ref 38 · internal anchor

Advanced AI systems are unexplainable in full and produce explanations that humans cannot comprehend.

The frame problem in quantitative practice: ontological uncertainty and epistemic humility in an age of automated inference

stat.ME · 2026-05-22 · unverdicted · none · ref 12 · internal anchor

A synthetic review arguing that frame (ontological) uncertainty is structurally invisible within quantitative models and drives most consequential failures in automated inference.

On the Semantic Interpretability of Artificial Intelligence Models

cs.AI · 2019-07-09 · unverdicted · none · ref 6 · internal anchor

This survey classifies semantic interpretability methods in AI models by nature and feature introduction, reviews user impact, and identifies remaining gaps.

Efficacy Analysis in Clinical Trials: A Comprehensive Review of Statistical and Machine Learning Approaches

stat.OT · 2025-11-07 · unverdicted · none · ref 158 · internal anchor

A review summarizing parametric, nonparametric, Bayesian, and machine learning methods for efficacy analysis in clinical trials and identifying gaps such as high-dimensional data and missingness.

I-SAFE: Wasserstein Coherence Metrics for Structural Auditing of Scientific AI Models

cs.LG · 2026-05-20 · unreviewed · ref 9 · internal anchor

CLIF: Concept-Level Influence Functions for Transparent Bottleneck Models

cs.CL · 2026-05-19 · unreviewed · ref 9 · internal anchor

Towards interpretable AI with quantum annealing feature selection

cs.LG · 2026-04-28 · unreviewed · ref 16 · internal anchor

From Attribution to Action: A Human-Centered Application of Activation Steering

cs.AI · 2026-04-13 · unreviewed · ref 14 · internal anchor

Decoding the Multimodal Maze: A Systematic Review on the Adoption of Explainability in Multimodal Attention-based Models

cs.LG · 2025-08-06 · unreviewed · ref 26 · internal anchor
