Towards A Rigorous Science of Interpretable Machine Learning
28 Pith papers cite this work.
abstract
As machine learning systems become ubiquitous, there has been a surge of interest in interpretable machine learning: systems that provide explanation for their outputs. These explanations are often used to qualitatively assess other criteria such as safety or non-discrimination. However, despite the interest in interpretability, there is very little consensus on what interpretable machine learning is and how it should be measured. In this position paper, we first define interpretability and describe when interpretability is needed (and when it is not). Next, we suggest a taxonomy for rigorous evaluation and expose open questions towards a more rigorous science of interpretable machine learning.
citing papers explorer
-
Agentic-imodels: Evolving agentic interpretability tools via autoresearch
Agentic-imodels evolves scikit-learn regressors via an autoresearch loop to jointly boost predictive performance and LLM-simulatability, improving downstream agentic data science tasks by up to 73% on the BLADE benchmark.
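A minimal sketch of the joint objective, assuming a hypothetical llm_simulatability() proxy and a plain select-the-best loop in place of the paper's LLM-driven autoresearch procedure:

    # Evolve-and-select over interpretable sklearn regressors, scored by a
    # blend of predictive R^2 and a (hypothetical) simulatability proxy.
    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.linear_model import Lasso
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeRegressor

    X, y = make_regression(n_samples=500, n_features=10, noise=5.0, random_state=0)
    Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

    def llm_simulatability(model):
        # Hypothetical proxy: smaller models are easier for an LLM to simulate.
        tree = getattr(model, "tree_", None)
        if tree is not None:
            return 1.0 / (1.0 + tree.node_count)
        return 1.0 / (1.0 + np.count_nonzero(model.coef_))

    def joint_score(model):
        model.fit(Xtr, ytr)
        return model.score(Xte, yte) + 0.5 * llm_simulatability(model)

    candidates = [DecisionTreeRegressor(max_depth=d, random_state=0) for d in (2, 3, 4)]
    candidates += [Lasso(alpha=a) for a in (0.1, 1.0)]
    best = max(candidates, key=joint_score)
    print(type(best).__name__)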
-
ISAAC: Auditing Causal Reasoning in Deep Models for Drug-Target Interaction
Auditing three DTI models on the Davis benchmark with ISAAC reveals 25% relative differences in causal reasoning scores despite nearly identical AUROC values.
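One way to read the audit idea, sketched under the assumption of a simple additive intervention (the actual ISAAC scoring protocol is richer than this):

    # Intervention-style audit: perturb a feature the domain says is causal
    # and measure how much predictions move. Two models with matched AUROC
    # can differ sharply on this score.
    import numpy as np

    def causal_sensitivity(predict, X, feature_idx, delta=1.0):
        X_int = X.copy()
        X_int[:, feature_idx] += delta      # do(x_j := x_j + delta)
        return np.mean(np.abs(predict(X_int) - predict(X)))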
-
Correcting Influence: Unboxing LLM Outputs with Orthogonal Latent Spaces
A latent mediation framework with sparse autoencoders enables non-additive token-level influence attribution in LLMs by learning orthogonal features and back-propagating attributions.
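A rough sketch of the two ingredients, an L1-sparse autoencoder plus an orthogonality penalty on decoder features; layer sizes and loss weights are illustrative, not the paper's:

    import torch
    import torch.nn as nn

    class SparseAE(nn.Module):
        def __init__(self, d_model=768, d_latent=4096):
            super().__init__()
            self.enc = nn.Linear(d_model, d_latent)
            self.dec = nn.Linear(d_latent, d_model, bias=False)

        def forward(self, h):
            z = torch.relu(self.enc(h))     # sparse latent features
            return self.dec(z), z

    def loss_fn(model, h, l1=1e-3, ortho=1e-2):
        recon, z = model(h)
        W = model.dec.weight                # (d_model, d_latent)
        gram = W.T @ W                      # feature-feature overlaps
        off_diag = gram - torch.diag(torch.diagonal(gram))
        return (((recon - h) ** 2).mean()
                + l1 * z.abs().mean()
                + ortho * (off_diag ** 2).mean())   # push features orthogonal

Token-level influence then comes from back-propagating a chosen latent's activation through the frozen LLM to the input tokens.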
-
Interpretability Can Be Actionable
Interpretability research should be judged by actionability—the degree to which its insights support concrete decisions and interventions—rather than explanatory power alone.
-
Evaluating the False Trust engendered by LLM Explanations
A user study finds that LLM reasoning traces and post-hoc explanations create false trust by increasing acceptance of incorrect answers, whereas contrastive dual explanations improve users' ability to detect errors.
-
The Open-Box Fallacy: Why AI Deployment Needs a Calibrated Verification Regime
AI deployment in high-stakes areas requires domain-scoped calibrated verification with monitoring and revocation, using a proposed six-component Verification Coverage standard instead of mechanistic interpretability.
-
ShifaMind: A Multiplicative Concept Bottleneck for Interpretable ICD-10 Coding
ShifaMind achieves competitive performance with the LAAT baseline on MIMIC-IV top-50 ICD-10 coding while outperforming vanilla concept bottleneck models and providing concept-mediated explanations.
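The "multiplicative" part can be pictured as concept probabilities gating the label logits; the wiring below is a guess at the general shape, not ShifaMind's actual architecture:

    import torch
    import torch.nn as nn

    class MultiplicativeCBM(nn.Module):
        def __init__(self, d_text=768, n_concepts=64, n_codes=50):
            super().__init__()
            self.concepts = nn.Linear(d_text, n_concepts)   # interpretable layer
            self.code_head = nn.Linear(d_text, n_codes)
            self.mix = nn.Linear(n_concepts, n_codes)       # concept -> code gate

        def forward(self, h):
            c = torch.sigmoid(self.concepts(h))             # concept scores
            gate = torch.sigmoid(self.mix(c))
            return self.code_head(h) * gate, c              # concept-mediated logits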
-
Evaluation Cards for XAI Metrics
The authors introduce the XAI Evaluation Card template to standardize how XAI evaluation metrics are defined, validated, and reported.
-
NEURON: A Neuro-symbolic System for Grounded Clinical Explainability
NEURON raises AUC from 0.74–0.77 to 0.84–0.88 on MIMIC-IV heart-failure mortality prediction while lifting human-aligned explanation scores from 0.50 to 0.85 by grounding SHAP values in SNOMED CT and patient notes via a RAG-LLM pipeline.
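The grounding step might look like the following, with a hypothetical SNOMED lookup and prompt wording (the paper's RAG pipeline is assumed, not reproduced):

    import numpy as np

    # Hypothetical feature-to-concept map.
    SNOMED = {"creatinine": "SNOMED:70901006", "bnp": "SNOMED:390952000"}

    def ground_explanation(feature_names, shap_values, note_snippet, top_k=3):
        order = np.argsort(-np.abs(shap_values))[:top_k]
        lines = [
            f"{feature_names[i]} ({SNOMED.get(feature_names[i], 'unmapped')}): "
            f"SHAP={shap_values[i]:+.3f}"
            for i in order
        ]
        # An LLM would turn these grounded drivers plus retrieved note
        # snippets into a clinician-readable rationale.
        return "Top risk drivers:\n" + "\n".join(lines) + "\nContext: " + note_snippet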
-
Towards interpretable AI with quantum annealing feature selection
Quantum annealing solves a combinatorial optimization problem to select key CNN feature maps, yielding more class-disentangled explanations than GradCAM or GradCAM++.
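The selection step can be phrased as a QUBO, maximizing per-map relevance while penalizing redundant pairs; below is a classical simulated-annealing stand-in for the annealer, with relevance and redundancy assumed given as numpy arrays:

    import numpy as np

    def select_feature_maps(relevance, redundancy, penalty=0.5, steps=5000, seed=0):
        rng = np.random.default_rng(seed)
        n = len(relevance)
        x = rng.integers(0, 2, n)                   # binary selection vector

        def energy(v):
            return -relevance @ v + penalty * v @ redundancy @ v

        e, temp = energy(x), 1.0
        for t in range(steps):
            i = rng.integers(n)
            x[i] ^= 1                               # propose one bit flip
            e_new = energy(x)
            if e_new > e and rng.random() > np.exp((e - e_new) / temp):
                x[i] ^= 1                           # reject: flip back
            else:
                e = e_new
            temp = max(1e-3, 1.0 - t / steps)       # linear cooling
        return np.flatnonzero(x)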
-
Rethinking XAI Evaluation: A Human-Centered Audit of Shapley Benchmarks in High-Stakes Settings
In high-stakes settings, Shapley explanations increase analyst confidence but do not improve decision accuracy, and standard metrics fail to predict human utility.
-
From Attribution to Action: A Human-Centered Application of Activation Steering
Activation steering paired with attribution enables intervention-based debugging in vision models: all 8 interviewed experts shifted to hypothesis testing, most trusted the observed responses, and they flagged risks such as ripple effects.
-
Design Guidelines for Game-Based Refresher Training of Community Health Workers in Low-Resource Contexts
A four-year mixed-methods study of game-based systems for Indian CHWs yields eight design guidelines for sustained engagement, learning transfer, and contextual appropriateness in low-resource health training.
-
Ethical and social risks of harm from Language Models
The authors provide a detailed taxonomy of 21 risks associated with language models, covering discrimination, information leaks, misinformation, malicious applications, interaction harms, and societal impacts like job loss and environmental costs.
-
SINAPSE: A lightweight deep learning framework for accurate and explainable neutron-$\gamma$ discrimination
SINAPSE uses a dual-branch neural network with a 1D convolutional autoencoder for denoising and a classifier for neutron-gamma discrimination, trained via random augmentations on high-SNR data and validated with SHAP explanations.
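A minimal sketch of the dual-branch shape, with channel counts and kernel sizes that are illustrative only:

    import torch
    import torch.nn as nn

    class DualBranch(nn.Module):
        def __init__(self):
            super().__init__()
            self.encoder = nn.Sequential(
                nn.Conv1d(1, 16, 7, padding=3), nn.ReLU(),
                nn.Conv1d(16, 32, 5, padding=2), nn.ReLU(),
            )
            self.decoder = nn.Conv1d(32, 1, 5, padding=2)   # denoising branch
            self.classifier = nn.Sequential(                # neutron vs gamma
                nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(32, 2),
            )

        def forward(self, x):               # x: (batch, 1, samples)
            h = self.encoder(x)             # shared encoding
            return self.decoder(h), self.classifier(h)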
-
NeuroViz: Real-time Interactive Visualization of Forward and Backward Passes in Neural Network Training
NeuroViz offers interactive real-time visualization of neural network forward and backward passes, achieving top usability scores in a study with 31 participants compared to existing tools.
-
CoAX: Cognitive-Oriented Attribution eXplanation User Model of Human Understanding of AI Explanations
Cognitive models of user reasoning strategies with XAI methods on tabular data fit human forward-simulation decisions better than ML baselines and support hypothesis testing without new user studies.
-
X-NegoBox: An Explainable Privacy-Budget Negotiation Framework for Secure Peer-to-Peer Energy Data Exchange
X-NegoBox is a proposed explainable framework that negotiates privacy budgets for energy data exchange using trust, sensitivity, and purpose factors, with experiments claiming reduced leakage and higher acceptance rates.
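The three stated factors suggest a scoring rule of roughly this shape; the functional form and any weights are hypothetical, not X-NegoBox's:

    def privacy_budget(trust, sensitivity, purpose_weight, base=1.0):
        # More trust and a legitimate purpose loosen the budget;
        # higher data sensitivity tightens it.
        return base * trust * purpose_weight / (1.0 + sensitivity)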
-
From Awareness to Intent: Mitigating Silent Driving System Failures through Prospective Situation Awareness Enhancing Interfaces
Prospective situation awareness enhancing interfaces delivered via AR HUD improve takeover performance after silent automation failures, with perceptual cues most effective at raising situational awareness and system-intent messages best at building trust.
-
Domain-Specialized Object Detection via Model-Level Mixtures of Experts
A model-level mixture of experts over domain-specialized YOLO detectors, routed by a gating network, outperforms standard ensembles on BDD100K while revealing expert specialization.
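A per-image routing sketch, assuming image features and expert detections are already computed:

    import torch
    import torch.nn as nn

    class DetectorGate(nn.Module):
        def __init__(self, d_feat=512, n_experts=3):
            super().__init__()
            self.gate = nn.Linear(d_feat, n_experts)

        def forward(self, image_feat, expert_outputs):
            # image_feat: (d_feat,); expert_outputs: one detection set per expert
            weights = torch.softmax(self.gate(image_feat), dim=-1)
            return expert_outputs[int(weights.argmax())], weights

Inspecting the learned weights per scene is what surfaces the expert specialization the summary mentions.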
-
Interpretable and Explainable Surrogate Modeling for Simulations: A State-of-the-Art Survey and Perspectives on Explainable AI for Decision-Making
This survey synthesizes XAI methods with surrogate modeling workflows for simulations and outlines a research agenda to embed explainability into simulation-driven design and decision-making.
-
Governed Reasoning for Institutional AI
Cognitive Core uses nine typed cognitive primitives, a four-tier governance model with human review as an execution condition, and an endogenous audit ledger to reach 91% accuracy with zero silent errors on prior authorization appeals, outperforming ReAct and Plan-and-Solve baselines.
-
Explanation-Aware Learning for Enhanced Interpretability in Biomedical Imaging
Adding explanation supervision to training improves spatial alignment of saliency maps with clinical annotations on chest X-rays while keeping predictive accuracy comparable.
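Explanation supervision of this kind typically adds an alignment term to the task loss; a sketch using input-gradient saliency against a float annotation mask, with an illustrative weight:

    import torch
    import torch.nn.functional as F

    def explanation_aware_loss(model, x, y, mask, lam=0.1):
        x = x.clone().requires_grad_(True)
        logits = model(x)
        task = F.cross_entropy(logits, y)
        # Saliency = gradient of the predicted-class score w.r.t. pixels.
        score = logits.gather(1, y.unsqueeze(1)).sum()
        sal = torch.autograd.grad(score, x, create_graph=True)[0].abs().sum(1)
        sal = sal / (sal.amax(dim=(1, 2), keepdim=True) + 1e-8)
        return task + lam * F.mse_loss(sal, mask)   # pull saliency toward annotations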
-
LLMs Should Not Yet Be Credited with Decision Explanation
LLMs support decision prediction and rationale generation but lack evidence for genuine decision explanation, requiring stricter standards to avoid over-crediting.
-
From Trust to Appropriate Reliance: Measurement Constructs in Human-AI Decision-Making
A literature review shows that constructs for appropriate reliance on AI are fragmented, presents three views on the topic, and calls for consensus on objective metrics to enable better comparisons across studies.
-
Out of Context: Reliability in Multimodal Anomaly Detection Requires Contextual Inference
Multimodal anomaly detection must be reframed as cross-modal contextual inference that separates context from observations to define abnormality conditionally.
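The conditional reframing reduces to scoring observations against what the context predicts; a Gaussian stand-in for whatever cross-modal model supplies the conditional:

    import numpy as np

    def conditional_anomaly_score(obs, context, predict_mean, predict_std):
        # predict_mean/predict_std: context -> expected observation statistics
        mu, sigma = predict_mean(context), predict_std(context)
        return np.sum(((obs - mu) / sigma) ** 2)    # abnormal *for this context*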
-
Explainable Human Activity Recognition: A Unified Review of Concepts and Mechanisms
The paper delivers a mechanism-centric taxonomy and unified perspective on explainable human activity recognition methods across sensing modalities.
-
Event-Centric World Modeling with Memory-Augmented Retrieval for Embodied Decision-Making
An event-centric framework encodes environments as semantic events and retrieves weighted prior maneuvers from a knowledge bank to enable interpretable, physics-aware decision-making for UAVs.
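The retrieval step might look like this, with event embeddings and the maneuver bank assumed given:

    import numpy as np

    def retrieve_maneuver(event_vec, bank_vecs, bank_maneuvers, top_k=5):
        sims = bank_vecs @ event_vec / (
            np.linalg.norm(bank_vecs, axis=1) * np.linalg.norm(event_vec) + 1e-8)
        idx = np.argsort(-sims)[:top_k]
        w = np.exp(sims[idx]); w /= w.sum()         # softmax over similarities
        return w @ bank_maneuvers[idx]              # similarity-weighted prior maneuver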