Recognition: 2 theorem links · Lean Theorem
Towards A Rigorous Science of Interpretable Machine Learning
Pith reviewed 2026-05-12 12:00 UTC · model grok-4.3
The pith
Interpretability in machine learning lacks a shared definition and an agreed way to measure it, so this position paper supplies both, along with guidance on when explanations are actually required.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors state that despite the rising use of explanations to assess safety and non-discrimination, interpretable machine learning has no consensus definition and no agreed evaluation standards. They define interpretability as the ability to explain or to present in understandable terms to a human. They specify when such explanations are needed and when they are not. They then offer a taxonomy that structures evaluation approaches and enumerate open questions that must be resolved to place the field on a more scientific footing.
What carries the argument
The taxonomy for rigorous evaluation, which groups methods according to the form of explanation provided and the property being verified, such as safety or fairness.
If this is right
- Different interpretable models can be compared using the same set of evaluation criteria.
- Studies that test explanations will produce results that can be reproduced by other groups.
- Developers will know more precisely when to add explanations and when an unexplained black-box model suffices.
- Research effort will shift toward answering the listed open questions with measurable experiments.
- Assessments of safety or non-discrimination will rest on evaluation procedures that can be scrutinized.
Where Pith is reading between the lines
- The taxonomy could be turned into reporting guidelines that conferences require for papers claiming interpretability benefits.
- Empirical tests could apply the taxonomy to several existing explanation methods on the same dataset to check whether quality rankings remain stable; a rough sketch of such a stability check follows this list.
- Links to human-subject studies would help determine which explanation formats are understandable to domain experts versus lay users.
- The same structure might later be used to evaluate interpretability claims in sequential decision systems such as reinforcement learning agents.
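A minimal sketch of what such a stability check could look like, assuming hypothetical explanation methods (saliency, rules, prototypes), hypothetical taxonomy-derived criteria, and made-up scores; it uses Kendall's tau from SciPy as one possible agreement measure, not anything prescribed by the paper.

```python
# Hypothetical sketch: do quality rankings of explanation methods stay stable
# across different taxonomy-derived evaluation criteria?
# Method names, criteria, and scores below are illustrative placeholders.
from itertools import combinations
from scipy.stats import kendalltau

# One score per (criterion, method); in a real study these would come from
# human-subject experiments or proxy tasks run on the same dataset.
scores = {
    "proxy_task_accuracy":  {"saliency": 0.71, "rules": 0.84, "prototypes": 0.78},
    "human_simulatability": {"saliency": 0.55, "rules": 0.80, "prototypes": 0.74},
    "time_to_decision":     {"saliency": 0.60, "rules": 0.77, "prototypes": 0.81},
}

methods = sorted(next(iter(scores.values())))

def ranks(criterion):
    """Rank of each method (0 = best) under one criterion."""
    ordered = sorted(methods, key=scores[criterion].get, reverse=True)
    return [ordered.index(m) for m in methods]

# High Kendall's tau for every pair of criteria means the method ranking is
# stable regardless of which evaluation criterion is used.
for a, b in combinations(scores, 2):
    tau, _ = kendalltau(ranks(a), ranks(b))
    print(f"{a} vs {b}: tau = {tau:+.2f}")
```

If the pairwise correlations stay high, the ordering of methods does not depend on which evaluation criterion from the taxonomy is applied; low or negative values would suggest the criteria measure genuinely different things.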
Load-bearing premise
That a common definition together with this taxonomy will produce enough agreement for evaluations of interpretability to become consistent and reproducible.
What would settle it
A later survey showing that published work continues to employ incompatible definitions of interpretability and never references the proposed taxonomy would indicate that the offered structure has not created the expected consensus.
read the original abstract
As machine learning systems become ubiquitous, there has been a surge of interest in interpretable machine learning: systems that provide explanation for their outputs. These explanations are often used to qualitatively assess other criteria such as safety or non-discrimination. However, despite the interest in interpretability, there is very little consensus on what interpretable machine learning is and how it should be measured. In this position paper, we first define interpretability and describe when interpretability is needed (and when it is not). Next, we suggest a taxonomy for rigorous evaluation and expose open questions towards a more rigorous science of interpretable machine learning.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. This position paper argues that despite surging interest in interpretable machine learning, there remains little consensus on its definition or measurement. The authors first supply a working definition of interpretability and delineate the conditions under which it is required (versus when it is not). They then outline a high-level taxonomy of evaluation approaches and enumerate open questions intended to move the field toward more rigorous, scientific practices.
Significance. If the supplied definition and taxonomy are taken up by the community, the work could function as a useful organizing framework that reduces ad-hoc explanation practices and focuses attention on measurable evaluation criteria. Its primary contribution is conceptual and taxonomic rather than empirical; the explicit listing of open questions is a constructive acknowledgment that the proposed structure is a starting point rather than a finished protocol. The paper therefore earns credit for clarity of framing and for avoiding over-claim.
minor comments (3)
- [Section 4] Taxonomy: the evaluation categories are introduced at a high level of abstraction; adding one or two concrete examples or references for each category would make the taxonomy more immediately usable for readers seeking to apply it.
- [Abstract] The abstract states that a taxonomy is suggested but does not name or briefly characterize its main branches, which makes it harder for a quick reader to grasp the paper's central organizational contribution.
- [Section 5] Open Questions: several questions are listed as standalone items; grouping them thematically (e.g., under 'evaluation metrics,' 'human-subject protocols,' or 'application domains') would improve readability and signal priorities.
Simulated Author's Rebuttal
We thank the referee for the careful reading and positive assessment of our position paper. The summary accurately reflects our goals: to supply a working definition of interpretability, clarify when it is (and is not) required, and propose a taxonomy for its evaluation while highlighting open research questions. We appreciate the recognition that the contribution is primarily conceptual and taxonomic. Given the recommendation for minor revision and the absence of specific major comments, we will incorporate minor editorial improvements for clarity and flow in the revised manuscript.
Circularity Check
No significant circularity identified
full rationale
This position paper supplies a working definition of interpretability, conditions for its use, a high-level taxonomy of evaluation methods, and a list of open questions. It contains no mathematical derivations, fitted parameters, predictions, or equations that could reduce to prior quantities by construction. The central contribution is explicitly framed as definitional and taxonomic rather than as a completed derivation or empirical result; the text acknowledges remaining gaps and does not invoke self-citations or uniqueness theorems as load-bearing premises. Consequently the argument is self-contained against external benchmarks and exhibits none of the enumerated circularity patterns.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Explanations from ML systems can be used to qualitatively assess criteria such as safety or non-discrimination.
Lean theorems connected to this paper
- IndisputableMonolith.Foundation.DAlembert.Inevitabilitybilinear_family_forced · unclear · "there is very little consensus on what interpretable machine learning is and how it should be measured. In this position paper, we first define interpretability and describe when interpretability is needed (and when it is not). Next, we suggest a taxonomy for rigorous evaluation"
Forward citations
Cited by 28 Pith papers
-
Agentic-imodels: Evolving agentic interpretability tools via autoresearch
Agentic-imodels evolves scikit-learn regressors via an autoresearch loop to jointly boost predictive performance and LLM-simulatability, improving downstream agentic data science tasks by up to 73% on the BLADE benchmark.
-
ISAAC: Auditing Causal Reasoning in Deep Models for Drug-Target Interaction
ISAAC auditing applied to three DTI models on the Davis benchmark finds 25% relative differences in causal reasoning scores despite nearly identical AUROC values.
-
Correcting Influence: Unboxing LLM Outputs with Orthogonal Latent Spaces
A latent mediation framework with sparse autoencoders enables non-additive token-level influence attribution in LLMs by learning orthogonal features and back-propagating attributions.
-
Interpretability Can Be Actionable
Interpretability research should be judged by actionability—the degree to which its insights support concrete decisions and interventions—rather than explanatory power alone.
-
Evaluating the False Trust engendered by LLM Explanations
A user study finds that LLM reasoning traces and post-hoc explanations create false trust by increasing acceptance of incorrect answers, whereas contrastive dual explanations improve users' ability to detect errors.
-
The Open-Box Fallacy: Why AI Deployment Needs a Calibrated Verification Regime
AI deployment in high-stakes areas requires domain-scoped calibrated verification with monitoring and revocation, using a proposed six-component Verification Coverage standard instead of mechanistic interpretability.
-
ShifaMind: A Multiplicative Concept Bottleneck for Interpretable ICD-10 Coding
ShifaMind achieves competitive performance with the LAAT baseline on MIMIC-IV top-50 ICD-10 coding while outperforming vanilla concept bottleneck models and providing concept-mediated explanations.
-
Evaluation Cards for XAI Metrics
The authors introduce the XAI Evaluation Card template to standardize how XAI evaluation metrics are defined, validated, and reported.
-
NEURON: A Neuro-symbolic System for Grounded Clinical Explainability
NEURON raises AUC from 0.74-0.77 to 0.84-0.88 on MIMIC-IV heart-failure mortality prediction while lifting human-aligned explanation scores from 0.50 to 0.85 by grounding SHAP values in SNOMED CT and patient notes via...
-
Towards interpretable AI with quantum annealing feature selection
Quantum annealing solves a combinatorial optimization problem to select key CNN feature maps, yielding more class-disentangled explanations than GradCAM or GradCAM++.
-
Rethinking XAI Evaluation: A Human-Centered Audit of Shapley Benchmarks in High-Stakes Settings
In high-stakes settings, Shapley explanations increase analyst confidence but do not improve decision accuracy, and standard metrics fail to predict human utility.
-
From Attribution to Action: A Human-Centered Application of Activation Steering
Activation steering paired with attribution enables intervention-based debugging in vision models, as all 8 interviewed experts shifted to hypothesis testing, most trusted observed responses, and highlighted risks lik...
-
Design Guidelines for Game-Based Refresher Training of Community Health Workers in Low-Resource Contexts
A four-year mixed-methods study of game-based systems for Indian CHWs yields eight design guidelines for sustained engagement, learning transfer, and contextual appropriateness in low-resource health training.
-
Ethical and social risks of harm from Language Models
The authors provide a detailed taxonomy of 21 risks associated with language models, covering discrimination, information leaks, misinformation, malicious applications, interaction harms, and societal impacts like job...
-
SINAPSE: A lightweight deep learning framework for accurate and explainable neutron-$\gamma$ discrimination
SINAPSE uses a dual-branch neural network with a 1D convolutional autoencoder for denoising and a classifier for neutron-gamma discrimination, trained via random augmentations on high-SNR data and validated with SHAP ...
-
NeuroViz: Real-time Interactive Visualization of Forward and Backward Passes in Neural Network Training
NeuroViz offers interactive real-time visualization of neural network forward and backward passes, achieving top usability scores in a study with 31 participants compared to existing tools.
-
CoAX: Cognitive-Oriented Attribution eXplanation User Model of Human Understanding of AI Explanations
Cognitive models of user reasoning strategies with XAI methods on tabular data fit human forward-simulation decisions better than ML baselines and support hypothesis testing without new user studies.
-
X-NegoBox: An Explainable Privacy-Budget Negotiation Framework for Secure Peer-to-Peer Energy Data Exchange
X-NegoBox is a proposed explainable framework that negotiates privacy budgets for energy data exchange using trust, sensitivity, and purpose factors, with experiments claiming reduced leakage and higher acceptance rates.
-
From Awareness to Intent: Mitigating Silent Driving System Failures through Prospective Situation Awareness Enhancing Interfaces
Prospective situation awareness enhancing interfaces delivered via AR HUD improve takeover performance after silent automation failures, with perceptual cues most effective at raising situational awareness and system-...
-
Domain-Specialized Object Detection via Model-Level Mixtures of Experts
Model-level MoE of domain-specialized YOLO detectors with gating network outperforms standard ensembles on BDD100K while revealing expert specialization.
-
Interpretable and Explainable Surrogate Modeling for Simulations: A State-of-the-Art Survey and Perspectives on Explainable AI for Decision-Making
This survey synthesizes XAI methods with surrogate modeling workflows for simulations and outlines a research agenda to embed explainability into simulation-driven design and decision-making.
-
Governed Reasoning for Institutional AI
Cognitive Core uses nine typed cognitive primitives, a four-tier governance model with human review as an execution condition, and an endogenous audit ledger to reach 91% accuracy with zero silent errors on prior auth...
-
Explanation-Aware Learning for Enhanced Interpretability in Biomedical Imaging
Adding explanation supervision to training improves spatial alignment of saliency maps with clinical annotations on chest X-rays while keeping predictive accuracy comparable.
-
LLMs Should Not Yet Be Credited with Decision Explanation
LLMs support decision prediction and rationale generation but lack evidence for genuine decision explanation, requiring stricter standards to avoid over-crediting.
-
From Trust to Appropriate Reliance: Measurement Constructs in Human-AI Decision-Making
A literature review shows that constructs for appropriate reliance on AI are fragmented, presents three views on the topic, and calls for consensus on objective metrics to enable better comparisons across studies.
-
Out of Context: Reliability in Multimodal Anomaly Detection Requires Contextual Inference
Multimodal anomaly detection must be reframed as cross-modal contextual inference that separates context from observations to define abnormality conditionally.
-
Explainable Human Activity Recognition: A Unified Review of Concepts and Mechanisms
The paper delivers a mechanism-centric taxonomy and unified perspective on explainable human activity recognition methods across sensing modalities.
-
Event-Centric World Modeling with Memory-Augmented Retrieval for Embodied Decision-Making
An event-centric framework encodes environments as semantic events and retrieves weighted prior maneuvers from a knowledge bank to enable interpretable, physics-aware decision-making for UAVs.
Reference graph
Works this paper leans on
- [1] Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané. Concrete problems in AI safety. arXiv preprint arXiv:1606.06565, 2016.
- [2] Pedro Antunes, Valeria Herskovic, Sergio F Ochoa, and Jose A Pino. Structuring dimensions for collaborative systems evaluation. ACM Computing Surveys, 2012.
- [3] William Bechtel and Adele Abrahamsen. Explanation: A mechanist alternative. Studies in History and Philosophy of Science Part C: Studies in History and Philosophy of Biological and Biomedical Sciences, 2005.
- [4] Catherine Blake and Christopher J Merz. UCI repository of machine learning databases. 1998.
- [5] Nick Bostrom and Eliezer Yudkowsky. The ethics of artificial intelligence. The Cambridge Handbook of Artificial Intelligence, 2014.
- [6] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.
- [7] Cristian Buciluǎ, Rich Caruana, and Alexandru Niculescu-Mizil. Model compression. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2006.
- [8] Samuel Carton, Jennifer Helsby, Kenneth Joseph, Ayesha Mahmud, Youngsoo Park, Joe Walsh, Crystal Cody, CPT Estella Patterson, Lauren Haynes, and Rayid Ghani. Identifying police officers at risk of adverse events. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2016.
- [9] Jonathan Chang, Jordan L Boyd-Graber, Sean Gerrish, Chong Wang, and David M Blei. Reading tea leaves: How humans interpret topic models. In NIPS, 2009.
- [10] Nick Chater and Mike Oaksford. Speculations on human causal learning and reasoning. Information Sampling and Adaptive Cognition, 2006.
- [11] Finale Doshi-Velez, Yaorong Ge, and Isaac Kohane. Comorbidity clusters in autism spectrum disorders: an electronic health record time-series analysis. Pediatrics, 133(1): e54--e63, 2014.
- [12] Finale Doshi-Velez, Byron Wallace, and Ryan Adams. Graph-sparse LDA: a topic model with structured sparsity. Association for the Advancement of Artificial Intelligence, 2015.
- [13] Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard Zemel. Fairness through awareness. In Innovations in Theoretical Computer Science Conference. ACM, 2012.
- [14] Alex Freitas. Comprehensible classification models: a position paper. ACM SIGKDD Explorations, 2014.
- [15] Vikas K Garg and Adam Tauman Kalai. Meta-unsupervised-learning: A supervised approach to unsupervised learning. arXiv preprint arXiv:1612.09030, 2016.
- [16] Stuart Glennan. Rethinking mechanistic explanation. Philosophy of Science, 2002.
- [17] Bryce Goodman and Seth Flaxman. European Union regulations on algorithmic decision-making and a "right to explanation". arXiv preprint arXiv:1606.08813, 2016.
- [18] Maya Gupta, Andrew Cotter, Jan Pfeifer, Konstantin Voevodski, Kevin Canini, Alexander Mangylov, Wojciech Moczydlowski, and Alexander Van Esbroeck. Monotonic calibrated interpolated look-up tables. Journal of Machine Learning Research, 2016.
- [19] Sean Hamill. CMU computer won poker battle over humans by statistically significant margin. http://www.post-gazette.com/business/tech-news/2017/01/31/CMU-computer-won-poker-battle-over-humans-by-statistically-significant-margin/stories/201701310250, 2017. Accessed: 2017-02-07.
- [20] Moritz Hardt and Kunal Talwar. On the geometry of differential privacy. In ACM Symposium on Theory of Computing. ACM, 2010.
- [21] Moritz Hardt, Eric Price, and Nati Srebro. Equality of opportunity in supervised learning. In Advances in Neural Information Processing Systems, 2016.
- [22] Carl Hempel and Paul Oppenheim. Studies in the logic of explanation. Philosophy of Science, 1948.
- [23] Tin Kam Ho and Mitra Basu. Complexity measures of supervised classification problems. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2002.
- [24] Frank Keil. Explanation and understanding. Annu. Rev. Psychol., 2006.
- [25] Frank Keil, Leonid Rozenblit, and Candice Mills. What lies beneath? Understanding the limits of understanding. Thinking and Seeing: Visual Metacognition in Adults and Children, 2004.
- [26] Been Kim, Caleb Chacha, and Julie Shah. Inferring robot task plans from human team meetings: A generative modeling approach with logic-based prior. Association for the Advancement of Artificial Intelligence, 2013.
- [27] Been Kim, Elena Glassman, Brittney Johnson, and Julie Shah. iBCM: Interactive Bayesian case model empowering humans via intuitive interaction. 2015a.
- [28] Been Kim, Julie Shah, and Finale Doshi-Velez. Mind the gap: A generative approach to interpretable feature selection and extraction. In Advances in Neural Information Processing Systems, 2015b.
- [29] Himabindu Lakkaraju, Stephen H Bach, and Jure Leskovec. Interpretable decision sets: A joint framework for description and prediction. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1675--1684. ACM, 2016.
- [30] Jonathan Lazar, Jinjuan Heidi Feng, and Harry Hochheiser. Research Methods in Human-Computer Interaction. John Wiley & Sons, 2010.
- [31] Tao Lei, Regina Barzilay, and Tommi Jaakkola. Rationalizing neural predictions. arXiv preprint arXiv:1606.04155, 2016.
- [32] Tania Lombrozo. The structure and function of explanations. Trends in Cognitive Sciences, 10(10): 464--470, 2006.
- [33] Yin Lou, Rich Caruana, and Johannes Gehrke. Intelligible models for classification and regression. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2012.
- [34] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
- [35]
- [36] Clemens Otte. Safe and interpretable machine learning: A methodological review. In Computational Intelligence in Intelligent Data Analysis. Springer, 2013.
- [37] Parliament and Council of the European Union. General Data Protection Regulation. 2016.
- [38] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. "Why should I trust you?": Explaining the predictions of any classifier. arXiv preprint arXiv:1602.04938, 2016.
- [39] Salvatore Ruggieri, Dino Pedreschi, and Franco Turini. Data mining for discrimination discovery. ACM Transactions on Knowledge Discovery from Data, 2010.
- [40] Eric Schulz, Joshua Tenenbaum, David Duvenaud, Maarten Speekenbrink, and Samuel Gershman. Compositional inductive biases in function learning. bioRxiv, 2016.
- [41] D Sculley, Gary Holt, Daniel Golovin, Eugene Davydov, Todd Phillips, Dietmar Ebner, Vinay Chaudhary, Michael Young, Jean-François Crespo, and Dan Dennison. Hidden technical debt in machine learning systems. In Advances in Neural Information Processing Systems, 2015.
- [42] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 2016.
- [43] Lior Jacob Strahilevitz. Privacy versus antidiscrimination. University of Chicago Law School Working Paper, 2008.
- [44] Adi Suissa-Peleg, Daniel Haehn, Seymour Knowles-Barley, Verena Kaynig, Thouis R Jones, Alyssa Wilson, Richard Schalek, Jeffery W Lichtman, and Hanspeter Pfister. Automatic neural reconstruction from petavoxel of electron microscopy data. Microscopy and Microanalysis, 2016.
- [45] Vincent Toubiana, Arvind Narayanan, Dan Boneh, Helen Nissenbaum, and Solon Barocas. Adnostic: Privacy preserving targeted advertising. 2010.
- [46] Joaquin Vanschoren, Jan N Van Rijn, Bernd Bischl, and Luis Torgo. OpenML: networked science in machine learning. ACM SIGKDD Explorations Newsletter, 15(2): 49--60, 2014.
- [47] Kush Varshney and Homa Alemzadeh. On the safety of machine learning: Cyber-physical systems, decision sciences, and data products. CoRR, 2016.
- [48] Fulton Wang and Cynthia Rudin. Falling rule lists. In AISTATS, 2015.
- [49] Tong Wang, Cynthia Rudin, Finale Doshi-Velez, Yimin Liu, Erica Klampfl, and Perry MacNeille. Bayesian rule sets for interpretable classification. In International Conference on Data Mining, 2017.
- [50] Joseph Jay Williams, Juho Kim, Anna Rafferty, Samuel Maldonado, Krzysztof Z Gajos, Walter S Lasecki, and Neil Heffernan. AXIS: Generating explanations at scale with learnersourcing and machine learning. In ACM Conference on Learning@Scale. ACM, 2016.
- [51] Andrew Wilson, Christoph Dann, Chris Lucas, and Eric Xing. The human kernel. In Advances in Neural Information Processing Systems, 2015.