Rethinking XAI Evaluation: A Human-Centered Audit of Shapley Benchmarks in High-Stakes Settings
Pith reviewed 2026-05-08 12:16 UTC · model grok-4.3
The pith
Standard quantitative metrics for Shapley explanations do not predict human clarity or decision quality in high-stakes settings.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
A unified amortized framework isolates semantic differences among eight Shapley variants; across four risk datasets and 3,735 case reviews by professional analysts, quantitative proxies such as sparsity and faithfulness prove decoupled from perceived clarity and decision utility, while explanations raise analyst confidence without improving objective performance, creating a measurable risk of automation bias.
What carries the argument
The unified amortized framework that isolates semantic differences between eight Shapley variants while respecting low-latency operational constraints.
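A minimal sketch of what such an amortized setup could look like, assuming a PyTorch-style pipeline: an explainer network is trained offline to regress toward noisy Monte Carlo Shapley estimates, so that deployment-time attribution is a single low-latency forward pass. The value function, network shape, and training loop below are illustrative assumptions, not the authors' implementation; what distinguishes the Shapley variants in this framing is the choice of value function (e.g., the reference distribution used for removed features).

```python
# Illustrative amortized Shapley estimator (hypothetical; not the paper's code).
# explainer(x) is trained to regress toward noisy one-permutation Monte Carlo
# Shapley estimates, so deployment-time attribution is one forward pass.
import torch
import torch.nn as nn

def value_fn(model, x, mask, baseline):
    """v(S): scalar model score with features outside S replaced by a fixed baseline.
    Different Shapley variants correspond to different choices made here
    (e.g., marginal vs. conditional reference distributions)."""
    return model(torch.where(mask.bool(), x, baseline))

@torch.no_grad()
def one_permutation_shapley(model, x, baseline):
    """Unbiased single-sample Shapley estimate from one random feature ordering."""
    d = x.shape[-1]
    phi = torch.zeros(d)
    mask = torch.zeros(d)
    prev = value_fn(model, x, mask, baseline)
    for j in torch.randperm(d):
        mask[j] = 1.0
        curr = value_fn(model, x, mask, baseline)
        phi[j] = curr - prev
        prev = curr
    return phi

def train_amortized_explainer(model, X, baseline, epochs=10, lr=1e-3):
    """Fit explainer(x) -> phi by regression onto the noisy Monte Carlo targets."""
    d = X.shape[1]
    explainer = nn.Sequential(nn.Linear(d, 128), nn.ReLU(), nn.Linear(128, d))
    opt = torch.optim.Adam(explainer.parameters(), lr=lr)
    for _ in range(epochs):
        for x in X:
            target = one_permutation_shapley(model, x, baseline)  # noisy but unbiased
            loss = ((explainer(x) - target) ** 2).mean()
            opt.zero_grad(); loss.backward(); opt.step()
    return explainer  # inference: phi_hat = explainer(x), within latency budgets
```

In this framing the variants share the training loop and differ only in `value_fn`, which is what would let a single framework isolate their semantic differences under a shared latency budget.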
If this is right
- Current quantitative proxies are insufficient to predict the downstream human impact of an explanation method.
- Explanations can increase decision confidence without raising objective accuracy, creating a documented risk of automation bias.
- Selection of Shapley formulations and evaluation metrics for operational systems requires evidence tied to human outcomes rather than proxy scores alone.
Where Pith is reading between the lines
- New evaluation protocols may need direct measures of decision utility collected from domain experts rather than post-hoc proxy calculations.
- Similar human audits in medicine or autonomous systems could uncover parallel patterns of overreliance.
- Interface designs that surface uncertainty information alongside explanations might reduce the observed confidence inflation.
Load-bearing premise
The 3,735 analyst case reviews provide a representative proxy for real high-stakes operational risk workflows without major selection or contextual biases.
What would settle it
A controlled follow-up deployment in which analysts given Shapley explanations show measurably higher decision accuracy than a no-explanation control group on the same live risk cases.
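If such a deployment were run, the headline comparison could be as simple as a one-sided test on per-case decision correctness between the explanation and control arms. The sketch below (Python, statsmodels) uses made-up counts and ignores the clustering of repeated cases within analysts, which a mixed-effects model would handle more faithfully.

```python
# Hypothetical analysis sketch for the proposed follow-up deployment: compare objective
# decision accuracy between an explanation arm and a no-explanation control arm on the
# same live case pool. Counts below are placeholders.
import numpy as np
from statsmodels.stats.proportion import proportions_ztest, proportion_confint

def compare_arms(correct_expl, n_expl, correct_ctrl, n_ctrl, alpha=0.05):
    """One-sided test of whether the explanation arm has higher decision accuracy."""
    counts = np.array([correct_expl, correct_ctrl])
    nobs = np.array([n_expl, n_ctrl])
    z, p_one_sided = proportions_ztest(counts, nobs, alternative="larger")
    return {
        "acc_explained": correct_expl / n_expl,
        "ci_explained": proportion_confint(correct_expl, n_expl, alpha=alpha),
        "acc_control": correct_ctrl / n_ctrl,
        "ci_control": proportion_confint(correct_ctrl, n_ctrl, alpha=alpha),
        "z": z,
        "p_one_sided": p_one_sided,
    }

# Placeholder counts, only to illustrate the comparison being proposed.
print(compare_arms(correct_expl=1900, n_expl=2400, correct_ctrl=1520, n_ctrl=2000))
```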
Original abstract
Shapley values are a cornerstone of explainable AI, yet their proliferation into competing formulations has created a fragmented landscape with little consensus on practical deployment. While theoretical differences are well-documented, evaluation remains reliant on quantitative proxies whose alignment with human utility is unverified. In this work, we use a unified amortized framework to isolate semantic differences between eight Shapley variants under the low-latency constraints of operational risk workflows. We conduct a large-scale empirical evaluation across four risk datasets and a realistic fraud-detection environment involving professional analysts and 3,735 case reviews. Our results reveal a fundamental misalignment: standard quantitative metrics, such as sparsity and faithfulness, are decoupled from human-perceived clarity and decision utility. Furthermore, while no formulation improved objective analyst performance, explanations consistently increased decision confidence, signaling a critical risk of automation bias in high-stakes settings. These findings suggest that current evaluation proxies are insufficient for predicting downstream human impact, and we provide evidence-based guidance for selecting formulations and metrics in operational decision systems.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that standard quantitative metrics for evaluating Shapley value explanations (e.g., sparsity and faithfulness) are decoupled from human-perceived clarity and decision utility in high-stakes settings. Using a unified amortized framework to compare eight Shapley variants, the authors conduct a large empirical study across four risk datasets and a fraud-detection environment with professional analysts performing 3,735 case reviews. Key findings are that no formulation improved objective analyst performance but all increased decision confidence (indicating automation bias risk), and that current proxy metrics are insufficient for predicting downstream human impact.
Significance. If the central empirical results hold, the work is significant for XAI research because it supplies direct human-subject evidence that challenges reliance on proxy-based benchmarks for Shapley methods. The scale of the study (professional analysts, realistic operational workflow, multiple datasets) and the explicit demonstration of misalignment between quantitative proxies and human utility provide actionable guidance for deployment in risk-sensitive domains. The reproducible human-study design and falsifiable claims about automation bias are particular strengths.
minor comments (3)
- The unified amortized framework is central to isolating semantic differences among the eight variants; a concise table or diagram explicitly mapping each variant's key modeling choices (e.g., reference distribution, coalition sampling) to the observed human outcomes would improve readability.
- Section describing the 3,735 case reviews: the manuscript should report effect sizes and confidence intervals alongside p-values for the confidence-increase finding to allow readers to judge the practical magnitude of the automation-bias signal.
- The discussion of metric decoupling would benefit from an explicit statement of the correlation coefficients (or lack thereof) between each quantitative proxy and the human clarity/utility scores, rather than qualitative description alone; a minimal computational sketch of this and the preceding point follows this list.
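As a concrete illustration of the statistics requested in the two comments above, the sketch below computes a standardized effect size with a bootstrap interval for the confidence increase, and a Spearman rank correlation between one quantitative proxy and human ratings. All inputs are placeholders, not the paper's data.

```python
# Illustrative statistics for the two comments above (placeholder data, not the paper's):
# a standardized effect size with a bootstrap CI for the confidence increase, and a
# Spearman rank correlation between one quantitative proxy and human ratings.
import numpy as np
from scipy import stats

def cohens_d(conf_with_expl, conf_without_expl):
    """Standardized mean difference in analyst confidence (pooled SD)."""
    nx, ny = len(conf_with_expl), len(conf_without_expl)
    pooled_sd = np.sqrt(((nx - 1) * np.var(conf_with_expl, ddof=1) +
                         (ny - 1) * np.var(conf_without_expl, ddof=1)) / (nx + ny - 2))
    return (np.mean(conf_with_expl) - np.mean(conf_without_expl)) / pooled_sd

def bootstrap_ci(conf_with_expl, conf_without_expl, n_boot=5000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the effect size."""
    rng = np.random.default_rng(seed)
    draws = [cohens_d(rng.choice(conf_with_expl, len(conf_with_expl), replace=True),
                      rng.choice(conf_without_expl, len(conf_without_expl), replace=True))
             for _ in range(n_boot)]
    return np.quantile(draws, [alpha / 2, 1 - alpha / 2])

def proxy_alignment(proxy_scores, human_scores):
    """Rank correlation between a quantitative proxy (e.g., sparsity) and human ratings."""
    rho, p_value = stats.spearmanr(proxy_scores, human_scores)
    return rho, p_value
```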
Simulated Author's Rebuttal
We thank the referee for their positive assessment of the work, recognition of its significance for XAI evaluation, and recommendation for minor revision. The emphasis on the scale of the human-subject study with professional analysts and the demonstration of misalignment between proxy metrics and human utility is appreciated. No major comments were enumerated in the report.
Circularity Check
No significant circularity in empirical human-subject study
full rationale
This is a purely empirical comparative study that evaluates eight Shapley variants via a unified amortized framework and a controlled experiment with 3,735 professional analyst reviews across four datasets. There is no mathematical derivation chain, set of fitted-parameter predictions, or self-referential equation that could reduce outputs to inputs by construction. The central claims (misalignment between quantitative proxies and human utility, plus the automation-bias signal) rest directly on observed human responses and statistical comparisons, which are externally falsifiable. The framework is presented as a methodological tool with explicit design choices rather than a self-defining ansatz or uniqueness theorem. No load-bearing self-citations or renamings of known results appear in the abstract or the described structure. The work's conclusions therefore stand or fall on evidence external to its own formalism rather than being circular by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: standard assumptions of human-subject experiments, such as representative sampling and unbiased analyst responses.