pith. machine review for the scientific record.

arxiv: 2605.04410 · v1 · submitted 2026-05-06 · 💻 cs.CV · cs.AI · cs.CY · cs.LG

Recognition: 2 theorem links

Evaluation Cards for XAI Metrics

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 17:47 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.CY · cs.LG
keywords explainable AI · XAI metrics · evaluation standardization · documentation templates · model cards · metric validation · research accountability · gaming risks

The pith

The XAI Evaluation Card is a documentation template that requires explicit reporting of target properties, assumptions, validations, risks, and failures for any new explainable AI metric.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

XAI metric evaluations currently suffer from inconsistent definitions, incomplete reporting, and rare validation against baselines. The paper introduces the XAI Evaluation Card as a standardized template to accompany every study proposing a new metric. The template mandates clear declarations on target properties, grounding levels, assumptions, validation evidence, gaming risks, and known failure cases. A sympathetic reader would care because standardized reporting would make it easier to compare, trust, and build upon XAI methods. If adopted as a norm, the approach would cut evaluation fragmentation, enable meta-analyses across papers, and raise accountability for metric quality.

Core claim

The paper claims that lack of standardization in XAI evaluations stems from inconsistently defined metrics, incomplete reporting, and insufficient validation, and that mandating the XAI Evaluation Card template for all new metric proposals will address this by requiring explicit statements on target properties, grounding levels, metric assumptions, validation evidence, gaming risks, and known failure cases.

What carries the argument

The XAI Evaluation Card, a documentation template modeled on model cards that structures the disclosure of target properties, grounding levels, assumptions, validation evidence, gaming risks, and known failure cases for new XAI evaluation metrics.
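The card's disclosure sections, as enumerated in the paper, can be sketched as a minimal data container. This class and its completeness check are illustrative assumptions of this review, not an artifact from the paper; only the field names follow the text.

```python
from dataclasses import dataclass

@dataclass
class XAIEvaluationCard:
    """Illustrative sketch of the card's disclosure fields (field names from the paper)."""
    metric_name: str
    target_properties: list[str]      # what the metric claims to measure
    grounding_level: str              # e.g. functionally vs. human grounded
    assumptions: list[str]            # conditions under which scores are meaningful
    validation_evidence: list[str]    # baselines and sanity checks the metric was tested against
    gaming_risks: list[str]           # ways a method could score well without merit
    known_failure_cases: list[str]    # documented settings where the metric misleads

    def undisclosed_fields(self) -> list[str]:
        """Flag list-valued sections left empty, i.e. a checklist-style submission."""
        return [name for name, value in vars(self).items()
                if isinstance(value, list) and not value]
```

A check like `undisclosed_fields` is one way a reviewer-side tool could flag submissions that fill the card superficially, which is exactly the load-bearing adoption risk discussed below.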

If this is right

  • Evaluations of XAI methods would become less fragmented across independent studies.
  • Meta-analyses comparing different XAI approaches would become more feasible due to standardized disclosures.
  • Researchers would be held more accountable for the limitations and risks of the metrics they propose.
  • New metrics would more often include explicit validation against common baselines and documented failure cases.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar card templates might be developed for evaluation metrics in adjacent areas such as fairness or robustness testing.
  • Widespread use could support the creation of a shared public registry of XAI metrics with their completed cards.
  • Journal and conference reviewers might begin requiring the card, which could speed up adoption beyond voluntary uptake.

Load-bearing premise

That researchers and reviewers will voluntarily adopt and enforce the template in a substantive way rather than treating it as a superficial checklist.

What would settle it

A review of papers introducing new XAI metrics after the template is proposed that finds no measurable improvement in reporting completeness, consistency of definitions, or presence of validation evidence and failure-case analysis.

Original abstract

The evaluation of explainable AI (XAI) methods is affected by a lack of standardization. Metrics are inconsistently defined, incompletely reported, and rarely validated against common baselines. In this paper, we identify transparency of evaluation reporting as a central, under-addressed problem. We propose the XAI Evaluation Card, a documentation template analogous to model cards, designed to accompany any study that introduces an XAI evaluation metric. The card covers explicit declaration of target properties, grounding levels, metric assumptions, validation evidence, gaming risks, and known failure cases. We argue that adopting this template as a community norm would reduce evaluation fragmentation, support meta-analysis, and improve accountability in XAI research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper identifies a lack of standardization in XAI metric evaluation, including inconsistent definitions, incomplete reporting, and rare validation against baselines. It proposes the XAI Evaluation Card, a documentation template to accompany studies introducing new metrics, covering target properties, grounding levels, assumptions, validation evidence, gaming risks, and failure cases. The authors argue that community-wide adoption of this template as a norm would reduce fragmentation, support meta-analysis, and improve accountability.

Significance. If implemented, the template could encourage more explicit and transparent reporting of XAI metric properties and limitations, analogous to model cards, potentially aiding comparability across studies. The proposal is concrete in its coverage of relevant aspects such as gaming risks and known failure cases. However, the claimed benefits remain hypothetical, as the manuscript offers no pilot data, adoption analysis, or comparison to prior standardization efforts to substantiate the impact.

major comments (2)
  1. [Abstract and conclusion] The central claim that adoption 'as a community norm' would reduce fragmentation and improve accountability (abstract and concluding section) rests on an untested behavioral assumption with no supporting analysis, pilot deployment, or discussion of enforcement mechanisms such as reviewer guidelines or journal policies.
  2. [Section describing the card template] No filled example of the XAI Evaluation Card is provided for any existing metric, nor is there a case study demonstrating how completing the card would alter or improve current reporting practices; this omission leaves the practical utility unillustrated.
minor comments (2)
  1. [Introduction] The manuscript could benefit from explicit comparison to related templates (e.g., model cards or datasheets) to clarify what is novel in the proposed structure.
  2. [Card description] Terminology such as 'grounding levels' and 'gaming risks' is introduced without a dedicated definitions subsection, which may reduce clarity for readers unfamiliar with XAI evaluation literature.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive feedback, which has helped clarify the scope and presentation of our proposal. We address the major comments point by point below.

Point-by-point responses
  1. Referee: [Abstract and conclusion] The central claim that adoption 'as a community norm' would reduce fragmentation and improve accountability (abstract and concluding section) rests on an untested behavioral assumption with no supporting analysis, pilot deployment, or discussion of enforcement mechanisms such as reviewer guidelines or journal policies.

    Authors: We agree that the benefits of adoption are prospective and rest on behavioral assumptions that the current manuscript does not empirically test. In the revision we have moderated the language in both the abstract and conclusion to present the expected improvements as contingent on successful community uptake rather than assured outcomes. We have also added a dedicated paragraph in the discussion section outlining plausible enforcement pathways, including integration into conference reviewer guidelines and journal submission requirements, drawing on the precedent of model cards. These changes make the claims more precise without overstating what the work demonstrates. revision: partial

  2. Referee: [Section describing the card template] No filled example of the XAI Evaluation Card is provided for any existing metric, nor is there a case study demonstrating how completing the card would alter or improve current reporting practices; this omission leaves the practical utility unillustrated.

    Authors: We accept this criticism. The revised manuscript now contains a new appendix with a complete, filled-out XAI Evaluation Card for the well-known 'Faithfulness' metric. The example is accompanied by a short commentary that contrasts the card's disclosures with the typical reporting found in the original metric paper, thereby illustrating concrete improvements in transparency regarding assumptions, validation baselines, and gaming risks. revision: yes

standing simulated objections (not resolved)
  • The manuscript offers no pilot deployment data or adoption analysis to quantify the template's impact; such evidence would require a separate empirical study that lies outside the scope of the present proposal.

Circularity Check

0 steps flagged

No circularity: proposal is a self-contained normative recommendation without derivations or self-referential reductions

Full rationale

The manuscript identifies inconsistencies in XAI metric evaluation and proposes an Evaluation Card template as a documentation standard. Its central argument—that voluntary community adoption would reduce fragmentation and improve accountability—is presented as a reasoned recommendation based on observed problems, not as a derived prediction or first-principles result. No equations, parameters, or quantitative claims appear that could reduce to fitted inputs or prior self-citations by construction. The text contains no load-bearing self-citations, uniqueness theorems, or ansatzes that loop back to the paper's own inputs. This is a standard case of an honest non-finding for a purely prescriptive paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The proposal rests on the domain assumption that inconsistent metric reporting is the central barrier to progress in XAI evaluation and that a standardized card will address it without introducing new problems.

axioms (1)
  • domain assumption Inconsistent definition, incomplete reporting, and lack of validation against baselines are the primary problems in XAI metric evaluation
    Explicitly stated in the abstract as the motivation for the work.
invented entities (1)
  • XAI Evaluation Card no independent evidence
    purpose: A documentation template to accompany papers introducing XAI evaluation metrics
    Newly proposed artifact with no prior existence or independent validation shown.

pith-pipeline@v0.9.0 · 5413 in / 1259 out tokens · 28494 ms · 2026-05-08T17:47:06.140286+00:00 · methodology


Reference graph

Works this paper leans on

16 extracted references · 4 canonical work pages · 1 internal anchor

  1. [1] Puja Banerjee and Rajesh P Barnwal. Methods and metrics for explaining artificial intelligence models: A review. Explainable AI: Foundations, Methodologies and Applications, pages 61–88, 2022.
  2. [2] Loredana Coroama and Adrian Groza. Evaluation metrics in explainable artificial intelligence (XAI). In International Conference on Advanced Research in Technologies, Information, Innovation and Sustainability, pages 401–413. Springer.
  3. [3] David Dembinsky, Adriano Lucieri, Stanislav Frolov, Hiba Najjar, Ko Watanabe, and Andreas Dengel. Unifying VXAI: A Systematic Review and Framework for the Evaluation of Explainable AI. arXiv preprint arXiv:2506.15408, 2025.
  4. [4] Finale Doshi-Velez and Been Kim. Towards a rigorous science of interpretable machine learning. arXiv preprint arXiv:1702.08608, 2017.
  5. [5] Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford. Datasheets for datasets. Communications of the ACM, 64(12):86–92, 2021.
  6. [6] Rokas Gipiškis, Chun-Wei Tsai, and Olga Kurasova. Explainable AI (XAI) in image segmentation in medicine, industry, and beyond: A survey. ICT Express, 10(6):1331–1354.
  7. [7] Md Abdul Kadir, Amir Mosavi, and Daniel Sonntag. Evaluation metrics for XAI: A review, taxonomy, and practical applications. In 2023 IEEE 27th International Conference on Intelligent Engineering Systems (INES), pages 000111–000124. IEEE, 2023.
  8. [8] Pedro Lopes, Eduardo Silva, Cristiana Braga, Tiago Oliveira, and Luís Rosado. XAI systems evaluation: a review of human and computer-centred methods. Applied Sciences, 12(19):9423, 2022.
  9. [9] Aline Mangold, Juliane Zietz, Susanne Weinhold, and Sebastian Pannasch. On the Design and Evaluation of Human-centered Explainable AI Systems: A Systematic Review and Taxonomy. arXiv preprint arXiv:2510.12201, 2025.
  10. [10] Margaret Mitchell, Simone Wu, Andrew Zaldivar, Parker Barnes, Lucy Vasserman, Ben Hutchinson, Elena Spitzer, Inioluwa Deborah Raji, and Timnit Gebru. Model cards for model reporting. In Proceedings of the Conference on Fairness, Accountability, and Transparency, pages 220–229.
  11. [11] Sina Mohseni, Niloofar Zarei, and Eric D Ragan. A multidisciplinary survey and framework for design and evaluation of explainable AI systems. ACM Transactions on Interactive Intelligent Systems (TiiS), 11(3-4):1–45, 2021.
  12. [12] Meike Nauta, Jan Trienes, Shreyasi Pathak, Elisa Nguyen, Michelle Peters, Yasmin Schmitt, Jörg Schlötterer, Maurice Van Keulen, and Christin Seifert. From anecdotal evidence to quantitative evaluation methods: A systematic review on evaluating explainable AI. ACM Computing Surveys, 55(13s):1–42, 2023.
  13. [13] Marek Pawlicki, Aleksandra Pawlicka, Federica Uccello, Sebastian Szelest, Salvatore D'Antonio, Rafał Kozik, and Michał Choraś. Evaluating the necessity of the multiple metrics for assessing explainable AI: A critical examination. Neurocomputing, 602:128282, 2024.
  14. [14] Vitali Petsiuk, Abir Das, and Kate Saenko. RISE: Randomized Input Sampling for Explanation of Black-box Models. arXiv preprint arXiv:1806.07421, 2018.
  15. [15] Marion Sisk, Makeen Majlis, Cameron Page, and Abbas Yazdinejad. Analyzing XAI metrics: Summary of the literature review. Authorea Preprints, 2022.
  16. [16] Jianlong Zhou, Amir H Gandomi, Fang Chen, and Andreas Holzinger. Evaluating the quality of machine learning explanations: A survey on methods and metrics. Electronics, 10(5):593, 2021.