pith. machine review for the scientific record.

arxiv: 2604.18311 · v1 · submitted 2026-04-20 · 💻 cs.CL · cs.AI

Recognition: unknown

On the Importance and Evaluation of Narrativity in Natural Language AI Explanations

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 05:17 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI
keywords explainable AI · natural language explanations · narrative structure · evaluation metrics · human understanding · cause-effect reasoning · text generation

The pith

Explanations for AI predictions become more understandable when written as narratives with continuous flow, cause-effect links, fluent phrasing, and varied wording instead of static feature lists.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that current natural language explanations in explainable AI mostly output lists of important features, which show what affects a prediction but not why it occurs. Drawing on ideas from social sciences and linguistics, it claims that true understanding requires four narrative properties: a continuous structure that links ideas, explicit cause-effect mechanisms, linguistic fluency, and lexical diversity. Standard NLP scoring methods based on word probabilities or frequencies fail here because they rate empty or tautological text as highly as actual explanations. Seven new automatic metrics are introduced to measure the four properties separately, and tests across six datasets show these metrics distinguish narrative-style outputs from descriptive ones more reliably than the standard metrics do. The work ends by offering a set of general rules for generating XAI explanations that meet the narrative criteria.

Core claim

Explanations for machine learning model behaviour should be presented as narratives rather than static lists of feature importances. Narratives support human understanding through four defining properties: continuous structure, cause-effect mechanisms, linguistic fluency, and lexical diversity. Standard token-probability or frequency-based NLP metrics cannot capture these properties and can be matched by non-explanatory text. Seven new automatic metrics are therefore defined to quantify narrative quality along the four dimensions, and benchmarking on six datasets shows they separate descriptive from narrative explanations more effectively than prior metrics. A set of problem-agnostic generation rules is finally proposed for producing natural language XAI explanations with stronger narrative properties.
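
The tautological-text failure mode is easy to poke at. A minimal sketch, assuming GPT-2 via the Hugging Face transformers library as the scoring model and two invented sentences; this illustrates the critique, not the paper's protocol:

```python
# A sketch of the critique, not the paper's setup: score an explanatory
# sentence and a tautological one with GPT-2 perplexity. Both sentences
# are invented for illustration.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    # exp of the mean token negative log-likelihood under the model
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean NLL over tokens
    return float(torch.exp(loss))

explanatory = ("The loan was denied mainly because the applicant's income "
               "lies far below the average, which raises the predicted risk.")
tautological = ("The model made this prediction because this prediction "
                "is the prediction the model made.")

# A fluent tautology can score as well as (or better than) a real explanation.
print(perplexity(explanatory), perplexity(tautological))
```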

What carries the argument

The four narrative properties (continuous structure, cause-effect mechanisms, linguistic fluency, and lexical diversity) together with the seven proposed automatic metrics that quantify each property in generated XAI text.
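
The seven metrics themselves are not reproduced on this page, but the four properties invite simple proxies. A minimal sketch with hand-rolled connective and causal-marker lists and invented example texts; none of this is the paper's actual metric suite, and fluency is covered by the perplexity snippet above:

```python
# Illustrative proxies only: rough stand-ins for three of the four
# properties (continuous structure, cause-effect, lexical diversity).
import math
import re
from collections import Counter

CONNECTIVES = {"because", "therefore", "then", "since", "consequently", "thus", "so"}
CAUSAL_MARKERS = {"because", "causes", "leads", "due", "raises", "results"}

def tokens(text):
    return re.findall(r"[a-z']+", text.lower())

def connective_density(text):
    # Proxy for continuous structure: share of tokens that are discourse connectives.
    toks = tokens(text)
    return sum(t in CONNECTIVES for t in toks) / max(len(toks), 1)

def causal_marker_rate(text):
    # Proxy for cause-effect mechanisms: share of tokens that signal causation.
    toks = tokens(text)
    return sum(t in CAUSAL_MARKERS for t in toks) / max(len(toks), 1)

def type_token_ratio(text):
    # Proxy for lexical diversity: distinct tokens over total tokens.
    toks = tokens(text)
    return len(set(toks)) / max(len(toks), 1)

def unigram_entropy(text):
    # Second diversity proxy: Shannon entropy of the unigram distribution (bits/token).
    toks = tokens(text)
    if not toks:
        return 0.0
    n = len(toks)
    return -sum(c / n * math.log2(c / n) for c in Counter(toks).values())

narrative = ("The price rises because demand grew sharply, which in turn "
             "raises expected revenue and therefore the predicted score.")
feature_list = "Feature demand: 0.41. Feature revenue: 0.33. Feature price: 0.12."

for name, fn in [("connectives", connective_density), ("causal", causal_marker_rate),
                 ("ttr", type_token_ratio), ("entropy", unigram_entropy)]:
    print(f"{name}: narrative={fn(narrative):.3f} list={fn(feature_list):.3f}")
```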

If this is right

  • Current explanation generation methods can be evaluated and ranked more reliably by how well they meet the four narrative properties (a rank-based sketch follows this list).
  • Standard NLP metrics based on token probability are unsuitable for assessing explanatory quality because they accept empty text.
  • Following the proposed generation rules produces explanations that exhibit stronger continuous structure, causal links, fluency, and lexical diversity.
  • Benchmark results on six datasets establish a baseline for comparing future narrative explanation systems against existing ones.
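
The ranking idea in the first bullet can be made concrete. A minimal sketch in the spirit of the mean-rank comparisons in Figures 2-3 (and the Demšar-style analysis the paper cites); the scores and framework names below are invented:

```python
# Scores and framework names are invented; the procedure (rank within each
# dataset, then average the ranks) mirrors the mean-rank comparison style.
import numpy as np
from scipy.stats import rankdata

# rows: six datasets, columns: three hypothetical frameworks; higher = better
scores = np.array([
    [0.61, 0.74, 0.55],
    [0.58, 0.71, 0.60],
    [0.70, 0.69, 0.52],
    [0.64, 0.80, 0.57],
    [0.59, 0.77, 0.61],
    [0.66, 0.73, 0.50],
])

ranks = rankdata(-scores, axis=1)   # rank 1 = best on each dataset
mean_ranks = ranks.mean(axis=0)
for name, r in zip(["templated", "llm_narrative", "feature_list"], mean_ranks):
    print(f"{name}: mean rank {r:.2f}")
```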

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Narrative XAI outputs could reduce misinterpretation of model decisions in domains where users must act on the explanation.
  • The metrics might be extended to score explanations produced by large language models that were not part of the original benchmarks.
  • Adopting narrative rules could make regulatory requirements for AI transparency easier to meet in practice.

Load-bearing premise

The four properties taken from social sciences and linguistics are the main drivers of human understanding for AI explanations, and the new metrics measure them without direct validation against human comprehension studies.

What would settle it

A controlled study in which people achieve equal or higher accuracy at predicting model behaviour from non-narrative feature lists, or from text lacking one of the four properties, than from full narrative explanations; or a study in which the seven new metrics show no correlation with human ratings of explanation usefulness.
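
The second half of that test is mechanical once human ratings exist. A minimal sketch of the metric-versus-human correlation check; all numbers are placeholders:

```python
# Placeholder numbers only; real validation would pair each generated
# explanation's metric score with a human usefulness rating.
from scipy.stats import spearmanr

metric_scores = [0.42, 0.55, 0.31, 0.78, 0.66, 0.49, 0.71, 0.60]
human_ratings = [3, 4, 2, 5, 4, 3, 5, 4]  # e.g. 1-5 usefulness judgments

rho, p = spearmanr(metric_scores, human_ratings)
print(f"Spearman rho={rho:.2f}, p={p:.3f}")
# A rho near zero (or a non-significant p) would undercut the claim that
# the metric tracks human-judged explanation usefulness.
```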

Figures

Figures reproduced from arXiv: 2604.18311 by David Martens, Mateusz Cedro.

Figure 1. Entropy-diversity profiles of generated explanations of the templated (TalkToModel, … view at source ↗
Figure 2. Average framework ranks across metrics and six datasets. Each cell shows the mean rank … view at source ↗
Figure 3. Framework average ranks across metric groups. Bars represent mean of per-metric average … view at source ↗
Figure 4. Nominal change in perplexity (PPL) relative to the PPL of the original sentence order across … view at source ↗
Figure 5. Cumulative sentence perplexity trajectories of templated and LLM-generated explanatory … view at source ↗
Figure 6. Changes in cumulative perplexity after sentence ordering perturbations of templated … view at source ↗
Figure 7. Changes in the perplexity distribution at particular sentence positions of the explanation … view at source ↗
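
Figures 4-7 revolve around one operation: perturb the sentence order and watch perplexity. A minimal sketch of that perturbation, reusing the perplexity() helper from the earlier snippet (assumed in scope); the example text is invented:

```python
# Sentence-order perturbation in the spirit of Figures 4-7. Intuition: text
# with continuous structure should lose more fluency (perplexity should rise
# more) under shuffling than a flat feature list does.
import random

def shuffle_sentences(text: str, seed: int = 0) -> str:
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    rng = random.Random(seed)
    rng.shuffle(sentences)
    return ". ".join(sentences) + "."

explanation = ("Income is the strongest driver of the prediction. "
               "Because it lies far below the average, the estimated risk rises. "
               "That rise is what pushes the model toward rejection.")

base = perplexity(explanation)                       # helper from earlier snippet
perturbed = perplexity(shuffle_sentences(explanation))
print(f"original PPL={base:.1f}, shuffled PPL={perturbed:.1f}, delta={perturbed - base:+.1f}")
```
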
read the original abstract

Explainable AI (XAI) aims to make the behaviour of machine learning models interpretable, yet many explanation methods remain difficult to understand. The integration of Natural Language Generation into XAI aims to deliver explanations in textual form, making them more accessible to practitioners. Current approaches, however, largely yield static lists of feature importances. Although such explanations indicate what influences the prediction, they do not explain why the prediction occurs. In this study, we draw on insights from social sciences and linguistics, and argue that XAI explanations should be presented in the form of narratives. Narrative explanations support human understanding through four defining properties: continuous structure, cause-effect mechanisms, linguistic fluency, and lexical diversity. We show that standard Natural Language Processing (NLP) metrics based solely on token probability or word frequency fail to capture these properties and can be matched or exceeded by tautological text that conveys no explanatory content. To address this issue, we propose seven automatic metrics that quantify the narrative quality of explanations along the four identified dimensions. We benchmark current state-of-the-art explanation generation methods on six datasets and show that the proposed metrics separate descriptive from narrative explanations more reliably than standard NLP metrics. Finally, to further advance the field, we propose a set of problem-agnostic XAI Narrative generation rules for producing natural language XAI explanations, so that the resulting XAI Narratives exhibit stronger narrative properties and align with the findings from the linguistic and social science literature.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper argues that XAI explanations should be presented as narratives rather than static feature lists, drawing on four properties from social sciences and linguistics (continuous structure, cause-effect mechanisms, linguistic fluency, and lexical diversity) to support human understanding. It demonstrates that standard token-probability and word-frequency NLP metrics can be matched or exceeded by tautological text with no explanatory value, proposes seven automatic metrics to quantify the four narrative dimensions, benchmarks these metrics on six datasets against state-of-the-art explanation generators to show improved separation of narrative from descriptive text, and offers a set of problem-agnostic generation rules intended to produce explanations with stronger narrative properties.

Significance. If the proposed metrics prove to correlate with human comprehension outcomes, the work could supply a practical evaluation framework and generation guidelines that move XAI beyond feature-importance lists toward more usable textual explanations. The tautological-text counter-example usefully exposes a concrete limitation of probability-based baselines. The benchmarking results on six datasets provide initial evidence that the new metrics discriminate better than prior approaches, which is a methodological contribution even if downstream human validation is still required.

major comments (2)
  1. [Abstract and §4 (Benchmarking)] The claim that the four narrative properties are primary drivers of human understanding for XAI explanations rests on citations to external literature but is not tested with any human-subject experiments (comprehension tests, trust ratings, or decision-quality measures) comparing high- versus low-narrative explanations. The reported benchmarking only establishes that the metrics separate narrative from descriptive text more reliably than baselines; it does not establish that higher scores on the metrics predict improved human outcomes.
  2. [§3 (Metric definitions)] The seven automatic metrics are introduced without derivation details, formal definitions, or error analysis. No sensitivity studies (e.g., robustness to explanation length, domain shift, or lexical variation) are reported, even though such robustness is load-bearing for the assertion that these metrics reliably quantify the four narrative properties across the six datasets.
minor comments (2)
  1. [Generation rules section] The generation rules in the final section would be easier to apply if accompanied by a short pseudocode or checklist format.
  2. A table summarizing the seven metrics, their formulas, and the narrative dimension each targets would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments, which help clarify the scope and presentation of our contributions. We address each major comment below, indicating planned revisions where appropriate.

read point-by-point responses
  1. Referee: [Abstract and §4] The claim that the four narrative properties are primary drivers of human understanding for XAI explanations rests on citations to external literature but is not tested with any human-subject experiments (comprehension tests, trust ratings, or decision-quality measures) comparing high- versus low-narrative explanations. The reported benchmarking only establishes that the metrics separate narrative from descriptive text more reliably than baselines; it does not establish that higher scores on the metrics predict improved human outcomes.

    Authors: We agree that the manuscript motivates the four properties via citations to the social science and linguistics literature rather than new human-subject studies, and that the benchmarking in §4 demonstrates improved separation of narrative from descriptive text (including via the tautological counter-example) without directly validating the metrics against human comprehension outcomes. The paper does not claim such predictive validity. In revision we will (i) update the abstract and §4 to explicitly qualify the claims as literature-grounded and (ii) add a limitations paragraph stating that direct human validation of the metrics remains future work. This is a partial revision because no new experiments are added. revision: partial

  2. Referee: [§3] The seven automatic metrics are introduced without derivation details, formal definitions, or error analysis. No sensitivity studies (e.g., robustness to explanation length, domain shift, or lexical variation) are reported, which is load-bearing for the assertion that these metrics reliably quantify the four narrative properties across the six datasets.

    Authors: We accept that §3 would be strengthened by additional formalization. In the revised version we will expand the section to supply (a) explicit mathematical definitions for each metric, (b) brief derivation notes linking each metric to its target narrative property, and (c) a new error-analysis subsection that includes sensitivity checks on explanation length, domain shift, and lexical variation performed on the existing six datasets. These additions address the referee’s concern without altering the core results. revision: yes
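
For concreteness, the length-robustness part of the promised error analysis could look like the following sketch, where metric is a hypothetical stand-in for any of the seven proposed metrics:

```python
# Hypothetical helper: metric is any callable scoring one explanation string;
# none of the seven proposed metrics is implemented here.
import numpy as np

def length_sensitivity(metric, explanations):
    # Correlate metric scores with explanation length; a strong correlation
    # would suggest the metric partly tracks length rather than narrativity.
    lengths = [len(e.split()) for e in explanations]
    scores = [metric(e) for e in explanations]
    return float(np.corrcoef(lengths, scores)[0, 1])
```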

Circularity Check

0 steps flagged

No circularity: claims grounded in external literature with independent metrics and benchmarks

full rationale

The paper draws its four narrative properties (continuous structure, cause-effect mechanisms, linguistic fluency, lexical diversity) from cited social-science and linguistics sources rather than defining them in terms of its own outputs or metrics. It then proposes seven new automatic metrics and generation rules as original contributions, benchmarks them against standard NLP baselines on six external datasets, and demonstrates separation of narrative vs. descriptive text without any self-referential fitting, renaming of parameters as predictions, or load-bearing self-citation chains. The derivation chain is validated against external benchmarks and does not reduce any core claim to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central argument rests on domain assumptions imported from social sciences and linguistics about effective explanations; no free parameters, invented entities, or ad-hoc axioms are introduced in the abstract.

axioms (1)
  • domain assumption: Narrative explanations support human understanding through continuous structure, cause-effect mechanisms, linguistic fluency, and lexical diversity.
    Stated as drawn from insights in social sciences and linguistics.

pith-pipeline@v0.9.0 · 5555 in / 1190 out tokens · 52082 ms · 2026-05-10T05:17:27.903994+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

88 extracted references · 24 canonical work pages · 3 internal anchors

  1. [1] Tim Miller. Explanation in artificial intelligence: Insights from the social sciences. Artificial Intelligence, 267:1–38, 2019.
  2. [2] David Martens, James Hinns, Camille Dams, Mark Vergouwen, and Theodoros Evgeniou. Tell me a story! Narrative-driven XAI with Large Language Models. Decision Support Systems, 191:114402, 2025. doi: 10.1016/j.dss.2025.114402.
  3. [3] Tim Miller, Piers Howe, and Liz Sonenberg. Explainable AI: Beware of inmates running the asylum or: How I learnt to stop worrying and love the social and behavioural sciences. arXiv preprint arXiv:1712.00547, 2017.
  4. [4] Michael F Dahlstrom. Using narratives and storytelling to communicate science with nonexpert audiences. Proceedings of the National Academy of Sciences, 111(supplement_4):13614–13620, 2014.
  5. [5] Mateusz Cedro and David Martens. GraphXAIN: Narratives to Explain Graph Neural Networks. In Explainable Artificial Intelligence, pages 91–114. Springer Nature Switzerland, 2025.
  6. [6] Alexandra Zytek, Sara Pido, Sarah Alnegheimish, Laure Berti-Equille, and Kalyan Veeramachaneni. Explingo: Explaining AI predictions using large language models. In 2024 IEEE International Conference on Big Data (BigData), pages 1197–1208. IEEE, 2024.
  7. [7] Dylan Slack, Satyapriya Krishna, Himabindu Lakkaraju, and Sameer Singh. Explaining machine learning models with interactive natural language conversations using TalkToModel. Nature Machine Intelligence, 5(8):873–883, 2023.
  8. [8] Michael Friedman. Explanation and scientific understanding. The Journal of Philosophy, 71(1):5–19, 1974.
  9. [9] Stephen P Norris, Sandra M Guilbert, Martha L Smith, Shahram Hakimelahi, and Linda M Phillips. A theoretical framework for narrative explanation in science. Science Education, 89(4):535–563, 2005.
  10. [10] Jerome S Bruner. Actual minds, possible worlds. Harvard University Press, 1986.
  11. [11] Scott M Lundberg and Su-In Lee. A unified approach to interpreting model predictions. Advances in Neural Information Processing Systems, 30, 2017.
  12. [12] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. "Why should I trust you?" Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1135–1144, 2016.
  13. [13] David Martens and Foster Provost. Explaining data-driven document classifications. MIS Quarterly, 38(1):73–100, 2014.
  14. [14] Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034, 2013.
  15. [15] Ramprasaath R. Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 618–626, 2017. doi: 10.1109/ICCV.2017.74.
  16. [16] Zhitao Ying, Dylan Bourgeois, Jiaxuan You, Marinka Zitnik, and Jure Leskovec. GNNExplainer: Generating explanations for graph neural networks. Advances in Neural Information Processing Systems, 32, 2019.
  17. [17] Erik Cambria, Lorenzo Malandri, Fabio Mercorio, Mario Mezzanzanica, and Navid Nobani. A survey on XAI and natural language explanations. Information Processing & Management, 60(1):103111, 2023.
  18. [18] Yang Li, Quan Pan, Suhang Wang, Tao Yang, and Erik Cambria. A generative model for category text generation. Information Sciences, 450:301–315, 2018. doi: 10.1016/j.ins.2018.03.050.
  19. [19] Bo Pan, Zhen Xiong, Guanchen Wu, Zheng Zhang, Yifei Zhang, Yuntong Hu, and Liang Zhao. GraphNarrator: Generating textual explanations for graph neural networks. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 23–42, 2025.
  20. [20] Fawaz Sammani and Nikos Deligiannis. Zero-shot natural language explanations. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=X6VVK8pIzZ.
  21. [21] Yifan He and David Martens. An agentic approach to generating XAI-Narratives. arXiv preprint arXiv:2603.20003, 2026.
  22. [22] Yangqiaoyu Zhou, Yiming Zhang, and Chenhao Tan. FLamE: Few-shot learning from natural language explanations. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6743–6763. Association for Computational Linguistics, 2023. doi: 10.18653/v1/2023.acl-long.372.
  23. [23] Ettore Mariotti, Jose M Alonso, and Albert Gatt. Towards harnessing natural language generation to explain black-box models. In 2nd Workshop on Interactive Natural Language Technology for Explainable Artificial Intelligence, pages 22–27, 2020.
  24. [24] Fred Jelinek, Robert L Mercer, Lalit R Bahl, and James K Baker. Perplexity—a measure of the difficulty of speech recognition tasks. The Journal of the Acoustical Society of America, 62(S1):S63–S63, 1977.
  25. [25] Timour Ichmoukhamedov, James Hinns, and David Martens. How good is my story? Towards quantitative metrics for evaluating LLM-generated XAI narratives. arXiv preprint arXiv:2412.10220, 2024.
  26. [26] Nasim Shirvani-Mahdavi and Chengkai Li. Rule2Text: A framework for generating and evaluating natural language explanations of knowledge graph rules. arXiv preprint arXiv:2508.10971, 2025.
  27. [27] Yequan Wang, Jiawen Deng, Aixin Sun, and Xuying Meng. Perplexity from PLM is unreliable for evaluating text quality. arXiv preprint arXiv:2210.05892, 2022.
  28. [28] Tatsunori B. Hashimoto, Hugh Zhang, and Percy Liang. Unifying human and statistical evaluation for natural language generation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1689–1701, Minneapolis, Minnesota, 2019. Association for Computational Linguistics. doi: 10.18653/v1/N19-1169.
  30. [30] Arun Tejasvi Chaganty, Stephen Mussman, and Percy Liang. The price of debiasing automatic metrics in natural language evaluation. arXiv preprint arXiv:1807.02202, 2018.
  31. [31] Petar Veličković, Federico Barbero, Christos Perivolaropoulos, Simon Osindero, and Razvan Pascanu. Perplexity cannot always tell right from wrong. arXiv preprint arXiv:2601.22950, 2026.
  32. [32] Lizhe Fang, Yifei Wang, Zhaoyang Liu, Chenheng Zhang, Stefanie Jegelka, Jinyang Gao, Bolin Ding, and Yisen Wang. What is wrong with perplexity for long-context language modeling? arXiv preprint arXiv:2410.23771, 2024.
  33. [33] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, 2002.
  34. [34] Satanjeev Banerjee and Alon Lavie. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65–72, 2005.
  35. [35] Chin-Yew Lin. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, 2004.
  36. [36] Thibault Sellam, Dipanjan Das, and Ankur Parikh. BLEURT: Learning robust metrics for text generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7881–7892, 2020.
  37. [37] Clara Meister and Ryan Cotterell. Language model evaluation beyond perplexity. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 5328–5339. Association for Computational Linguistics, 2021.
  38. [38] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, 2019.
  39. [39] Sule Tekkesinoglu and Lars Kunze. From feature importance to natural language explanations using LLMs with RAG. arXiv preprint arXiv:2407.20990, 2024.
  40. [40] Arthur C Graesser, Danielle S McNamara, Max M Louwerse, and Zhiqiang Cai. Coh-Metrix: Analysis of text on cohesion and language. Behavior Research Methods, Instruments, & Computers, 36(2):193–202, 2004.
  41. [41] Maarten Sap, Eric Horvitz, Yejin Choi, Noah A Smith, and James Pennebaker. Recollection versus imagination: Exploring human memory and cognition via neural language models. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1970–1978, 2020.
  42. [42] Arthur C. Graesser and Danielle S. McNamara. Computational analyses of multilevel discourse comprehension. Topics in Cognitive Science, 3(2):371–398, 2011. doi: 10.1111/j.1756-8765.2010.01081.x.
  43. [43] Jianlong Zhou, Amir H. Gandomi, Fang Chen, and Andreas Holzinger. Evaluating the quality of machine learning explanations: A survey on methods and metrics. Electronics, 10(5), 2021. doi: 10.3390/electronics10050593.
  44. [44] Wesley C Salmon. Scientific explanation and the causal structure of the world. 1984.
  45. [45] Wesley C Salmon. Four decades of scientific explanation. [Part 3] The second decade (1958–67): Manifest destiny–expansion and conflict. 1989.
  46. [46] John D Murray. Connectives and narrative text: The role of continuity. Memory & Cognition, 25(2):227–236, 1997.
  47. [47] Arthur C Graesser, Danielle S McNamara, and Jonna M Kulikowich. Coh-Metrix: Providing multilevel analyses of text characteristics. Educational Researcher, 40(5):223–234, 2011.
  48. [48] Mary Galbraith. Deictic shift theory and the poetics of involvement in narrative. In Deixis in Narrative, pages 19–59. Psychology Press, 1995.
  49. [49] Edward J O'Brien, Michelle L Rizzella, Jason E Albrecht, and Jennifer G Halleran. Updating a situation model: a memory-based text processing view. Journal of Experimental Psychology: Learning, Memory, and Cognition, 24(5):1200, 1998.
  50. [50] Herbert A Simon. Discovering explanations. Explanation and Cognition, pages 21–59, 2000.
  51. [51] Carl G Hempel and Paul Oppenheim. Studies in the logic of explanation. Philosophy of Science, 15(2):135–175, 1948.
  52. [52] Ronald T Kellogg et al. The relative ease of writing narrative text. 1991.
  53. [53] Laura M Saenz and Lynn S Fuchs. Examining the reading difficulty of secondary students with learning disabilities: Expository versus narrative text. Remedial and Special Education, 23(1):31–41, 2002.
  54. [54] Charles A Weaver III and Walter Kintsch. Expository text. 1991.
  55. [55] Claude E Shannon. Prediction and entropy of printed English. Bell System Technical Journal, 30(1):50–64, 1951.
  56. [56] Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. The curious case of neural text degeneration. arXiv preprint arXiv:1904.09751, 2019.
  57. [57] Debopam Das, Tatjana Scheffler, Peter Bourgonje, and Manfred Stede. Constructing a lexicon of English discourse connectives. In Proceedings of the 19th Annual SIGdial Meeting on Discourse and Dialogue, pages 360–365, 2018.
  58. [58] Marco Valentino and André Freitas. Scientific explanation and natural language: A unified epistemological-linguistic perspective for explainable AI. arXiv preprint arXiv:2205.01809, 2022.
  59. [59] Marco Valentino and André Freitas. On the nature of explanation: An epistemological-linguistic perspective for explanation-based natural language inference. Philosophy & Technology, 37(3):88, 2024.
  60. [60] Philip Kitcher. Explanatory unification. Philosophy of Science, 48(4):507–531, 1981. URL http://www.jstor.org/stable/186834.
  61. [61] Carl G Hempel et al. Aspects of scientific explanation, volume 965. Free Press, New York, 1965.
  62. [62] Rashmi Prasad, Bonnie Webber, Alan Lee, and Aravind Joshi. Penn Discourse Treebank version 3.0. LDC2019T05, 2019.
  63. [63] Edouard Grave, Armand Joulin, and Nicolas Usunier. Improving neural language models with a continuous cache. arXiv preprint arXiv:1612.04426, 2016.
  64. [64] Nayeon Lee, Wei Ping, Peng Xu, Mostofa Patwary, Pascale N Fung, Mohammad Shoeybi, and Bryan Catanzaro. Factuality enhanced language models for open-ended text generation. In Advances in Neural Information Processing Systems, volume 35, pages 34586–34599. Curran Associates, Inc., 2022.
  65. [65] Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and William B Dolan. A diversity-promoting objective function for neural conversation models. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 110–119, 2016.
  66. [66] Aorigele Bao and Yi Zeng. Understanding the dilemma of explainable artificial intelligence: a proposal for a ritual dialog framework. Humanities and Social Sciences Communications, 11(1):1–9, 2024.
  67. [67] Leo Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.
  68. [68] Marharyta Domnich, Julius Välja, Rasmus Moorits Veski, Giacomo Magnifico, Kadi Tulver, Eduard Barbu, and Raul Vicente. Towards unifying evaluation of counterfactual explanations: Leveraging large language models for human-centric assessments. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 16308–16316, 2025.
  69. [69] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
  70. [70] Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.
  71. [71] Janez Demšar. Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 7(Jan):1–30, 2006.
  72. [72] Alexey Tikhonov, Igor Samenko, and Ivan P. Yamshchikov. StoryDB: Broad multi-language narrative dataset. In Yang Gao, Steffen Eger, Wei Zhao, Piyawat Lertvittayakumjorn, and Marina Fomicheva, editors, Proceedings of the 2nd Workshop on Evaluation and Comparison of NLP Systems, pages 32–39, Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics.
  73. [73] Start the explanation immediately.
  74. [74] Limit the entire answer to exactly {sentence_limit} sentences.
  75. [75] Only mention the top {num_feat} most important features in the narrative.
  76. [76] Do not use tables or lists, or simply rattle through the features and/or nodes one by one. The goal is to have a narrative/story.
  77. [77] Be clear about what the model actually predicted for the {target_instance}.
  78. [78] Discuss how the features contributed to the final prediction. Make sure to clearly establish this the first time you refer to a feature.
  79. [79] Consider the feature importance, feature values, and averages when referencing their relative importance.
  80. [80] Begin the discussion of features by presenting those with the highest absolute feature importance values first. The reader should be able to tell what the order of importance of the features is based on their feature importance value.
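
Entries [73]-[80] above are the paper's XAI Narrative generation rules rather than bibliographic references. A minimal sketch of how such rules could be assembled into an LLM prompt; only the rule texts and their {…} placeholders come from the list, while the wrapper function and example values are invented:

```python
# A sketch only: the rule strings are quoted from entries [73]-[80]; the
# build_prompt() wrapper and the example values passed to it are invented.
RULES = [
    "Start the explanation immediately.",
    "Limit the entire answer to exactly {sentence_limit} sentences.",
    "Only mention the top {num_feat} most important features in the narrative.",
    "Do not use tables or lists, or simply rattle through the features one by one. The goal is to have a narrative/story.",
    "Be clear about what the model actually predicted for the {target_instance}.",
    # ... remaining content rules from the list above
]

def build_prompt(sentence_limit: int, num_feat: int, target_instance: str) -> str:
    # Render the rules as a bulleted instruction block and fill the
    # placeholders the rules themselves expose.
    body = "\n".join(f"- {rule}" for rule in RULES)
    return body.format(sentence_limit=sentence_limit,
                       num_feat=num_feat,
                       target_instance=target_instance)

print(build_prompt(sentence_limit=5, num_feat=3, target_instance="instance #17"))
```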

Showing first 80 references.