pith. machine review for the scientific record.

arxiv: 2604.18311 · v1 · submitted 2026-04-20 · 💻 cs.CL · cs.AI

Recognition: unknown

On the Importance and Evaluation of Narrativity in Natural Language AI Explanations

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 05:17 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI
keywords explainable AI · natural language explanations · narrative structure · evaluation metrics · human understanding · cause-effect reasoning · text generation

The pith

Explanations for AI predictions become more understandable when written as narratives with continuous flow, cause-effect links, fluent phrasing, and varied wording instead of static feature lists.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that current natural language explanations in explainable AI mostly output lists of important features, which show what affects a prediction but not why it occurs. Drawing on ideas from social sciences and linguistics, it claims that true understanding requires four narrative properties: a continuous structure that links ideas, explicit cause-effect mechanisms, linguistic fluency, and lexical diversity. Standard NLP scoring methods based on word probabilities or frequencies fail here because they rate empty or tautological text as highly as actual explanations. Seven new automatic metrics are introduced to measure the four properties separately, and tests across six datasets show these metrics distinguish narrative-style outputs from descriptive ones more reliably than the standard metrics do. The work ends by offering a set of general rules for generating XAI explanations that meet the narrative criteria.

Core claim

Explanations for machine learning model behaviour should be presented as narratives rather than static lists of feature importances. Narratives support human understanding through four defining properties: continuous structure, cause-effect mechanisms, linguistic fluency, and lexical diversity. Standard token-probability or frequency-based NLP metrics cannot capture these properties and can be matched by non-explanatory text. Seven new automatic metrics are therefore defined to quantify narrative quality along the four dimensions, and benchmarking on six datasets shows they separate descriptive from narrative explanations more effectively than prior metrics. A set of problem-agnostic generation rules is finally proposed for producing natural language XAI explanations with stronger narrative properties.
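
The tautological-text failure mode is easy to poke at. A minimal sketch, assuming GPT-2 via the Hugging Face transformers library as the scoring model and two invented sentences; this illustrates the critique, not the paper's protocol:

```python
# A sketch of the critique, not the paper's setup: score an explanatory
# sentence and a tautological one with GPT-2 perplexity. Both sentences
# are invented for illustration.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    # exp of the mean token negative log-likelihood under the model
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean NLL over tokens
    return float(torch.exp(loss))

explanatory = ("The loan was denied mainly because the applicant's income "
               "lies far below the average, which raises the predicted risk.")
tautological = ("The model made this prediction because this prediction "
                "is the prediction the model made.")

# A fluent tautology can score as well as (or better than) a real explanation.
print(perplexity(explanatory), perplexity(tautological))
```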

What carries the argument

The four narrative properties (continuous structure, cause-effect mechanisms, linguistic fluency, and lexical diversity) together with the seven proposed automatic metrics that quantify each property in generated XAI text.
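
The seven metrics themselves are not reproduced on this page, but the four properties invite simple proxies. A minimal sketch with hand-rolled connective and causal-marker lists and invented example texts; none of this is the paper's actual metric suite, and fluency is covered by the perplexity snippet above:

```python
# Illustrative proxies only: rough stand-ins for three of the four
# properties (continuous structure, cause-effect, lexical diversity).
import math
import re
from collections import Counter

CONNECTIVES = {"because", "therefore", "then", "since", "consequently", "thus", "so"}
CAUSAL_MARKERS = {"because", "causes", "leads", "due", "raises", "results"}

def tokens(text):
    return re.findall(r"[a-z']+", text.lower())

def connective_density(text):
    # Proxy for continuous structure: share of tokens that are discourse connectives.
    toks = tokens(text)
    return sum(t in CONNECTIVES for t in toks) / max(len(toks), 1)

def causal_marker_rate(text):
    # Proxy for cause-effect mechanisms: share of tokens that signal causation.
    toks = tokens(text)
    return sum(t in CAUSAL_MARKERS for t in toks) / max(len(toks), 1)

def type_token_ratio(text):
    # Proxy for lexical diversity: distinct tokens over total tokens.
    toks = tokens(text)
    return len(set(toks)) / max(len(toks), 1)

def unigram_entropy(text):
    # Second diversity proxy: Shannon entropy of the unigram distribution (bits/token).
    toks = tokens(text)
    if not toks:
        return 0.0
    n = len(toks)
    return -sum(c / n * math.log2(c / n) for c in Counter(toks).values())

narrative = ("The price rises because demand grew sharply, which in turn "
             "raises expected revenue and therefore the predicted score.")
feature_list = "Feature demand: 0.41. Feature revenue: 0.33. Feature price: 0.12."

for name, fn in [("connectives", connective_density), ("causal", causal_marker_rate),
                 ("ttr", type_token_ratio), ("entropy", unigram_entropy)]:
    print(f"{name}: narrative={fn(narrative):.3f} list={fn(feature_list):.3f}")
```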

If this is right

  • Current explanation generation methods can be evaluated and ranked more reliably by how well they meet the four narrative properties (a rank-based sketch follows this list).
  • Standard NLP metrics based on token probability are unsuitable for assessing explanatory quality because they accept empty text.
  • Following the proposed generation rules produces explanations that exhibit stronger continuous structure, causal links, fluency, and lexical diversity.
  • Benchmark results on six datasets establish a baseline for comparing future narrative explanation systems against existing ones.
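
The ranking idea in the first bullet can be made concrete. A minimal sketch in the spirit of the mean-rank comparisons in Figures 2-3 (and the Demšar-style analysis the paper cites); the scores and framework names below are invented:

```python
# Scores and framework names are invented; the procedure (rank within each
# dataset, then average the ranks) mirrors the mean-rank comparison style.
import numpy as np
from scipy.stats import rankdata

# rows: six datasets, columns: three hypothetical frameworks; higher = better
scores = np.array([
    [0.61, 0.74, 0.55],
    [0.58, 0.71, 0.60],
    [0.70, 0.69, 0.52],
    [0.64, 0.80, 0.57],
    [0.59, 0.77, 0.61],
    [0.66, 0.73, 0.50],
])

ranks = rankdata(-scores, axis=1)   # rank 1 = best on each dataset
mean_ranks = ranks.mean(axis=0)
for name, r in zip(["templated", "llm_narrative", "feature_list"], mean_ranks):
    print(f"{name}: mean rank {r:.2f}")
```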

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Narrative XAI outputs could reduce misinterpretation of model decisions in domains where users must act on the explanation.
  • The metrics might be extended to score explanations produced by large language models that were not part of the original benchmarks.
  • Adopting narrative rules could make regulatory requirements for AI transparency easier to meet in practice.

Load-bearing premise

The four properties taken from social sciences and linguistics are the main drivers of human understanding for AI explanations, and the new metrics measure them without direct validation against human comprehension studies.

What would settle it

A controlled study in which people achieve equal or higher accuracy at predicting model behaviour from non-narrative feature lists, or from text lacking one of the four properties, than from full narrative explanations; or a study in which the seven new metrics show no correlation with human ratings of explanation usefulness.
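
The second half of that test is mechanical once human ratings exist. A minimal sketch of the metric-versus-human correlation check; all numbers are placeholders:

```python
# Placeholder numbers only; real validation would pair each generated
# explanation's metric score with a human usefulness rating.
from scipy.stats import spearmanr

metric_scores = [0.42, 0.55, 0.31, 0.78, 0.66, 0.49, 0.71, 0.60]
human_ratings = [3, 4, 2, 5, 4, 3, 5, 4]  # e.g. 1-5 usefulness judgments

rho, p = spearmanr(metric_scores, human_ratings)
print(f"Spearman rho={rho:.2f}, p={p:.3f}")
# A rho near zero (or a non-significant p) would undercut the claim that
# the metric tracks human-judged explanation usefulness.
```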

Figures

Figures reproduced from arXiv: 2604.18311 by David Martens, Mateusz Cedro.

Figure 1. Entropy-diversity profiles of generated explanations of the templated (TalkToModel, … view at source ↗
Figure 2. Average framework ranks across metrics and six datasets. Each cell shows the mean rank … view at source ↗
Figure 3. Framework average ranks across metric groups. Bars represent mean of per-metric average … view at source ↗
Figure 4. Nominal change in perplexity (PPL) relative to the PPL of the original sentence order across … view at source ↗
Figure 5. Cumulative sentence perplexity trajectories of templated and LLM-generated explanatory … view at source ↗
Figure 6. Changes in cumulative perplexity after sentence ordering perturbations of templated … view at source ↗
Figure 7. Changes in the perplexity distribution at particular sentence positions of the explanation … view at source ↗
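
Figures 4-7 revolve around one operation: perturb the sentence order and watch perplexity. A minimal sketch of that perturbation, reusing the perplexity() helper from the earlier snippet (assumed in scope); the example text is invented:

```python
# Sentence-order perturbation in the spirit of Figures 4-7. Intuition: text
# with continuous structure should lose more fluency (perplexity should rise
# more) under shuffling than a flat feature list does.
import random

def shuffle_sentences(text: str, seed: int = 0) -> str:
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    rng = random.Random(seed)
    rng.shuffle(sentences)
    return ". ".join(sentences) + "."

explanation = ("Income is the strongest driver of the prediction. "
               "Because it lies far below the average, the estimated risk rises. "
               "That rise is what pushes the model toward rejection.")

base = perplexity(explanation)                       # helper from earlier snippet
perturbed = perplexity(shuffle_sentences(explanation))
print(f"original PPL={base:.1f}, shuffled PPL={perturbed:.1f}, delta={perturbed - base:+.1f}")
```
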
read the original abstract

Explainable AI (XAI) aims to make the behaviour of machine learning models interpretable, yet many explanation methods remain difficult to understand. The integration of Natural Language Generation into XAI aims to deliver explanations in textual form, making them more accessible to practitioners. Current approaches, however, largely yield static lists of feature importances. Although such explanations indicate what influences the prediction, they do not explain why the prediction occurs. In this study, we draw on insights from social sciences and linguistics, and argue that XAI explanations should be presented in the form of narratives. Narrative explanations support human understanding through four defining properties: continuous structure, cause-effect mechanisms, linguistic fluency, and lexical diversity. We show that standard Natural Language Processing (NLP) metrics based solely on token probability or word frequency fail to capture these properties and can be matched or exceeded by tautological text that conveys no explanatory content. To address this issue, we propose seven automatic metrics that quantify the narrative quality of explanations along the four identified dimensions. We benchmark current state-of-the-art explanation generation methods on six datasets and show that the proposed metrics separate descriptive from narrative explanations more reliably than standard NLP metrics. Finally, to further advance the field, we propose a set of problem-agnostic XAI Narrative generation rules for producing natural language XAI explanations, so that the resulting XAI Narratives exhibit stronger narrative properties and align with the findings from the linguistic and social science literature.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper argues that XAI explanations should be presented as narratives rather than static feature lists, drawing on four properties from social sciences and linguistics (continuous structure, cause-effect mechanisms, linguistic fluency, and lexical diversity) to support human understanding. It demonstrates that standard token-probability and word-frequency NLP metrics can be matched or exceeded by tautological text with no explanatory value, proposes seven automatic metrics to quantify the four narrative dimensions, benchmarks these metrics on six datasets against state-of-the-art explanation generators to show improved separation of narrative from descriptive text, and offers a set of problem-agnostic generation rules intended to produce explanations with stronger narrative properties.

Significance. If the proposed metrics prove to correlate with human comprehension outcomes, the work could supply a practical evaluation framework and generation guidelines that move XAI beyond feature-importance lists toward more usable textual explanations. The tautological-text counter-example usefully exposes a concrete limitation of probability-based baselines. The benchmarking results on six datasets provide initial evidence that the new metrics discriminate better than prior approaches, which is a methodological contribution even if downstream human validation is still required.

major comments (2)
  1. [Abstract and §4 (Benchmarking)] The claim that the four narrative properties are primary drivers of human understanding for XAI explanations rests on citations to external literature but is not tested with any human-subject experiments (comprehension tests, trust ratings, or decision-quality measures) comparing high- versus low-narrative explanations. The reported benchmarking only establishes that the metrics separate narrative from descriptive text more reliably than baselines; it does not establish that higher scores on the metrics predict improved human outcomes.
  2. [§3 (Metric definitions)] The seven automatic metrics are introduced without derivation details, formal definitions, or error analysis. No sensitivity studies (e.g., robustness to explanation length, domain shift, or lexical variation) are reported, even though such robustness is load-bearing for the assertion that these metrics reliably quantify the four narrative properties across the six datasets.
minor comments (2)
  1. [Generation rules section] The generation rules in the final section would be easier to apply if accompanied by a short pseudocode or checklist format.
  2. A table summarizing the seven metrics, their formulas, and the narrative dimension each targets would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments, which help clarify the scope and presentation of our contributions. We address each major comment below, indicating planned revisions where appropriate.

read point-by-point responses
  1. Referee: [Abstract and §4] The claim that the four narrative properties are primary drivers of human understanding for XAI explanations rests on citations to external literature but is not tested with any human-subject experiments (comprehension tests, trust ratings, or decision-quality measures) comparing high- versus low-narrative explanations. The reported benchmarking only establishes that the metrics separate narrative from descriptive text more reliably than baselines; it does not establish that higher scores on the metrics predict improved human outcomes.

    Authors: We agree that the manuscript motivates the four properties via citations to the social science and linguistics literature rather than new human-subject studies, and that the benchmarking in §4 demonstrates improved separation of narrative from descriptive text (including via the tautological counter-example) without directly validating the metrics against human comprehension outcomes. The paper does not claim such predictive validity. In revision we will (i) update the abstract and §4 to explicitly qualify the claims as literature-grounded and (ii) add a limitations paragraph stating that direct human validation of the metrics remains future work. This is a partial revision because no new experiments are added. revision: partial

  2. Referee: [§3] The seven automatic metrics are introduced without derivation details, formal definitions, or error analysis. No sensitivity studies (e.g., robustness to explanation length, domain shift, or lexical variation) are reported, which is load-bearing for the assertion that these metrics reliably quantify the four narrative properties across the six datasets.

    Authors: We accept that §3 would be strengthened by additional formalization. In the revised version we will expand the section to supply (a) explicit mathematical definitions for each metric, (b) brief derivation notes linking each metric to its target narrative property, and (c) a new error-analysis subsection that includes sensitivity checks on explanation length, domain shift, and lexical variation performed on the existing six datasets. These additions address the referee’s concern without altering the core results. revision: yes
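
For concreteness, the length-robustness part of the promised error analysis could look like the following sketch, where metric is a hypothetical stand-in for any of the seven proposed metrics:

```python
# Hypothetical helper: metric is any callable scoring one explanation string;
# none of the seven proposed metrics is implemented here.
import numpy as np

def length_sensitivity(metric, explanations):
    # Correlate metric scores with explanation length; a strong correlation
    # would suggest the metric partly tracks length rather than narrativity.
    lengths = [len(e.split()) for e in explanations]
    scores = [metric(e) for e in explanations]
    return float(np.corrcoef(lengths, scores)[0, 1])
```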

Circularity Check

0 steps flagged

No circularity: claims grounded in external literature with independent metrics and benchmarks

full rationale

The paper draws its four narrative properties (continuous structure, cause-effect mechanisms, linguistic fluency, lexical diversity) from cited social-science and linguistics sources rather than defining them in terms of its own outputs or metrics. It then proposes seven new automatic metrics and generation rules as original contributions, benchmarks them against standard NLP baselines on six external datasets, and demonstrates separation of narrative vs. descriptive text without any self-referential fitting, renaming of parameters as predictions, or load-bearing self-citation chains. The derivation chain is validated against external benchmarks and does not reduce any core claim to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central argument rests on domain assumptions imported from social sciences and linguistics about effective explanations; no free parameters, invented entities, or ad-hoc axioms are introduced in the abstract.

axioms (1)
  • domain assumption: Narrative explanations support human understanding through continuous structure, cause-effect mechanisms, linguistic fluency, and lexical diversity.
    Stated as drawn from insights in social sciences and linguistics.

pith-pipeline@v0.9.0 · 5555 in / 1190 out tokens · 52082 ms · 2026-05-10T05:17:27.903994+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

88 extracted references · 24 canonical work pages · 3 internal anchors

  1. [1] Tim Miller. Explanation in artificial intelligence: Insights from the social sciences. Artificial Intelligence, 267:1–38, 2019.
  2. [2] David Martens, James Hinns, Camille Dams, Mark Vergouwen, and Theodoros Evgeniou. Tell me a story! Narrative-driven XAI with Large Language Models. Decision Support Systems, 191:114402, 2025. doi: 10.1016/j.dss.2025.114402.
  3. [3] Tim Miller, Piers Howe, and Liz Sonenberg. Explainable AI: Beware of inmates running the asylum or: How I learnt to stop worrying and love the social and behavioural sciences. arXiv preprint arXiv:1712.00547, 2017.
  4. [4] Michael F Dahlstrom. Using narratives and storytelling to communicate science with nonexpert audiences. Proceedings of the National Academy of Sciences, 111(supplement_4):13614–13620, 2014.
  5. [5] Mateusz Cedro and David Martens. GraphXAIN: Narratives to Explain Graph Neural Networks. In Explainable Artificial Intelligence, pages 91–114. Springer Nature Switzerland, 2025.
  6. [6] Alexandra Zytek, Sara Pido, Sarah Alnegheimish, Laure Berti-Equille, and Kalyan Veeramachaneni. Explingo: Explaining AI predictions using large language models. In 2024 IEEE International Conference on Big Data (BigData), pages 1197–1208. IEEE, 2024.
  7. [7] Dylan Slack, Satyapriya Krishna, Himabindu Lakkaraju, and Sameer Singh. Explaining machine learning models with interactive natural language conversations using TalkToModel. Nature Machine Intelligence, 5(8):873–883, 2023.
  8. [8] Michael Friedman. Explanation and scientific understanding. The Journal of Philosophy, 71(1):5–19, 1974.
  9. [9] Stephen P Norris, Sandra M Guilbert, Martha L Smith, Shahram Hakimelahi, and Linda M Phillips. A theoretical framework for narrative explanation in science. Science Education, 89(4):535–563, 2005.
  10. [10] Jerome S Bruner. Actual minds, possible worlds. Harvard University Press, 1986.
  11. [11] Scott M Lundberg and Su-In Lee. A unified approach to interpreting model predictions. Advances in Neural Information Processing Systems, 30, 2017.
  12. [12] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. "Why should I trust you?" Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1135–1144, 2016.
  13. [13] David Martens and Foster Provost. Explaining data-driven document classifications. MIS Quarterly, 38(1):73–100, 2014.
  14. [14] Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034, 2013.
  15. [15] Ramprasaath R. Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 618–626, 2017. doi: 10.1109/ICCV.2017.74.
  16. [16] Zhitao Ying, Dylan Bourgeois, Jiaxuan You, Marinka Zitnik, and Jure Leskovec. GNNExplainer: Generating explanations for graph neural networks. Advances in Neural Information Processing Systems, 32, 2019.
  17. [17] Erik Cambria, Lorenzo Malandri, Fabio Mercorio, Mario Mezzanzanica, and Navid Nobani. A survey on XAI and natural language explanations. Information Processing & Management, 60(1):103111, 2023.
  18. [18] Yang Li, Quan Pan, Suhang Wang, Tao Yang, and Erik Cambria. A generative model for category text generation. Information Sciences, 450:301–315, 2018. doi: 10.1016/j.ins.2018.03.050.
  19. [19] Bo Pan, Zhen Xiong, Guanchen Wu, Zheng Zhang, Yifei Zhang, Yuntong Hu, and Liang Zhao. GraphNarrator: Generating textual explanations for graph neural networks. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 23–42, 2025.
  20. [20] Fawaz Sammani and Nikos Deligiannis. Zero-shot natural language explanations. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=X6VVK8pIzZ.
  21. [21] Yifan He and David Martens. An agentic approach to generating XAI-Narratives. arXiv preprint arXiv:2603.20003, 2026.
  22. [22] Yangqiaoyu Zhou, Yiming Zhang, and Chenhao Tan. FLamE: Few-shot learning from natural language explanations. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6743–6763. Association for Computational Linguistics, 2023. doi: 10.18653/v1/2023.acl-long.372.
  23. [23] Ettore Mariotti, Jose M Alonso, and Albert Gatt. Towards harnessing natural language generation to explain black-box models. In 2nd Workshop on Interactive Natural Language Technology for Explainable Artificial Intelligence, pages 22–27, 2020.
  24. [24] Fred Jelinek, Robert L Mercer, Lalit R Bahl, and James K Baker. Perplexity—a measure of the difficulty of speech recognition tasks. The Journal of the Acoustical Society of America, 62(S1):S63–S63, 1977.
  25. [25] Timour Ichmoukhamedov, James Hinns, and David Martens. How good is my story? Towards quantitative metrics for evaluating LLM-generated XAI narratives. arXiv preprint arXiv:2412.10220, 2024.
  26. [26] Nasim Shirvani-Mahdavi and Chengkai Li. Rule2Text: A framework for generating and evaluating natural language explanations of knowledge graph rules. arXiv preprint arXiv:2508.10971, 2025.
  27. [27] Yequan Wang, Jiawen Deng, Aixin Sun, and Xuying Meng. Perplexity from PLM is unreliable for evaluating text quality. arXiv preprint arXiv:2210.05892, 2022.
  28. [28] Tatsunori B. Hashimoto, Hugh Zhang, and Percy Liang. Unifying human and statistical evaluation for natural language generation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1689–1701, Minneapolis, Minnesota, 2019. Association for Computational Linguistics. doi: 10.18653/v1/N19-1169.
  30. [30] Arun Tejasvi Chaganty, Stephen Mussman, and Percy Liang. The price of debiasing automatic metrics in natural language evaluation. arXiv preprint arXiv:1807.02202, 2018.
  31. [31] Petar Veličković, Federico Barbero, Christos Perivolaropoulos, Simon Osindero, and Razvan Pascanu. Perplexity cannot always tell right from wrong. arXiv preprint arXiv:2601.22950, 2026.
  32. [32] Lizhe Fang, Yifei Wang, Zhaoyang Liu, Chenheng Zhang, Stefanie Jegelka, Jinyang Gao, Bolin Ding, and Yisen Wang. What is wrong with perplexity for long-context language modeling? arXiv preprint arXiv:2410.23771, 2024.
  33. [33] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, 2002.
  34. [34] Satanjeev Banerjee and Alon Lavie. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65–72, 2005.
  35. [35] Chin-Yew Lin. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, 2004.
  36. [36] Thibault Sellam, Dipanjan Das, and Ankur Parikh. BLEURT: Learning robust metrics for text generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7881–7892, 2020.
  37. [37] Clara Meister and Ryan Cotterell. Language model evaluation beyond perplexity. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 5328–5339. Association for Computational Linguistics, 2021.
  38. [38] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, 2019.
  39. [39] Sule Tekkesinoglu and Lars Kunze. From feature importance to natural language explanations using LLMs with RAG. arXiv preprint arXiv:2407.20990, 2024.
  40. [40] Arthur C Graesser, Danielle S McNamara, Max M Louwerse, and Zhiqiang Cai. Coh-Metrix: Analysis of text on cohesion and language. Behavior Research Methods, Instruments, & Computers, 36(2):193–202, 2004.
  41. [41] Maarten Sap, Eric Horvitz, Yejin Choi, Noah A Smith, and James Pennebaker. Recollection versus imagination: Exploring human memory and cognition via neural language models. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1970–1978, 2020.
  42. [42] Arthur C. Graesser and Danielle S. McNamara. Computational analyses of multilevel discourse comprehension. Topics in Cognitive Science, 3(2):371–398, 2011. doi: 10.1111/j.1756-8765.2010.01081.x.
  43. [43] Jianlong Zhou, Amir H. Gandomi, Fang Chen, and Andreas Holzinger. Evaluating the quality of machine learning explanations: A survey on methods and metrics. Electronics, 10(5), 2021. doi: 10.3390/electronics10050593.
  44. [44] Wesley C Salmon. Scientific explanation and the causal structure of the world. 1984.
  45. [45] Wesley C Salmon. Four decades of scientific explanation. [Part 3] The second decade (1958–67): Manifest destiny–expansion and conflict. 1989.
  46. [46] John D Murray. Connectives and narrative text: The role of continuity. Memory & Cognition, 25(2):227–236, 1997.
  47. [47] Arthur C Graesser, Danielle S McNamara, and Jonna M Kulikowich. Coh-Metrix: Providing multilevel analyses of text characteristics. Educational Researcher, 40(5):223–234, 2011.
  48. [48] Mary Galbraith. Deictic shift theory and the poetics of involvement in narrative. In Deixis in Narrative, pages 19–59. Psychology Press, 1995.
  49. [49] Edward J O'Brien, Michelle L Rizzella, Jason E Albrecht, and Jennifer G Halleran. Updating a situation model: a memory-based text processing view. Journal of Experimental Psychology: Learning, Memory, and Cognition, 24(5):1200, 1998.
  50. [50] Herbert A Simon. Discovering explanations. Explanation and Cognition, pages 21–59, 2000.
  51. [51] Carl G Hempel and Paul Oppenheim. Studies in the logic of explanation. Philosophy of Science, 15(2):135–175, 1948.
  52. [52] Ronald T Kellogg et al. The relative ease of writing narrative text. 1991.
  53. [53] Laura M Saenz and Lynn S Fuchs. Examining the reading difficulty of secondary students with learning disabilities: Expository versus narrative text. Remedial and Special Education, 23(1):31–41, 2002.
  54. [54] Charles A Weaver III and Walter Kintsch. Expository text. 1991.
  55. [55] Claude E Shannon. Prediction and entropy of printed English. Bell System Technical Journal, 30(1):50–64, 1951.
  56. [56] Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. The curious case of neural text degeneration. arXiv preprint arXiv:1904.09751, 2019.
  57. [57] Debopam Das, Tatjana Scheffler, Peter Bourgonje, and Manfred Stede. Constructing a lexicon of English discourse connectives. In Proceedings of the 19th Annual SIGdial Meeting on Discourse and Dialogue, pages 360–365, 2018.
  58. [58] Marco Valentino and André Freitas. Scientific explanation and natural language: A unified epistemological-linguistic perspective for explainable AI. arXiv preprint arXiv:2205.01809, 2022.
  59. [59] Marco Valentino and André Freitas. On the nature of explanation: An epistemological-linguistic perspective for explanation-based natural language inference. Philosophy & Technology, 37(3):88, 2024.
  60. [60] Philip Kitcher. Explanatory unification. Philosophy of Science, 48(4):507–531, 1981. URL http://www.jstor.org/stable/186834.
  61. [61] Carl G Hempel et al. Aspects of scientific explanation, volume 965. Free Press, New York, 1965.
  62. [62] Rashmi Prasad, Bonnie Webber, Alan Lee, and Aravind Joshi. Penn Discourse Treebank version 3.0. LDC2019T05, 2019.
  63. [63] Edouard Grave, Armand Joulin, and Nicolas Usunier. Improving neural language models with a continuous cache. arXiv preprint arXiv:1612.04426, 2016.
  64. [64] Nayeon Lee, Wei Ping, Peng Xu, Mostofa Patwary, Pascale N Fung, Mohammad Shoeybi, and Bryan Catanzaro. Factuality enhanced language models for open-ended text generation. In Advances in Neural Information Processing Systems, volume 35, pages 34586–34599. Curran Associates, Inc., 2022.
  65. [65] Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and William B Dolan. A diversity-promoting objective function for neural conversation models. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 110–119, 2016.
  66. [66] Aorigele Bao and Yi Zeng. Understanding the dilemma of explainable artificial intelligence: a proposal for a ritual dialog framework. Humanities and Social Sciences Communications, 11(1):1–9, 2024.
  67. [67] Leo Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.
  68. [68] Marharyta Domnich, Julius Välja, Rasmus Moorits Veski, Giacomo Magnifico, Kadi Tulver, Eduard Barbu, and Raul Vicente. Towards unifying evaluation of counterfactual explanations: Leveraging large language models for human-centric assessments. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 16308–16316, 2025.
  69. [69] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
  70. [70] Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.
  71. [71] Janez Demšar. Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 7(Jan):1–30, 2006.
  72. [72] Alexey Tikhonov, Igor Samenko, and Ivan P. Yamshchikov. StoryDB: Broad multi-language narrative dataset. In Yang Gao, Steffen Eger, Wei Zhao, Piyawat Lertvittayakumjorn, and Marina Fomicheva, editors, Proceedings of the 2nd Workshop on Evaluation and Comparison of NLP Systems, pages 32–39, Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics.
  73. [73] Start the explanation immediately.
  74. [74] Limit the entire answer to exactly {sentence_limit} sentences.
  75. [75] Only mention the top {num_feat} most important features in the narrative.
  76. [76] Do not use tables or lists, or simply rattle through the features and/or nodes one by one. The goal is to have a narrative/story.
  77. [77] Be clear about what the model actually predicted for the {target_instance}.
  78. [78] Discuss how the features contributed to the final prediction. Make sure to clearly establish this the first time you refer to a feature.
  79. [79] Consider the feature importance, feature values, and averages when referencing their relative importance.
  80. [80] Begin the discussion of features by presenting those with the highest absolute feature importance values first. The reader should be able to tell what the order of importance of the features is based on their feature importance value.
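
Entries [73]-[80] above are the paper's XAI Narrative generation rules rather than bibliographic references. A minimal sketch of how such rules could be assembled into an LLM prompt; only the rule texts and their {…} placeholders come from the list, while the wrapper function and example values are invented:

```python
# A sketch only: the rule strings are quoted from entries [73]-[80]; the
# build_prompt() wrapper and the example values passed to it are invented.
RULES = [
    "Start the explanation immediately.",
    "Limit the entire answer to exactly {sentence_limit} sentences.",
    "Only mention the top {num_feat} most important features in the narrative.",
    "Do not use tables or lists, or simply rattle through the features one by one. The goal is to have a narrative/story.",
    "Be clear about what the model actually predicted for the {target_instance}.",
    # ... remaining content rules from the list above
]

def build_prompt(sentence_limit: int, num_feat: int, target_instance: str) -> str:
    # Render the rules as a bulleted instruction block and fill the
    # placeholders the rules themselves expose.
    body = "\n".join(f"- {rule}" for rule in RULES)
    return body.format(sentence_limit=sentence_limit,
                       num_feat=num_feat,
                       target_instance=target_instance)

print(build_prompt(sentence_limit=5, num_feat=3, target_instance="instance #17"))
```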

Showing first 80 references.